Re: CephFS no longer mounts and asserts in MDS after upgrade to 0.67.3

"Yan, Zheng" <ukernel@xxxxxxxxx> · Wed, 11 Sep 2013 21:14:07 +0800

On Wed, Sep 11, 2013 at 9:12 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> On Wed, Sep 11, 2013 at 7:51 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
>> Hey Gregory,
>>
>> I wiped and re-created the MDS-cluster I just mailed about, starting out
>> by making sure CephFS is not mounted anywhere, stopping all MDSs,
>> completely cleaning the "data" and "metadata"-pools using "rados
>> --pool=<pool> cleanup <prefix>", then creating a new cluster using `ceph
>> mds newfs 1 0 --yes-i-really-mean-it' and starting all MDSs again.
>> Directly afterwards, I saw this:
>> # rados --pool=metadata ls
>> 1.00000000
>> 2.00000000
>> 200.00000000
>> 200.00000001
>> 600.00000000
>> 601.00000000
>> 602.00000000
>> 603.00000000
>> 605.00000000
>> 606.00000000
>> 608.00000000
>> 609.00000000
>> mds0_inotable
>> mds0_sessionmap
>>
>> Note the missing objects, right from the start.  I was able to mount the
>> CephFS at this point, but after unmounting it and restarting the
>> MDS-cluster, it failed to come up, with the same symptoms as before.  I
>> didn't place any files on CephFS at any point between newfs and failure.
>> Naturally, I tried initializing it again, but now, even after more than
>> 5 tries, the "mds*"-objects simply no longer show up in the
>> "metadata"-pool at all.  In fact, it remains empty.  I can mount CephFS
>> after the first start of the MDS-cluster after a newfs, but on restart,
>> it fails because of the missing objects.  Am I doing anything wrong
>> while initializing the cluster, maybe?  Is cleaning the pools and doing
>> the newfs enough?  I did the same on the other cluster yesterday and it
>> seems to have all objects.
>>
>
> Thank you for your default information.
>

s/default/detail

sorry for the typo.

> The objects are missing because the MDS IDs of the old FS and the new FS
> are the same (the incarnations are the same). When an OSD receives MDS
> requests for the newly created FS, it silently drops them because it
> thinks they are duplicates.  You can work around the bug by creating new
> pools for the newfs.
>
> Regards
> Yan, Zheng
>
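A minimal sketch of the workaround described above, assuming two fresh pools
are created and `ceph mds newfs` is then pointed at their IDs. The pool names
`metadata2` and `data2`, the pg_num of 64, and the helper name `plan_newfs`
are placeholders chosen for illustration, not values from this thread. The
function only prints the commands so they can be reviewed first; as in
Oliver's procedure, CephFS should be unmounted and all MDSs stopped before
newfs is run.

```shell
#!/bin/sh
# Print (do not run) the commands for moving CephFS onto newly created pools.
# All arguments are placeholders supplied by the caller.
plan_newfs() {
    meta_pool=$1
    data_pool=$2
    pg_num=$3
    echo "ceph osd pool create $meta_pool $pg_num"
    echo "ceph osd pool create $data_pool $pg_num"
    # The numeric pool IDs are only known once the pools exist.
    echo "ceph osd lspools"
    # newfs takes the metadata pool ID first, then the data pool ID.
    echo "ceph mds newfs <id of $meta_pool> <id of $data_pool> --yes-i-really-mean-it"
}

plan_newfs metadata2 data2 64
```

The two IDs on the newfs line have to be filled in by hand from the
`ceph osd lspools` output.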
>>
>>    Regards,
>>
>>      Oliver
>>
>> On di, 2013-09-10 at 16:24 -0700, Gregory Farnum wrote:
>>> Nope, a repair won't change anything if scrub doesn't detect any
>>> inconsistencies. There must be something else going on, but I can't
>>> fathom what...I'll try and look through it a bit more tomorrow. :/
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Tue, Sep 10, 2013 at 3:49 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
>>> > Hey Gregory,
>>> >
>>> > Thanks for your explanation.  Turns out to be 1.a7 and it seems to scrub
>>> > OK.
>>> >
>>> > # ceph osd getmap -o osdmap
>>> > # osdmaptool --test-map-object mds_anchortable --pool 1 osdmap
>>> > osdmaptool: osdmap file 'osdmap'
>>> >  object 'mds_anchortable' -> 1.a7 -> [2,0]
>>> > # ceph pg scrub 1.a7
>>> >
>>> > osd.2 logs:
>>> > 2013-09-11 00:41:15.843302 7faf56b1b700  0 log [INF] : 1.a7 scrub ok
>>> >
>>> > osd.0 didn't show anything in its logs, though.  Should I try a repair
>>> > next?
>>> >
>>> >
>>> >    Regards,
>>> >
>>> >       Oliver
>>> >
>>> > On di, 2013-09-10 at 15:01 -0700, Gregory Farnum wrote:
>>> >> If the problem is somewhere in RADOS/xfs/whatever, then there's a good
>>> >> chance that the "mds_anchortable" object exists in its replica OSDs,
>>> >> but when listing objects those aren't queried, so they won't show up
>>> >> in a listing. You can use the osdmaptool to map from an object name to
>>> >> the PG it would show up in, or if you look at your log you should see
>>> >> a line something like
>>> >> 1 -- <LOCAL IP> --> <OTHER IP> -- osd_op(mds.0.31:3 mds_anchortable
>>> >> [read 0~0] 1.a977f6a7 e165) v4 -- ?+0 0x1e88d80 con 0x1f189a0
>>> >> In this example, metadata is pool 1 and 1.a977f6a7 is the hash of the
>>> >> mds_anchortable object, and depending on how many PGs are in the pool
>>> >> it will be in pg 1.a7, or 1.6a7, or 1.f6a7...
>>> >> -Greg
>>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>> >>
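Greg's conversion from the key in the osd_op line to a PG can be sketched
with shell arithmetic. This assumes pg_num is a power of two, in which case
the PG is just the low bits of the hash; other pg_num values need Ceph's own
mapping, so `osdmaptool --test-map-object` remains the authoritative check.
The helper name `key_to_pg` is made up for this example.

```shell
#!/bin/sh
# Convert a "<pool>.<hash>" key, as printed in an osd_op log line, into a PG id.
# Assumes pg_num is a power of two, so masking with pg_num - 1 keeps the low bits.
key_to_pg() {
    key=$1
    pg_num=$2
    pool=${key%%.*}   # text before the first dot: the pool id
    hash=${key#*.}    # text after the first dot: the hex hash
    printf '%s.%x\n' "$pool" $(( 0x$hash & ($pg_num - 1) ))
}

key_to_pg 1.a977f6a7 256      # 1.a7
key_to_pg 1.a977f6a7 4096     # 1.6a7
key_to_pg 1.a977f6a7 65536    # 1.f6a7
```

The three calls reproduce the 1.a7, 1.6a7 and 1.f6a7 cases from the message
above.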
>>> >> On Tue, Sep 10, 2013 at 2:51 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
>>> >> > Hey Gregory,
>>> >> >
>>> >> > The only objects containing "table" I can find at all, are in the
>>> >> > "metadata"-pool:
>>> >> > # rados --pool=metadata ls | grep -i table
>>> >> > mds0_inotable
>>> >> >
>>> >> > Looking at another cluster where I use CephFS, there is indeed an object
>>> >> > named "mds_anchortable", but the broken cluster is missing it.  I don't
>>> >> > see how I can scrub the PG for an object that doesn't appear to exist.
>>> >> > Please elaborate.
>>> >> >
>>> >> >
>>> >> >    Regards,
>>> >> >
>>> >> >      Oliver
>>> >> >
>>> >> > On di, 2013-09-10 at 14:06 -0700, Gregory Farnum wrote:
>>> >> >> Also, can you scrub the PG which contains the "mds_anchortable" object
>>> >> >> and see if anything comes up? You should be able to find the key from
>>> >> >> the logs (in the osd_op line that contains "mds_anchortable") and
>>> >> >> convert that into the PG. Or you can just scrub all of osd 2.
>>> >> >> -Greg
>>> >> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>> >> >>
>>> >> >>
>>> >> >> On Tue, Sep 10, 2013 at 1:59 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>> >> >> > It's not an upgrade issue. There's an MDS object that is somehow
>>> >> >> > missing. If it exists, then on restart you'll be fine.
>>> >> >> >
>>> >> >> > Oliver, what is your general cluster config? What filesystem are your
>>> >> >> > OSDs running on? What version of Ceph were you upgrading from? There's
>>> >> >> > really no way for this file to not exist once created unless the
>>> >> >> > underlying FS ate it or the last write both was interrupted and hit
>>> >> >> > some kind of bug in our transaction code (of which none are known)
>>> >> >> > during replay.
>>> >> >> > -Greg
>>> >> >> > Software Engineer #42 @ http://inktank.com | http://ceph.com
>>> >> >> >
>>> >> >> >
>>> >> >> > On Tue, Sep 10, 2013 at 1:44 PM, Liu, Larry <Larry.Liu@xxxxxxxxxx> wrote:
>>> >> >> >> This is scary. Should I hold off on upgrading?
>>> >> >> >>
>>> >> >> >> On 9/10/13 11:33 AM, "Oliver Daudey" <oliver@xxxxxxxxx> wrote:
>>> >> >> >>
>>> >> >> >>>Hey Gregory,
>>> >> >> >>>
>>> >> >> >>>On 10-09-13 20:21, Gregory Farnum wrote:
>>> >> >> >>>> On Tue, Sep 10, 2013 at 10:54 AM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
>>> >> >> >>>>> Hey list,
>>> >> >> >>>>>
>>> >> >> >>>>> I just upgraded to Ceph 0.67.3.  What I did on every node of my 3-node
>>> >> >> >>>>> cluster was:
>>> >> >> >>>>> - Unmount CephFS everywhere.
>>> >> >> >>>>> - Upgrade the Ceph-packages.
>>> >> >> >>>>> - Restart MON.
>>> >> >> >>>>> - Restart OSD.
>>> >> >> >>>>> - Restart MDS.
>>> >> >> >>>>>
>>> >> >> >>>>> As soon as I got to the second node, the MDS crashed right after startup.
>>> >> >> >>>>>
>>> >> >> >>>>> Part of the logs (more on request):
>>> >> >> >>>>>
>>> >> >> >>>>> -> 194.109.43.12:6802/53419 -- osd_op(mds.0.58:4 mds_snaptable [read 0~0] 1.d90270ad e37647) v4 -- ?+0 0x1e48d80 con 0x1e5d9a0
>>> >> >> >>>>>    -11> 2013-09-10 19:35:02.798962 7fd1ba81f700  2 mds.0.58 boot_start 1: opening mds log
>>> >> >> >>>>>    -10> 2013-09-10 19:35:02.798968 7fd1ba81f700  5 mds.0.log open discovering log bounds
>>> >> >> >>>>>     -9> 2013-09-10 19:35:02.798988 7fd1ba81f700  1 mds.0.journaler(ro) recover start
>>> >> >> >>>>>     -8> 2013-09-10 19:35:02.798990 7fd1ba81f700  1 mds.0.journaler(ro) read_head
>>> >> >> >>>>>     -7> 2013-09-10 19:35:02.799028 7fd1ba81f700  1 -- 194.109.43.12:6800/67277 --> 194.109.43.11:6800/16562 -- osd_op(mds.0.58:5 200.00000000 [read 0~0] 1.844f3494 e37647) v4 -- ?+0 0x1e48b40 con 0x1e5db00
>>> >> >> >>>>>     -6> 2013-09-10 19:35:02.799053 7fd1ba81f700  1 -- 194.109.43.12:6800/67277 <== mon.2 194.109.43.13:6789/0 16 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (4235168662 0 0) 0x1e93380 con 0x1e5d580
>>> >> >> >>>>>     -5> 2013-09-10 19:35:02.799099 7fd1ba81f700 10 monclient: handle_subscribe_ack sent 2013-09-10 19:35:02.796448 renew after 2013-09-10 19:37:32.796448
>>> >> >> >>>>>     -4> 2013-09-10 19:35:02.800907 7fd1ba81f700  5 mds.0.58 ms_handle_connect on 194.109.43.12:6802/53419
>>> >> >> >>>>>     -3> 2013-09-10 19:35:02.800927 7fd1ba81f700  5 mds.0.58 ms_handle_connect on 194.109.43.13:6802/45791
>>> >> >> >>>>>     -2> 2013-09-10 19:35:02.801176 7fd1ba81f700  5 mds.0.58 ms_handle_connect on 194.109.43.11:6800/16562
>>> >> >> >>>>>     -1> 2013-09-10 19:35:02.803546 7fd1ba81f700  1 -- 194.109.43.12:6800/67277 <== osd.2 194.109.43.13:6802/45791 1 ==== osd_op_reply(3 mds_anchortable [read 0~0] ack = -2 (No such file or directory)) v4 ==== 114+0+0 (3107677671 0 0) 0x1e4de00 con 0x1e5ddc0
>>> >> >> >>>>>      0> 2013-09-10 19:35:02.805611 7fd1ba81f700 -1 mds/MDSTable.cc: In function 'void MDSTable::load_2(int, ceph::bufferlist&, Context*)' thread 7fd1ba81f700 time 2013-09-10 19:35:02.803673
>>> >> >> >>>>> mds/MDSTable.cc: 152: FAILED assert(r >= 0)
>>> >> >> >>>>>
>>> >> >> >>>>>  ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)
>>> >> >> >>>>>  1: (MDSTable::load_2(int, ceph::buffer::list&, Context*)+0x44f) [0x77ce7f]
>>> >> >> >>>>>  2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe3b) [0x7d891b]
>>> >> >> >>>>>  3: (MDS::handle_core_message(Message*)+0x987) [0x56f527]
>>> >> >> >>>>>  4: (MDS::_dispatch(Message*)+0x2f) [0x56f5ef]
>>> >> >> >>>>>  5: (MDS::ms_dispatch(Message*)+0x19b) [0x5710bb]
>>> >> >> >>>>>  6: (DispatchQueue::entry()+0x592) [0x92e432]
>>> >> >> >>>>>  7: (DispatchQueue::DispatchThread::entry()+0xd) [0x8a59bd]
>>> >> >> >>>>>  8: (()+0x68ca) [0x7fd1bed298ca]
>>> >> >> >>>>>  9: (clone()+0x6d) [0x7fd1bda5cb6d]
>>> >> >> >>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>> >> >> >>>>> needed to interpret this.
>>> >> >> >>>>>
>>> >> >> >>>>> When trying to mount CephFS, it just hangs now.  Sometimes, an MDS
>>> >> >> >>>>> stays up for a while, but will eventually crash again.  This CephFS
>>> >> >> >>>>> was created on 0.67 and I haven't done anything but mount and use it
>>> >> >> >>>>> under very light load in the meantime.
>>> >> >> >>>>>
>>> >> >> >>>>> Any ideas, or if you need more info, let me know.  It would be nice to
>>> >> >> >>>>> get my data back, but I have backups too.
>>> >> >> >>>>
>>> >> >> >>>> Does the filesystem have any data in it? Every time we've seen this
>>> >> >> >>>> error it's been on an empty cluster which had some weird issue with
>>> >> >> >>>> startup.
>>> >> >> >>>
>>> >> >> >>>This one certainly had some data on it, yes.  A couple of 100's of GBs
>>> >> >> >>>of disk-images and a couple of trees of smaller files.  Most of them
>>> >> >> >>>accessed very rarely since being copied on.
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> >>>   Regards,
>>> >> >> >>>
>>> >> >> >>>      Oliver
>>> >> >> >>>_______________________________________________
>>> >> >> >>>ceph-users mailing list
>>> >> >> >>>ceph-users@xxxxxxxxxxxxxx
>>> >> >> >>>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com