Re: CephFS no longer mounts and asserts in MDS after upgrade to 0.67.3

On Tue, Sep 10, 2013 at 2:36 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
> Hey Gregory,
>
> My cluster consists of 3 nodes, each running 1 mon, 1 osd and 1 mds.  I
> upgraded from 0.67, but was still running 0.61.7 OSDs at the time of the
> upgrade, because of performance-issues that have just recently been
> fixed.  These have now been upgraded to 0.67.3, along with the rest of
> Ceph.  My OSDs are using XFS as the underlying FS.  I have been
> switching one OSD in my cluster back and forth between 0.61.7 and some
> test-versions, which were based on 0.67.x, to debug the aforementioned
> performance-issues with Samuel, but that was before I newfs'ed and
> started using this instance of CephFS.  Furthermore, I don't seem to
> have lost any other data during these tests.

Sam, any idea how we could have lost an object? I checked into how we
touch this one, and all we ever do is read_full and write_full.
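
One quick sanity check here is to stat the object directly in the metadata
pool, to confirm it is actually gone rather than just unreadable. A minimal
sketch, assuming the default pool name "metadata" (as used elsewhere in this
thread) and the stock rados CLI:

    # Does the anchor table object still exist, and how large is it?
    rados -p metadata stat mds_anchortable

    # If it does exist, save a copy before poking at anything else:
    rados -p metadata get mds_anchortable /tmp/mds_anchortable.bin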

>
> BTW: CephFS has never been very stable for me during stress-tests.  If
> some components are brought down and back up again during operation
> (like stopping and restarting all components on one node while
> generating some load with a cp of a big CephFS directory-tree on
> another, then, once things settle again, doing the same on another
> node), it always quickly ends up like what I see now:

Do you have multiple active MDSes? Or do you just mean that when you do
a reboot while generating load, it migrates?

> MDSs crashing on start or on
> attempts to mount the CephFS, with the only way out being to stop the
> MDSs, wipe the contents of the "data" and "metadata" pools and do the
> newfs-thing.  I can only assume you guys are putting it through similar
> stress-tests, but if not, try it.

Yeah, we have a bunch of these. I'm not sure that we coordinate
killing an entire node at a time, but I can't think of any way that
would matter. :/
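
For reference, the whole-node test being described boils down to something
like the following. A rough sketch only, assuming sysvinit-managed daemons
(the default in this release series) and a CephFS mount at /mnt/cephfs;
the mount point and directory names are placeholders, not Oliver's actual
paths:

    # On node B: generate sustained load against CephFS.
    cp -a /mnt/cephfs/big-tree /mnt/cephfs/big-tree.copy

    # Meanwhile, on node A: stop every Ceph daemon on the node, wait a
    # bit, then bring them all back and let the cluster settle.
    service ceph stop
    sleep 60
    service ceph start

    # Once recovery completes, repeat the stop/start cycle on another node.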

> PS: Is there a way to get back at the data after something like this?
> Do you still want me to keep the current situation to debug it further,
> or can I zap everything, restore my backups and move on?  Thanks!

You could figure out how to build a fake anchortable (just generate an
empty one with ceph-dencoder) and that would let you do most stuff,
although if you have any hard links then those would be lost, and I'm
not sure exactly what that would mean at this point. It's possible,
with the new lookup-by-ino stuff, that it wouldn't matter at all, or it
might make them inaccessible from one link and un-deletable (via the
FS, that is) when removed from the other. If restoring from backups is
feasible, I'd probably just shoot for that after doing a scrub. (If the
scrub turns up something dirty, then it can probably be recovered via a
RADOS repair.)
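
To make the scrub-then-repair part concrete, a minimal sketch, assuming the
default metadata pool name "metadata" and that an empty table blob has
already been produced with ceph-dencoder as described above (that invocation
is not shown here):

    # Which PG should hold the missing object, and on which OSDs?
    ceph osd map metadata mds_anchortable

    # Deep-scrub that PG; if the scrub reports an inconsistency, try a repair.
    ceph pg deep-scrub <pgid>
    ceph pg repair <pgid>

    # Only if the object really is gone and losing hard-link info is
    # acceptable: inject the empty anchortable so the MDS can get past
    # the failed load.
    rados -p metadata put mds_anchortable /tmp/empty_anchortable
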
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

>    Regards,
>
>       Oliver
>
> On Tue, 2013-09-10 at 13:59 -0700, Gregory Farnum wrote:
>> It's not an upgrade issue. There's an MDS object that is somehow
>> missing. If it exists, then on restart you'll be fine.
>>
>> Oliver, what is your general cluster config? What filesystem are your
>> OSDs running on? What version of Ceph were you upgrading from? There's
>> really no way for this file to not exist once created, unless the
>> underlying FS ate it, or the last write was both interrupted and hit
>> some kind of bug in our transaction code (of which none are known)
>> during replay.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Tue, Sep 10, 2013 at 1:44 PM, Liu, Larry <Larry.Liu@xxxxxxxxxx> wrote:
>> > This is scary. Should I hold off on upgrading?
>> >
>> > On 9/10/13 11:33 AM, "Oliver Daudey" <oliver@xxxxxxxxx> wrote:
>> >
>> >>Hey Gregory,
>> >>
>> >>On 10-09-13 20:21, Gregory Farnum wrote:
>> >>> On Tue, Sep 10, 2013 at 10:54 AM, Oliver Daudey <oliver@xxxxxxxxx>
>> >>>wrote:
>> >>>> Hey list,
>> >>>>
>> >>>> I just upgraded to Ceph 0.67.3.  What I did on every node of my 3-node
>> >>>> cluster was:
>> >>>> - Unmount CephFS everywhere.
>> >>>> - Upgrade the Ceph-packages.
>> >>>> - Restart MON.
>> >>>> - Restart OSD.
>> >>>> - Restart MDS.
>> >>>>
>> >>>> As soon as I got to the second node, the MDS crashed right after
>> >>>> startup.
>> >>>>
>> >>>> Part of the logs (more on request):
>> >>>>
>> >>>> -> 194.109.43.12:6802/53419 -- osd_op(mds.0.58:4 mds_snaptable [read 0~0] 1.d90270ad e37647) v4 -- ?+0 0x1e48d80 con 0x1e5d9a0
>> >>>>    -11> 2013-09-10 19:35:02.798962 7fd1ba81f700  2 mds.0.58 boot_start 1: opening mds log
>> >>>>    -10> 2013-09-10 19:35:02.798968 7fd1ba81f700  5 mds.0.log open discovering log bounds
>> >>>>     -9> 2013-09-10 19:35:02.798988 7fd1ba81f700  1 mds.0.journaler(ro) recover start
>> >>>>     -8> 2013-09-10 19:35:02.798990 7fd1ba81f700  1 mds.0.journaler(ro) read_head
>> >>>>     -7> 2013-09-10 19:35:02.799028 7fd1ba81f700  1 -- 194.109.43.12:6800/67277 --> 194.109.43.11:6800/16562 -- osd_op(mds.0.58:5 200.00000000 [read 0~0] 1.844f3494 e37647) v4 -- ?+0 0x1e48b40 con 0x1e5db00
>> >>>>     -6> 2013-09-10 19:35:02.799053 7fd1ba81f700  1 -- 194.109.43.12:6800/67277 <== mon.2 194.109.43.13:6789/0 16 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (4235168662 0 0) 0x1e93380 con 0x1e5d580
>> >>>>     -5> 2013-09-10 19:35:02.799099 7fd1ba81f700 10 monclient: handle_subscribe_ack sent 2013-09-10 19:35:02.796448 renew after 2013-09-10 19:37:32.796448
>> >>>>     -4> 2013-09-10 19:35:02.800907 7fd1ba81f700  5 mds.0.58 ms_handle_connect on 194.109.43.12:6802/53419
>> >>>>     -3> 2013-09-10 19:35:02.800927 7fd1ba81f700  5 mds.0.58 ms_handle_connect on 194.109.43.13:6802/45791
>> >>>>     -2> 2013-09-10 19:35:02.801176 7fd1ba81f700  5 mds.0.58 ms_handle_connect on 194.109.43.11:6800/16562
>> >>>>     -1> 2013-09-10 19:35:02.803546 7fd1ba81f700  1 -- 194.109.43.12:6800/67277 <== osd.2 194.109.43.13:6802/45791 1 ==== osd_op_reply(3 mds_anchortable [read 0~0] ack = -2 (No such file or directory)) v4 ==== 114+0+0 (3107677671 0 0) 0x1e4de00 con 0x1e5ddc0
>> >>>>      0> 2013-09-10 19:35:02.805611 7fd1ba81f700 -1 mds/MDSTable.cc: In function 'void MDSTable::load_2(int, ceph::bufferlist&, Context*)' thread 7fd1ba81f700 time 2013-09-10 19:35:02.803673
>> >>>> mds/MDSTable.cc: 152: FAILED assert(r >= 0)
>> >>>>
>> >>>>  ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)
>> >>>>  1: (MDSTable::load_2(int, ceph::buffer::list&, Context*)+0x44f) [0x77ce7f]
>> >>>>  2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe3b) [0x7d891b]
>> >>>>  3: (MDS::handle_core_message(Message*)+0x987) [0x56f527]
>> >>>>  4: (MDS::_dispatch(Message*)+0x2f) [0x56f5ef]
>> >>>>  5: (MDS::ms_dispatch(Message*)+0x19b) [0x5710bb]
>> >>>>  6: (DispatchQueue::entry()+0x592) [0x92e432]
>> >>>>  7: (DispatchQueue::DispatchThread::entry()+0xd) [0x8a59bd]
>> >>>>  8: (()+0x68ca) [0x7fd1bed298ca]
>> >>>>  9: (clone()+0x6d) [0x7fd1bda5cb6d]
>> >>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>> >>>>
>> >>>> When trying to mount CephFS, it just hangs now.  Sometimes, an MDS stays
>> >>>> up for a while, but will eventually crash again.  This CephFS was
>> >>>> created on 0.67 and I haven't done anything but mount and use it under
>> >>>> very light load in the meantime.
>> >>>>
>> >>>> Any ideas, or if you need more info, let me know.  It would be nice to
>> >>>> get my data back, but I have backups too.
>> >>>
>> >>> Does the filesystem have any data in it? Every time we've seen this
>> >>> error it's been on an empty cluster which had some weird issue with
>> >>> startup.
>> >>
>> >>This one certainly had some data on it, yes.  A couple of hundred GBs
>> >>of disk-images and a couple of trees of smaller files.  Most of them
>> >>have been accessed very rarely since being copied on.
>> >>
>> >>
>> >>   Regards,
>> >>
>> >>      Oliver
>> >>_______________________________________________
>> >>ceph-users mailing list
>> >>ceph-users@xxxxxxxxxxxxxx
>> >>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




