Re: CephFS no longer mounts and asserts in MDS after upgrade to 0.67.3

Hey Gregory,

FYI: I just attempted to upgrade a second cluster where CephFS was in
use and got this:
   -24> 2013-09-11 12:00:36.674469 7f0de1438700  1 --
194.109.43.76:6800/8335 --> 194.109.43.73:6789/0 --
mon_subscribe({mdsmap=525898+,monmap=26+,osdmap=529384}) v2 -- ?+0
0x35b0700 con 0x35a9580
   -23> 2013-09-11 12:00:36.674487 7f0de1438700  1 --
194.109.43.76:6800/8335 mark_down 194.109.43.72:6800/2439 -- pipe dne
   -22> 2013-09-11 12:00:36.674492 7f0de1438700  5 mds.0.28
handle_mds_failure for myself; not doing anything
   -21> 2013-09-11 12:00:36.677244 7f0de1438700  1 --
194.109.43.76:6800/8335 <== mon.2 194.109.43.73:6789/0 25 ====
osd_map(529384..529385 src has 528815..529385) v3 ==== 592+0+0
(3556993101 0 0) 0x35d9240 con 0x35a9580
   -20> 2013-09-11 12:00:36.677334 7f0de1438700  2 mds.0.28 boot_start
1: opening inotable
   -19> 2013-09-11 12:00:36.677516 7f0de1438700  1 --
194.109.43.76:6800/8335 --> 194.109.43.75:6802/16460 --
osd_op(mds.0.28:1 mds0_inotable [read 0~0] 1.b852b893 e529385) v4 -- ?+0
0x35d96c0 con 0x35a99a0
   -18> 2013-09-11 12:00:36.677543 7f0de1438700  2 mds.0.28 boot_start
1: opening sessionmap
   -17> 2013-09-11 12:00:36.677717 7f0de1438700  1 --
194.109.43.76:6800/8335 --> 194.109.43.72:6802/25951 --
osd_op(mds.0.28:2 mds0_sessionmap [read 0~0] 1.3270c60b e529385) v4
-- ?+0 0x35d9480 con 0x35a9dc0
   -16> 2013-09-11 12:00:36.677734 7f0de1438700  2 mds.0.28 boot_start
1: opening anchor table
   -15> 2013-09-11 12:00:36.677749 7f0de1438700  1 --
194.109.43.76:6800/8335 --> 194.109.43.75:6802/16460 --
osd_op(mds.0.28:3 mds_anchortable [read 0~0] 1.a977f6a7 e529385) v4
-- ?+0 0x35d9000 con 0x35a99a0
   -14> 2013-09-11 12:00:36.677759 7f0de1438700  2 mds.0.28 boot_start
1: opening snap table
   -13> 2013-09-11 12:00:36.677912 7f0de1438700  1 --
194.109.43.76:6800/8335 --> 194.109.43.76:6804/27840 --
osd_op(mds.0.28:4 mds_snaptable [read 0~0] 1.d90270ad e529385) v4 -- ?+0
0x358ad80 con 0x35a9c60
   -12> 2013-09-11 12:00:36.677938 7f0de1438700  2 mds.0.28 boot_start
1: opening mds log
   -11> 2013-09-11 12:00:36.677966 7f0de1438700  5 mds.0.log open
discovering log bounds
   -10> 2013-09-11 12:00:36.677984 7f0de1438700  1 mds.0.journaler(ro)
recover start
    -9> 2013-09-11 12:00:36.677993 7f0de1438700  1 mds.0.journaler(ro)
read_head
    -8> 2013-09-11 12:00:36.678080 7f0de1438700  1 --
194.109.43.76:6800/8335 --> 194.109.43.71:6800/18625 --
osd_op(mds.0.28:5 200.00000000 [read 0~0] 1.844f3494 e529385) v4 -- ?+0
0x358a6c0 con 0x35ee420
    -7> 2013-09-11 12:00:36.678111 7f0de1438700  1 --
194.109.43.76:6800/8335 <== mon.2 194.109.43.73:6789/0 26 ====
mon_subscribe_ack(300s) v1 ==== 20+0+0 (3657348766 0 0) 0x35b0540 con
0x35a9580
    -6> 2013-09-11 12:00:36.678122 7f0de1438700 10 monclient:
handle_subscribe_ack sent 2013-09-11 12:00:36.674463 renew after
2013-09-11 12:03:06.674463
    -5> 2013-09-11 12:00:36.678903 7f0de1438700  5 mds.0.28
ms_handle_connect on 194.109.43.72:6802/25951
    -4> 2013-09-11 12:00:36.679271 7f0de1438700  5 mds.0.28
ms_handle_connect on 194.109.43.71:6800/18625
    -3> 2013-09-11 12:00:36.679771 7f0de1438700  5 mds.0.28
ms_handle_connect on 194.109.43.75:6802/16460
    -2> 2013-09-11 12:00:36.680424 7f0de1438700  5 mds.0.28
ms_handle_connect on 194.109.43.76:6804/27840
    -1> 2013-09-11 12:00:36.697679 7f0de1438700  1 --
194.109.43.76:6800/8335 <== osd.5 194.109.43.76:6804/27840 1 ====
osd_op_reply(4 mds_snaptable [read 0~0] ack = -2 (No such file or
directory)) v4 ==== 112+0+0 (79439328 0 0) 0x35d0400 con 0x35a9c60
     0> 2013-09-11 12:00:36.699682 7f0de1438700 -1 mds/MDSTable.cc: In
function 'void MDSTable::load_2(int, ceph::bufferlist&, Context*)'
thread 7f0de1438700 time 2013-09-11 12:00:36.697748
mds/MDSTable.cc: 152: FAILED assert(r >= 0)

 ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)
 1: (MDSTable::load_2(int, ceph::buffer::list&, Context*)+0x44f)
[0x77ce7f]
 2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe3b) [0x7d891b]
 3: (MDS::handle_core_message(Message*)+0x987) [0x56f527]
 4: (MDS::_dispatch(Message*)+0x2f) [0x56f5ef]
 5: (MDS::ms_dispatch(Message*)+0x19b) [0x5710bb]
 6: (DispatchQueue::entry()+0x592) [0x92e432]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x8a59bd]
 8: (()+0x68ca) [0x7f0de59428ca]
 9: (clone()+0x6d) [0x7f0de4675b6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

So, this time the missing object was "mds_snaptable".  I know it was
already missing before the upgrade and restart, because I checked these
objects on this cluster yesterday, as a comparison against the cluster
we tried to debug then, and it was missing at that point too.  The
object must therefore have gone missing earlier, and the problem only
shows itself now, on restart.  This CephFS was created on 0.67.2 and I
haven't done anything noteworthy to the rest of the cluster in the
meantime.  There have been two node failures without data loss
recently, though, so again, restarts of some or all services on several
nodes happened before the MDS cluster started missing objects.

This cluster has six nodes, each running all services, and is set to
keep 3 copies of the data in all pools.  The MDS is configured for
active/standby, just like on yesterday's cluster.  The symptoms appear
to be the same.
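
For completeness, the pool replica count and the MDS active/standby
state can be double-checked with standard commands ("metadata" is the
CephFS metadata pool, pool 1 in the osdmaptool output below):

# ceph osd pool get metadata size
# ceph mds stat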

I scrubbed the PG for this object and, not surprisingly, it turned out
to be OK:
# ceph osd getmap -o osdmap
got osdmap epoch 529423
# osdmaptool --test-map-object mds_snaptable --pool 1 osdmap
osdmaptool: osdmap file 'osdmap'
 object 'mds_snaptable' -> 1.ad -> [5,1,3]
# ceph pg scrub 1.ad
instructing pg 1.ad on osd.5 to scrub

osd.5 says:
2013-09-11 12:15:24.946398 7f9bb8569700  0 log [INF] : 1.ad scrub ok
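
As a sanity check on the mapping itself: following Greg's explanation
further down in this thread, the PG can also be read straight off the
hash in the osd_op line above (1.d90270ad).  If pg_num is a power of
two, the PG is just the low bits of that hash; with an assumed pg_num
of 256 that works out to the same 1.ad (for a non-power-of-two pg_num
Ceph folds the hash a bit differently, so the shortcut doesn't apply):

# ceph osd pool get metadata pg_num
# printf '1.%x\n' $(( 0xd90270ad & (256 - 1) ))
1.ad

The object itself can also be stat'ed directly, which should return
ENOENT here, matching the -2 in the osd_op_reply above:

# rados --pool=metadata stat mds_snaptable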

I'm going to do the zap-and-restore-from-backups routine again, roughly
as sketched below, and repeat some stress tests on my test cluster, to
see if I can reproduce it for you.
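
For reference, a rough sketch of what that comes down to here, assuming
the filesystem is recreated on the existing pools with "ceph mds newfs"
(which discards all CephFS metadata, so it's only an option because I
have backups; I believe the argument order is metadata pool id first,
then data pool id):

(stop all ceph-mds daemons first)
# ceph osd dump | grep '^pool'
# ceph mds newfs <metadata pool id> <data pool id> --yes-i-really-mean-it
(restart the MDS daemons, remount CephFS and restore, e.g.:)
# rsync -a /backup/cephfs/ /mnt/cephfs/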


   Regards,

      Oliver

On di, 2013-09-10 at 16:24 -0700, Gregory Farnum wrote:
> Nope, a repair won't change anything if scrub doesn't detect any
> inconsistencies. There must be something else going on, but I can't
> fathom what...I'll try and look through it a bit more tomorrow. :/
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> 
> On Tue, Sep 10, 2013 at 3:49 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
> > Hey Gregory,
> >
> > Thanks for your explanation.  Turns out to be 1.a7 and it seems to scrub
> > OK.
> >
> > # ceph osd getmap -o osdmap
> > # osdmaptool --test-map-object mds_anchortable --pool 1 osdmap
> > osdmaptool: osdmap file 'osdmap'
> >  object 'mds_anchortable' -> 1.a7 -> [2,0]
> > # ceph pg scrub 1.a7
> >
> > osd.2 logs:
> > 2013-09-11 00:41:15.843302 7faf56b1b700  0 log [INF] : 1.a7 scrub ok
> >
> > osd.0 didn't show anything in its logs, though.  Should I try a repair
> > next?
> >
> >
> >    Regards,
> >
> >       Oliver
> >
> > On di, 2013-09-10 at 15:01 -0700, Gregory Farnum wrote:
> >> If the problem is somewhere in RADOS/xfs/whatever, then there's a good
> >> chance that the "mds_anchortable" object exists in its replica OSDs,
> >> but when listing objects those aren't queried, so they won't show up
> >> in a listing. You can use the osdmaptool to map from an object name to
> >> the PG it would show up in, or if you look at your log you should see
> >> a line something like
> >> 1 -- <LOCAL IP> --> <OTHER IP> -- osd_op(mds.0.31:3 mds_anchortable
> >> [read 0~0] 1.a977f6a7 e165) v4 -- ?+0 0x1e88d80 con 0x1f189a0
> >> In this example, metadata is pool 1 and 1.a977f6a7 is the hash of the
> >> mds_anchortable object, and depending on how many PGs are in the pool
> >> it will be in pg 1.a7, or 1.6a7, or 1.f6a7...
> >> -Greg
> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >>
> >> On Tue, Sep 10, 2013 at 2:51 PM, Oliver Daudey <oliver@xxxxxxxxx> wrote:
> >> > Hey Gregory,
> >> >
> >> > The only objects containing "table" I can find at all, are in the
> >> > "metadata"-pool:
> >> > # rados --pool=metadata ls | grep -i table
> >> > mds0_inotable
> >> >
> >> > Looking at another cluster where I use CephFS, there is indeed an object
> >> > named "mds_anchortable", but the broken cluster is missing it.  I don't
> >> > see how I can scrub the PG for an object that doesn't appear to exist.
> >> > Please elaborate.
> >> >
> >> >
> >> >    Regards,
> >> >
> >> >      Oliver
> >> >
> >> > On di, 2013-09-10 at 14:06 -0700, Gregory Farnum wrote:
> >> >> Also, can you scrub the PG which contains the "mds_anchortable" object
> >> >> and see if anything comes up? You should be able to find the key from
> >> >> the logs (in the osd_op line that contains "mds_anchortable") and
> >> >> convert that into the PG. Or you can just scrub all of osd 2.
> >> >> -Greg
> >> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >> >>
> >> >>
> >> >> On Tue, Sep 10, 2013 at 1:59 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> >> >> > It's not an upgrade issue. There's an MDS object that is somehow
> >> >> > missing. If it exists, then on restart you'll be fine.
> >> >> >
> >> >> > Oliver, what is your general cluster config? What filesystem are your
> >> >> > OSDs running on? What version of Ceph were you upgrading from? There's
> >> >> > really no way for this file to not exist once created unless the
> >> >> > underlying FS ate it or the last write both was interrupted and hit
> >> >> > some kind of bug in our transaction code (of which none are known)
> >> >> > during replay.
> >> >> > -Greg
> >> >> > Software Engineer #42 @ http://inktank.com | http://ceph.com
> >> >> >
> >> >> >
> >> >> > On Tue, Sep 10, 2013 at 1:44 PM, Liu, Larry <Larry.Liu@xxxxxxxxxx> wrote:
> >> >> >> This is scary.  Should I hold off on upgrading?
> >> >> >>
> >> >> >> On 9/10/13 11:33 AM, "Oliver Daudey" <oliver@xxxxxxxxx> wrote:
> >> >> >>
> >> >> >>>Hey Gregory,
> >> >> >>>
> >> >> >>>On 10-09-13 20:21, Gregory Farnum wrote:
> >> >> >>>> On Tue, Sep 10, 2013 at 10:54 AM, Oliver Daudey <oliver@xxxxxxxxx>
> >> >> >>>>wrote:
> >> >> >>>>> Hey list,
> >> >> >>>>>
> >> >> >>>>> I just upgraded to Ceph 0.67.3.  What I did on every node of my 3-node
> >> >> >>>>> cluster was:
> >> >> >>>>> - Unmount CephFS everywhere.
> >> >> >>>>> - Upgrade the Ceph-packages.
> >> >> >>>>> - Restart MON.
> >> >> >>>>> - Restart OSD.
> >> >> >>>>> - Restart MDS.
> >> >> >>>>>
> >> >> >>>>> As soon as I got to the second node, the MDS crashed right after
> >> >> >>>>>startup.
> >> >> >>>>>
> >> >> >>>>> Part of the logs (more on request):
> >> >> >>>>>
> >> >> >>>>> -> 194.109.43.12:6802/53419 -- osd_op(mds.0.58:4 mds_snaptable [read
> >> >> >>>>> 0~0] 1.d902
> >> >> >>>>> 70ad e37647) v4 -- ?+0 0x1e48d80 con 0x1e5d9a0
> >> >> >>>>>    -11> 2013-09-10 19:35:02.798962 7fd1ba81f700  2 mds.0.58 boot_start
> >> >> >>>>> 1: openin
> >> >> >>>>> g mds log
> >> >> >>>>>    -10> 2013-09-10 19:35:02.798968 7fd1ba81f700  5 mds.0.log open
> >> >> >>>>> discovering lo
> >> >> >>>>> g bounds
> >> >> >>>>>     -9> 2013-09-10 19:35:02.798988 7fd1ba81f700  1 mds.0.journaler(ro)
> >> >> >>>>> recover s
> >> >> >>>>> tart
> >> >> >>>>>     -8> 2013-09-10 19:35:02.798990 7fd1ba81f700  1 mds.0.journaler(ro)
> >> >> >>>>> read_head
> >> >> >>>>>     -7> 2013-09-10 19:35:02.799028 7fd1ba81f700  1 --
> >> >> >>>>> 194.109.43.12:6800/67277 -
> >> >> >>>>> -> 194.109.43.11:6800/16562 -- osd_op(mds.0.58:5 200.00000000 [read
> >> >> >>>>>0~0]
> >> >> >>>>> 1.844f3
> >> >> >>>>> 494 e37647) v4 -- ?+0 0x1e48b40 con 0x1e5db00
> >> >> >>>>>     -6> 2013-09-10 19:35:02.799053 7fd1ba81f700  1 --
> >> >> >>>>> 194.109.43.12:6800/67277 <
> >> >> >>>>> == mon.2 194.109.43.13:6789/0 16 ==== mon_subscribe_ack(300s) v1 ====
> >> >> >>>>> 20+0+0 (42
> >> >> >>>>> 35168662 0 0) 0x1e93380 con 0x1e5d580
> >> >> >>>>>     -5> 2013-09-10 19:35:02.799099 7fd1ba81f700 10 monclient:
> >> >> >>>>> handle_subscribe_a
> >> >> >>>>> ck sent 2013-09-10 19:35:02.796448 renew after 2013-09-10
> >> >> >>>>>19:37:32.796448
> >> >> >>>>>     -4> 2013-09-10 19:35:02.800907 7fd1ba81f700  5 mds.0.58
> >> >> >>>>> ms_handle_connect on
> >> >> >>>>>  194.109.43.12:6802/53419
> >> >> >>>>>     -3> 2013-09-10 19:35:02.800927 7fd1ba81f700  5 mds.0.58
> >> >> >>>>> ms_handle_connect on
> >> >> >>>>>  194.109.43.13:6802/45791
> >> >> >>>>>     -2> 2013-09-10 19:35:02.801176 7fd1ba81f700  5 mds.0.58
> >> >> >>>>> ms_handle_connect on
> >> >> >>>>>  194.109.43.11:6800/16562
> >> >> >>>>>     -1> 2013-09-10 19:35:02.803546 7fd1ba81f700  1 --
> >> >> >>>>> 194.109.43.12:6800/67277 <
> >> >> >>>>> == osd.2 194.109.43.13:6802/45791 1 ==== osd_op_reply(3 mds_anchortable
> >> >> >>>>> [read 0~
> >> >> >>>>> 0] ack = -2 (No such file or directory)) v4 ==== 114+0+0 (3107677671 0
> >> >> >>>>> 0) 0x1e4d
> >> >> >>>>> e00 con 0x1e5ddc0
> >> >> >>>>>      0> 2013-09-10 19:35:02.805611 7fd1ba81f700 -1 mds/MDSTable.cc: In
> >> >> >>>>> function
> >> >> >>>>> 'void MDSTable::load_2(int, ceph::bufferlist&, Context*)' thread
> >> >> >>>>> 7fd1ba81f700 ti
> >> >> >>>>> me 2013-09-10 19:35:02.803673
> >> >> >>>>> mds/MDSTable.cc: 152: FAILED assert(r >= 0)
> >> >> >>>>>
> >> >> >>>>>  ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)
> >> >> >>>>>  1: (MDSTable::load_2(int, ceph::buffer::list&, Context*)+0x44f)
> >> >> >>>>>[0x77ce7f]
> >> >> >>>>>  2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe3b) [0x7d891b]
> >> >> >>>>>  3: (MDS::handle_core_message(Message*)+0x987) [0x56f527]
> >> >> >>>>>  4: (MDS::_dispatch(Message*)+0x2f) [0x56f5ef]
> >> >> >>>>>  5: (MDS::ms_dispatch(Message*)+0x19b) [0x5710bb]
> >> >> >>>>>  6: (DispatchQueue::entry()+0x592) [0x92e432]
> >> >> >>>>>  7: (DispatchQueue::DispatchThread::entry()+0xd) [0x8a59bd]
> >> >> >>>>>  8: (()+0x68ca) [0x7fd1bed298ca]
> >> >> >>>>>  9: (clone()+0x6d) [0x7fd1bda5cb6d]
> >> >> >>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> >> >> >>>>> needed to interpret this.
> >> >> >>>>>
> >> >> >>>>> When trying to mount CephFS, it just hangs now.  Sometimes, an MDS
> >> >> >>>>>stays
> >> >> >>>>> up for a while, but will eventually crash again.  This CephFS was
> >> >> >>>>> created on 0.67 and I haven't done anything but mount and use it under
> >> >> >>>>> very light load in the meantime.
> >> >> >>>>>
> >> >> >>>>> Any ideas, or if you need more info, let me know.  It would be nice to
> >> >> >>>>> get my data back, but I have backups too.
> >> >> >>>>
> >> >> >>>> Does the filesystem have any data in it? Every time we've seen this
> >> >> >>>> error it's been on an empty cluster which had some weird issue with
> >> >> >>>> startup.
> >> >> >>>
> >> >> >>>This one certainly had some data on it, yes.  A few hundred GBs
> >> >> >>>of disk-images and a couple of trees of smaller files.  Most of them
> >> >> >>>accessed very rarely since being copied on.
> >> >> >>>
> >> >> >>>
> >> >> >>>   Regards,
> >> >> >>>
> >> >> >>>      Oliver
> >> >> >>>_______________________________________________
> >> >> >>>ceph-users mailing list
> >> >> >>>ceph-users@xxxxxxxxxxxxxx
> >> >> >>>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> >>
> >> >>
> >> >
> >> >
> >>
> >
> >
> 


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



