Re: "umount" of ceph filesystem that has become unavailable hangs forever

On Tue, 27 Jul 2010, Anton VG wrote:
> Sage, it looks logical that if the user issues "umount -l" the code
> should give up syncing and clear the state. Or possibly there should
> be a /proc/...whatever or /sys/...whatever setting to define a default
> timeout after which to give up syncing.

Yeah, I suspect blanket timeouts are going to be the only way to 
really resolve this.  I played around with it a bit yesterday, and the 
problem is that even if I make the ceph sync_fs hooks time out (or be 
killable via SIGKILL), a 'sync' still hangs in the generic VFS code when 
it tries to write out dirty inodes.  

I think a 'soft' mount option that allows any server operation to time out 
is the way to go.  Currently we behave like nfs's 'hard':

       soft           If an NFS file operation has a major timeout then report
                      an I/O error to the calling program.  The default is  to
                      continue retrying NFS file operations indefinitely.

       hard           If an NFS file operation has a major timeout then report
                      "server not responding"  on  the  console  and  continue
                      retrying indefinitely.

This is http://tracker.newdream.net/issues/206.
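
If such an option lands, usage would presumably look something like the
following (the 'soft' option here is only the proposal from the tracker
issue above, not an existing mount option; the monitor address is just
an example):

    mount -t ceph 192.168.0.3:6789:/ /mnt -o soft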

sage


> 
> 2010/7/24 Sébastien Paolacci <sebastien.paolacci@xxxxxxxxx>:
> > Hello Sage,
> >
> > I was just trying to revive an old thread, but I definitely agree that
> > I didn't make my point clear enough, sorry for that.
> >
> > The general idea is that whatever happens server-side, the client
> > should be left in a clean state. By clean I mean that, apart from the
> > data explicitly pushed to (or pulled from) the tested ceph share, no
> > other side effect of the test session should be visible.
> >
> > The real issue with hung unmounts is obviously not the console being
> > frozen but all the subsequent syncs that are going to follow the same
> > path (and syncs do happen in real-life scenarios, e.g. when softly
> > halting/restarting a box).
> >
> > Explicitly aborting the sync (by whatever means) is indeed an
> > attractive option that would almost solve the problem without straying
> > too far from decent, safe sync behavior.
> >
> > As a matter of convenience, though, should I have a few hundred nodes
> > to restart, I would expect the sync to abort automatically once a
> > delay that I take responsibility for has expired and the kclient is
> > still stubbornly waiting on a ceph cluster that has tragically died.
> >
> > So let's go back to a concrete failure case that can bother a client
> > box ;) (commands sketched below):
> >  - a fresh, newly formatted ceph instance is started.
> >  - the share is mounted on a separate box and one single file is
> > created (touch /mnt/test).
> >  - the ceph daemons are hard-killed (pkill -9 on cosd, cmds, cmon) and
> > the share is unmounted.
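> >
> > In shell terms, roughly (mount point and monitor address taken from
> > the logs below; the exact invocations are only a sketch):
> >
> >     mount -t ceph 192.168.0.3:6789:/ /mnt       # on the client box
> >     touch /mnt/test                             # create a single file
> >     pkill -9 cosd; pkill -9 cmds; pkill -9 cmon # on the server box
> >     umount /mnt                                 # on the client: this hangs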
> >
> > The umount hangs "as expected", but if I wait long enough I'll eventually get a
> >
> > Jul 24 09:31:16: [ 1163.642060] ceph: loaded (mon/mds/osd proto
> > 15/32/24, osdmap 5/5 5/5)
> > Jul 24 09:31:16: [ 1163.646098] ceph: client4099 fsid
> > b003239e-a249-7c47-f7ca-a9b75da2a445
> > Jul 24 09:31:16: [ 1163.646353] ceph: mon0 192.168.0.3:6789 session established
> > Jul 24 09:32:05: [ 1213.290150] ceph: mon0 192.168.0.3:6789 session
> > lost, hunting for new mon
> > Jul 24 09:33:01: [ 1269.227827] ceph: mds0 caps stale
> > Jul 24 09:33:16: [ 1284.219034] ceph: mds0 caps stale
> > Jul 24 09:35:52: [ 1439.844419] umount        D 0000000000000000     0
> >  2819   2788 0x00000000
> > Jul 24 09:35:52: [ 1439.844425]  ffff880127a5b880 0000000000000086
> > 0000000000000000 0000000000015640
> > Jul 24 09:35:52: [ 1439.844430]  0000000000015640 0000000000015640
> > 000000000000f8a0 ffff880124ef1fd8
> > Jul 24 09:35:52: [ 1439.844435]  0000000000015640 0000000000015640
> > ffff880086c8b170 ffff880086c8b468
> > Jul 24 09:35:52: [ 1439.844439] Call Trace:
> > Jul 24 09:35:52: [ 1439.844455]  [<ffffffffa051b740>] ?
> > ceph_mdsc_sync+0x1be/0x1da [ceph]
> > Jul 24 09:35:52: [ 1439.844462]  [<ffffffff81064afa>] ?
> > autoremove_wake_function+0x0/0x2e
> > Jul 24 09:35:52: [ 1439.844473]  [<ffffffffa05210ac>] ?
> > ceph_osdc_sync+0x1d/0xc1 [ceph]
> > Jul 24 09:35:52: [ 1439.844479]  [<ffffffffa050931f>] ?
> > ceph_syncfs+0x2a/0x2e [ceph]
> > Jul 24 09:35:52: [ 1439.844485]  [<ffffffff8110b065>] ?
> > __sync_filesystem+0x5f/0x70
> > Jul 24 09:35:52: [ 1439.844489]  [<ffffffff8110b1de>] ?
> > sync_filesystem+0x2e/0x44
> > Jul 24 09:35:52: [ 1439.844494]  [<ffffffff810efdfa>] ?
> > generic_shutdown_super+0x21/0xfa
> > Jul 24 09:35:52: [ 1439.844498]  [<ffffffff810eff16>] ? kill_anon_super+0x9/0x40
> > Jul 24 09:35:52: [ 1439.844505]  [<ffffffffa05082ab>] ?
> > ceph_kill_sb+0x24/0x47 [ceph]
> > Jul 24 09:35:52: [ 1439.844509]  [<ffffffff810f05c5>] ?
> > deactivate_super+0x60/0x77
> > Jul 24 09:35:52: [ 1439.844514]  [<ffffffff81102da3>] ? sys_umount+0x2c3/0x2f2
> > Jul 24 09:35:52: [ 1439.844521]  [<ffffffff81010b42>] ?
> > system_call_fastpath+0x16/0x1b
> > Jul 24 09:37:06: [ 1514.085107] ceph: mds0 hung
> > Jul 24 09:37:52: [ 1559.774508] umount        D 0000000000000000     0
> >  2819   2788 0x00000000
> > Jul 24 09:37:52: [ 1559.774514]  ffff880127a5b880 0000000000000086
> > 0000000000000000 0000000000015640
> > Jul 24 09:37:52: [ 1559.774519]  0000000000015640 0000000000015640
> > 000000000000f8a0 ffff880124ef1fd8
> > Jul 24 09:37:52: [ 1559.774524]  0000000000015640 0000000000015640
> > ffff880086c8b170 ffff880086c8b468
> > Jul 24 09:37:52: [ 1559.774528] Call Trace:
> > Jul 24 09:37:52: [ 1559.774545]  [<ffffffffa051b740>] ?
> > ceph_mdsc_sync+0x1be/0x1da [ceph]
> > Jul 24 09:37:52: [ 1559.774552]  [<ffffffff81064afa>] ?
> > autoremove_wake_function+0x0/0x2e
> > Jul 24 09:37:52: [ 1559.774562]  [<ffffffffa05210ac>] ?
> > ceph_osdc_sync+0x1d/0xc1 [ceph]
> > Jul 24 09:37:52: [ 1559.774569]  [<ffffffffa050931f>] ?
> > ceph_syncfs+0x2a/0x2e [ceph]
> > Jul 24 09:37:52: [ 1559.774574]  [<ffffffff8110b065>] ?
> > __sync_filesystem+0x5f/0x70
> > Jul 24 09:37:52: [ 1559.774578]  [<ffffffff8110b1de>] ?
> > sync_filesystem+0x2e/0x44
> > Jul 24 09:37:52: [ 1559.774584]  [<ffffffff810efdfa>] ?
> > generic_shutdown_super+0x21/0xfa
> > Jul 24 09:37:52: [ 1559.774589]  [<ffffffff810eff16>] ? kill_anon_super+0x9/0x40
> > Jul 24 09:37:52: [ 1559.774595]  [<ffffffffa05082ab>] ?
> > ceph_kill_sb+0x24/0x47 [ceph]
> > Jul 24 09:37:52: [ 1559.774600]  [<ffffffff810f05c5>] ?
> > deactivate_super+0x60/0x77
> > Jul 24 09:37:52: [ 1559.774604]  [<ffffffff81102da3>] ? sys_umount+0x2c3/0x2f2
> > Jul 24 09:37:52: [ 1559.774612]  [<ffffffff81010b42>] ?
> > system_call_fastpath+0x16/0x1b
> > (... repeating forever ...)
> >
> > The box now has to be hard powered off, and an fsck will possibly
> > follow the restart...
> >
> > I'm not saying that this situation is not to be expected when testing
> > a system that isn't production-ready; I'm just trying to emphasize that
> > client safety may actually be a blocking point for more people who
> > might otherwise give it a try.
> >
> > Hope this clarifies,
> > Sebastien
> >
> >
> > 2010/7/23 Sage Weil <sage@xxxxxxxxxxxx>:
> >> On Fri, 23 Jul 2010, Sébastien Paolacci wrote:
> >>> Hello Sage,
> >>>
> >>> I would like to emphasize that this issue is somewhat annoying, even
> >>> for experimental purposes: I definitely expect my test server to not
> >>> behave safely, to crash, burn or whatever, but a client-side impact
> >>> as deep as needing a (hard) reboot to recover from a hung ceph really
> >>> prevents me from testing with real-life payloads.
> >>
> >> Maybe you can clarify for me exactly where the problem is.  'umount -f'
> >> should work.  'umount -l' should do a lazy unmount (detach from
> >> namespace), but the actual unmount code may currently hang.  It's
> >> debatable how that can/should be solved, since it's the 'sync' stage that
> >> hangs, and it's not clear we should ever 'give up' on that without an
> >> administrator telling us to (*).
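> >>
> >> In command terms (the mount point is just an example):
> >>
> >>     umount -f /mnt   # force: give up on unresponsive servers and detach
> >>     umount -l /mnt   # lazy: detach from the namespace now; the real
> >>                      # unmount happens later (and may currently hang)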
> >>
> >> What problem do you actually see, though?  Why does it matter, or why do
> >> you care, if the 'umount -l' leaves some kernel threads trying to umount?
> >> Is it just annoying because it Shouldn't Do That, or does it actually
> >> cause a problem for you?
> >>
> >> It may be that if you try to remount the same fs, the old superblock gets
> >> reused, and the mount fails somehow... I haven't tried that.  That would
> >> be an easy fix, though.
> >>
> >> Any clarification would be helpful!  Thanks-
> >> sage
> >>
> >>
> >> * Maybe a hook like /sys/kernel/debug/ceph/.../abort_sync that you can
> >> echo 1 to would be sufficient to make it give up on a sync (in the umount
> >> -l case, the sync prior to the actual unmount).
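> >>
> >> Something along these lines (the knob does not exist yet, and the '...'
> >> stands for the per-client debugfs directory):
> >>
> >>     echo 1 > /sys/kernel/debug/ceph/.../abort_sync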
> >>
> >>
> >>>
> >>> I understand that it's not an easy point, but a lot of my colleagues
> >>> are not really willing to sacrifice even their dev workstations to
> >>> play with this in their spare time... sad world ;)
> >>>
> >>> Sebastien
> >>>
> >>> On Wed, 16 Jun 2010, Peter Niemayer wrote:
> >>> > Hi,
> >>> >
> >>> > trying to "umount" a formerly mounted ceph filesystem that has become
> >>> > unavailable (osd crashed, then mds/mon were shut down using /etc/init.d/ceph
> >>> > stop) results in "umount" hanging forever in
> >>> > "D" state.
> >>> >
> >>> > Strangely, "umount -f" started from another terminal reports
> >>> > the ceph filesystem as not being mounted anymore, which is consistent
> >>> > with what the mount-table says.
> >>> >
> >>> > The kernel keeps emitting the following messages from time to time:
> >>> > > Jun 16 17:25:29 gitega kernel: ceph:  tid 211912 timed out on osd0, will
> >>> > > reset osd
> >>> > > Jun 16 17:25:35 gitega kernel: ceph: mon0 10.166.166.1:6789 connection
> >>> > > failed
> >>> > > Jun 16 17:26:15 gitega last message repeated 4 times
> >>> >
> >>> > I would have expected the "umount" to terminate at least after some generous
> >>> > timeout.
> >>> >
> >>> > Ceph should probably support something like the "soft,intr" options
> >>> > of NFS, because if the only supported way of mounting is one where
> >>> > a client is more or less stuck-until-reboot when the service fails,
> >>> > many potential test-configurations involving Ceph are way too dangerous
> >>> > to try...
> >>>
> >>> Yeah, being able to force it to shut down when servers are unresponsive is
> >>> definitely the intent.  'umount -f' should work.  It sounds like the
> >>> problem is related to the initial 'umount' (which doesn't time out)
> >>> followed by 'umount -f'.
> >>>
> >>> I'm hesitant to add a blanket umount timeout, as that could prevent proper
> >>> writeout of cached data/metadata in some cases.  So I think the goal
> >>> should be that if a normal umount hangs for some reason, you should be
> >>> able to intervene to add the 'force' if things don't go well.
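> >>>
> >>> I.e., the intended sequence when a umount gets stuck would be roughly:
> >>>
> >>>     umount /mnt      # hangs because the servers are unresponsive
> >>>     umount -f /mnt   # then, from another terminal, force it through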
> >>>
> >>> sage
> 
> 
