Re: "umount" of ceph filesystem that has become unavailable hangs forever

Sage, it looks logical that if the user issues "umount -l", the code
should give up syncing and clear the state. Or possibly there should
be a /proc/...whatever or /sys/...whatever setting to define a default
timeout after which syncing is given up.
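
For illustration, here's a minimal sketch of what such a knob could look
like on the kernel side. This is not the actual kclient code;
sync_timeout_secs, wait_for_sync and the pending counter are made-up
names. The idea is simply to swap an unbounded wait_event() in the sync
path for wait_event_timeout() and let the caller abort on -ETIMEDOUT:

#include <linux/atomic.h>
#include <linux/errno.h>
#include <linux/jiffies.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/wait.h>

/* Hypothetical knob, writable through
 * /sys/module/<module>/parameters/sync_timeout_secs;
 * 0 keeps today's behavior (wait forever). */
static unsigned int sync_timeout_secs;
module_param(sync_timeout_secs, uint, 0644);
MODULE_PARM_DESC(sync_timeout_secs,
        "seconds to wait for a sync before giving up (0 = forever)");

/* Illustrative replacement for an unbounded wait in the sync path:
 * returns 0 once all pending requests are flushed, or -ETIMEDOUT so
 * the caller can abandon the sync and tear down its state instead
 * of blocking umount indefinitely. */
static int wait_for_sync(wait_queue_head_t *wq, atomic_t *pending)
{
        if (!sync_timeout_secs) {
                wait_event(*wq, atomic_read(pending) == 0);
                return 0;
        }
        if (!wait_event_timeout(*wq, atomic_read(pending) == 0,
                                sync_timeout_secs * HZ))
                return -ETIMEDOUT;
        return 0;
}

Whether the -ETIMEDOUT should then surface as an umount error or merely
be logged is of course open for debate.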

2010/7/24 Sébastien Paolacci <sebastien.paolacci@xxxxxxxxx>:
> Hello Sage,
>
> I was just trying to revive an old thread, but I definitely agree that
> I didn't make my point clear enough; sorry for that.
>
> The general idea is that whatever happens server-side, the client
> should be left in a clean state. By clean I mean that, except for data
> explicitly pushed to (or pulled from) the tested ceph share, no other
> side effects from the test session should be visible.
>
> The real issue with hung unmounts is obviously not the console being
> frozen but all the subsequent syncs that are going to follow the same
> path (and syncs do happen in real-life scenarios, e.g. when softly
> halting/restarting a box).
>
> Explicitly aborting the sync (by whatever means) is indeed an
> attractive option that would almost solve the problem without straying
> too far from decent, safe sync behavior.
>
> As a matter of convenience, though, should I have a few hundred nodes
> to restart, I would expect the sync to abort automatically once a delay
> I take responsibility for has expired while the kclient remains deeply
> confident despite ceph's tragic death.
>
> So let's go back to a concrete failure case that can bother a client box ;) :
>  - a fresh, newly formatted ceph instance is started.
>  - the share is mounted on a separate box and a single file is
> created (touch /mnt/test).
>  - the ceph daemons are hard-killed (pkill -9 on cosd, cmds, cmon) and
> the share is unmounted.
>
> The umount hangs "as expected", but if I wait long enough I'll eventually get a
>
> Jul 24 09:31:16: [ 1163.642060] ceph: loaded (mon/mds/osd proto 15/32/24, osdmap 5/5 5/5)
> Jul 24 09:31:16: [ 1163.646098] ceph: client4099 fsid b003239e-a249-7c47-f7ca-a9b75da2a445
> Jul 24 09:31:16: [ 1163.646353] ceph: mon0 192.168.0.3:6789 session established
> Jul 24 09:32:05: [ 1213.290150] ceph: mon0 192.168.0.3:6789 session lost, hunting for new mon
> Jul 24 09:33:01: [ 1269.227827] ceph: mds0 caps stale
> Jul 24 09:33:16: [ 1284.219034] ceph: mds0 caps stale
> Jul 24 09:35:52: [ 1439.844419] umount        D 0000000000000000     0  2819   2788 0x00000000
> Jul 24 09:35:52: [ 1439.844425]  ffff880127a5b880 0000000000000086 0000000000000000 0000000000015640
> Jul 24 09:35:52: [ 1439.844430]  0000000000015640 0000000000015640 000000000000f8a0 ffff880124ef1fd8
> Jul 24 09:35:52: [ 1439.844435]  0000000000015640 0000000000015640 ffff880086c8b170 ffff880086c8b468
> Jul 24 09:35:52: [ 1439.844439] Call Trace:
> Jul 24 09:35:52: [ 1439.844455]  [<ffffffffa051b740>] ? ceph_mdsc_sync+0x1be/0x1da [ceph]
> Jul 24 09:35:52: [ 1439.844462]  [<ffffffff81064afa>] ? autoremove_wake_function+0x0/0x2e
> Jul 24 09:35:52: [ 1439.844473]  [<ffffffffa05210ac>] ? ceph_osdc_sync+0x1d/0xc1 [ceph]
> Jul 24 09:35:52: [ 1439.844479]  [<ffffffffa050931f>] ? ceph_syncfs+0x2a/0x2e [ceph]
> Jul 24 09:35:52: [ 1439.844485]  [<ffffffff8110b065>] ? __sync_filesystem+0x5f/0x70
> Jul 24 09:35:52: [ 1439.844489]  [<ffffffff8110b1de>] ? sync_filesystem+0x2e/0x44
> Jul 24 09:35:52: [ 1439.844494]  [<ffffffff810efdfa>] ? generic_shutdown_super+0x21/0xfa
> Jul 24 09:35:52: [ 1439.844498]  [<ffffffff810eff16>] ? kill_anon_super+0x9/0x40
> Jul 24 09:35:52: [ 1439.844505]  [<ffffffffa05082ab>] ? ceph_kill_sb+0x24/0x47 [ceph]
> Jul 24 09:35:52: [ 1439.844509]  [<ffffffff810f05c5>] ? deactivate_super+0x60/0x77
> Jul 24 09:35:52: [ 1439.844514]  [<ffffffff81102da3>] ? sys_umount+0x2c3/0x2f2
> Jul 24 09:35:52: [ 1439.844521]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
> Jul 24 09:37:06: [ 1514.085107] ceph: mds0 hung
> Jul 24 09:37:52: [ 1559.774508] umount        D 0000000000000000     0  2819   2788 0x00000000
> Jul 24 09:37:52: [ 1559.774514]  ffff880127a5b880 0000000000000086 0000000000000000 0000000000015640
> Jul 24 09:37:52: [ 1559.774519]  0000000000015640 0000000000015640 000000000000f8a0 ffff880124ef1fd8
> Jul 24 09:37:52: [ 1559.774524]  0000000000015640 0000000000015640 ffff880086c8b170 ffff880086c8b468
> Jul 24 09:37:52: [ 1559.774528] Call Trace:
> Jul 24 09:37:52: [ 1559.774545]  [<ffffffffa051b740>] ? ceph_mdsc_sync+0x1be/0x1da [ceph]
> Jul 24 09:37:52: [ 1559.774552]  [<ffffffff81064afa>] ? autoremove_wake_function+0x0/0x2e
> Jul 24 09:37:52: [ 1559.774562]  [<ffffffffa05210ac>] ? ceph_osdc_sync+0x1d/0xc1 [ceph]
> Jul 24 09:37:52: [ 1559.774569]  [<ffffffffa050931f>] ? ceph_syncfs+0x2a/0x2e [ceph]
> Jul 24 09:37:52: [ 1559.774574]  [<ffffffff8110b065>] ? __sync_filesystem+0x5f/0x70
> Jul 24 09:37:52: [ 1559.774578]  [<ffffffff8110b1de>] ? sync_filesystem+0x2e/0x44
> Jul 24 09:37:52: [ 1559.774584]  [<ffffffff810efdfa>] ? generic_shutdown_super+0x21/0xfa
> Jul 24 09:37:52: [ 1559.774589]  [<ffffffff810eff16>] ? kill_anon_super+0x9/0x40
> Jul 24 09:37:52: [ 1559.774595]  [<ffffffffa05082ab>] ? ceph_kill_sb+0x24/0x47 [ceph]
> Jul 24 09:37:52: [ 1559.774600]  [<ffffffff810f05c5>] ? deactivate_super+0x60/0x77
> Jul 24 09:37:52: [ 1559.774604]  [<ffffffff81102da3>] ? sys_umount+0x2c3/0x2f2
> Jul 24 09:37:52: [ 1559.774612]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
> (... repeating forever ...)
>
> The box now has to be hard powered off, and an fsck will possibly
> follow the restart...
>
> I'm not saying that this situation is unexpected when testing a system
> that isn't production-ready; I'm just trying to emphasize that client
> safety may actually be the blocking point that keeps more people from
> giving it a try.
>
> Hope this clarifies,
> Sebastien
>
>
> 2010/7/23 Sage Weil <sage@xxxxxxxxxxxx>:
>> On Fri, 23 Jul 2010, Sébastien Paolacci wrote:
>>> Hello Sage,
>>>
>>> I would like to emphasize that this issue is somewhat annoying, even
>>> for experimentation purposes: I definitely expect my test server to
>>> misbehave, crash, burn or whatever, but a client-side impact as deep
>>> as needing a (hard) reboot to recover from a hung ceph really prevents
>>> me from testing with real-life payloads.
>>
>> Maybe you can clarify for me exactly where the problem is.  'umount -f'
>> should work.  'umount -l' should do a lazy unmount (detach from
>> namespace), but the actual unmount code may currently hang.  It's
>> debatable how that can/should be solved, since it's the 'sync' stage that
>> hangs, and it's not clear we should ever 'give up' on that without an
>> administrator telling us to (*).
>>
>> What problem do you actually see, though?  Why does it matter, or why do
>> you care, if the 'umount -l' leaves some kernel threads trying to umount?
>> Is it just annoying because it Shouldn't Do That, or does it actually
>> cause a problem for you?
>>
>> It may be that if you try to remount the same fs, the old superblock gets
>> reused, and the mount fails somehow... I haven't tried that.  That would
>> be an easy fix, though.
>>
>> Any clarification would be helpful!  Thanks-
>> sage
>>
>>
>> * Maybe a hook like /sys/kernel/debug/ceph/.../abort_sync that you can
>> echo 1 to would be sufficient to make it give up on a sync (in the umount
>> -l case, the sync prior to the actual unmount).
>>
>>
>>>
>>> I understand that it's not an easy point, but a lot of my colleagues
>>> are not really willing to sacrifice even their dev workstations to
>>> play during spare time... sad world ;)
>>>
>>> Sebastien
>>>
>>> On Wed, 16 Jun 2010, Peter Niemayer wrote:
>>> > Hi,
>>> >
>>> > trying to "umount" a formerly mounted ceph filesystem that has become
>>> > unavailable (osd crashed, then mds/mon were shut down using /etc/init.d/ceph
>>> > stop) results in "umount" hanging forever in
>>> > "D" state.
>>> >
>>> > Strangely, "umount -f" started from another terminal reports
>>> > the ceph filesystem as not being mounted anymore, which is consistent
>>> > with what the mount-table says.
>>> >
>>> > The kernel keeps emitting the following messages from time to time:
>>> > > Jun 16 17:25:29 gitega kernel: ceph:  tid 211912 timed out on osd0, will reset osd
>>> > > Jun 16 17:25:35 gitega kernel: ceph: mon0 10.166.166.1:6789 connection failed
>>> > > Jun 16 17:26:15 gitega last message repeated 4 times
>>> >
>>> > I would have expected the "umount" to terminate at least after some generous
>>> > timeout.
>>> >
>>> > Ceph should probably support something like the "soft,intr" options
>>> > of NFS, because if the only supported way of mounting is one where
>>> > a client is more or less stuck-until-reboot when the service fails,
>>> > many potential test-configurations involving Ceph are way too dangerous
>>> > to try...
>>>
>>> Yeah, being able to force it to shut down when servers are unresponsive is
>>> definitely the intent.  'umount -f' should work.  It sounds like the
>>> problem is related to the initial 'umount' (which doesn't time out)
>>> followed by 'umount -f'.
>>>
>>> I'm hesitant to add a blanket umount timeout, as that could prevent proper
>>> writeout of cached data/metadata in some cases.  So I think the goal
>>> should be that if a normal umount hangs for some reason, you should be
>>> able to intervene to add the 'force' if things don't go well.
>>>
>>> sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

