Re: "umount" of ceph filesystem that has become unavailable hangs forever

Hello Sage,

I was just trying to revive an old thread, but I definitely agree that
I didn't make my point clear enough, sorry for that.

The general idea is that, whatever happens server-side, the client
should be left in a clean state. By clean I mean that, apart from data
explicitly pushed to (or pulled from) the tested ceph share, no other
side effect of the test session should be visible.

The real issue with hung unmounts is obviously not the console being
frozen, but all the subsequent syncs that will follow the same path
(and syncs do happen in real-life scenarios, e.g. when cleanly
halting/restarting a box).

Explicitly aborting the sync (by whatever means) is indeed an
attractive option that would almost solve the issue without straying
too far from sync's expected safe behavior.

As a matter of convenience, though, should I have a few hundred nodes
to restart, I would expect the sync to abort automatically once a
delay I take responsibility for has expired and the kclient is still
convinced that ceph is tragically dead.
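To make that concrete, here is the kind of automatic escalation I have in mind, sketched as a tiny wrapper (hypothetical: the helper name and delays are mine; only `umount`/`umount -f` and coreutils `timeout` are real commands):

```shell
#!/bin/sh
# Sketch of the give-up behaviour I'd like (my wrapper, nothing ceph
# ships): try the soft command, and once a delay I take responsibility
# for has expired, fall back to the hard variant. Assumes coreutils
# `timeout`. With a real mount, the soft command would be
# `umount /mnt/ceph` and the hard one `umount -f /mnt/ceph`.
escalate() {
    soft_cmd=$1
    hard_cmd=$2
    delay=$3
    if timeout "$delay" sh -c "$soft_cmd"; then
        echo "soft path succeeded"
    else
        # soft path hung (or failed): stop waiting and force it
        sh -c "$hard_cmd" && echo "hard path succeeded"
    fi
}

# demo with stand-ins: "sleep 5" plays the hanging umount, "true" the
# forced unmount that always goes through
escalate "sleep 5" "true" 1
```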

So let's go back to a concrete failure case that can bother a client box ;) :
 - a fresh, just-formatted ceph instance is started.
 - the share is mounted on a separate box and a single file is
created (touch /mnt/test).
 - the ceph daemons are hard-killed (pkill -9 on cosd, cmds, cmon) and
the share is unmounted.
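For reference, the same three steps as the literal commands (hedged: the mkcephfs/init-script invocations and the monitor address are from my setup and may differ on yours; wrapped in a function so nothing runs by accident):

```shell
#!/bin/sh
# Repro sketch for the hang described above. Invocations are from my
# 0.2x setup (mkcephfs, init script, monitor at 192.168.0.3) and may
# differ elsewhere. Defined as a function, not called: run `repro`
# yourself on a disposable pair of boxes.
repro() {
    # server box: fresh, just-formatted instance
    mkcephfs -c /etc/ceph/ceph.conf -a
    /etc/init.d/ceph start

    # client box: mount the share and create a single file
    mount -t ceph 192.168.0.3:/ /mnt
    touch /mnt/test

    # server box again: hard-kill the daemons...
    pkill -9 'cosd|cmds|cmon'

    # ...then, back on the client, the unmount hangs in D state:
    umount /mnt
}
```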

The umount hangs "as expected", but if I wait long enough I eventually get a

Jul 24 09:31:16: [ 1163.642060] ceph: loaded (mon/mds/osd proto 15/32/24, osdmap 5/5 5/5)
Jul 24 09:31:16: [ 1163.646098] ceph: client4099 fsid b003239e-a249-7c47-f7ca-a9b75da2a445
Jul 24 09:31:16: [ 1163.646353] ceph: mon0 192.168.0.3:6789 session established
Jul 24 09:32:05: [ 1213.290150] ceph: mon0 192.168.0.3:6789 session lost, hunting for new mon
Jul 24 09:33:01: [ 1269.227827] ceph: mds0 caps stale
Jul 24 09:33:16: [ 1284.219034] ceph: mds0 caps stale
Jul 24 09:35:52: [ 1439.844419] umount        D 0000000000000000     0  2819   2788 0x00000000
Jul 24 09:35:52: [ 1439.844425]  ffff880127a5b880 0000000000000086 0000000000000000 0000000000015640
Jul 24 09:35:52: [ 1439.844430]  0000000000015640 0000000000015640 000000000000f8a0 ffff880124ef1fd8
Jul 24 09:35:52: [ 1439.844435]  0000000000015640 0000000000015640 ffff880086c8b170 ffff880086c8b468
Jul 24 09:35:52: [ 1439.844439] Call Trace:
Jul 24 09:35:52: [ 1439.844455]  [<ffffffffa051b740>] ? ceph_mdsc_sync+0x1be/0x1da [ceph]
Jul 24 09:35:52: [ 1439.844462]  [<ffffffff81064afa>] ? autoremove_wake_function+0x0/0x2e
Jul 24 09:35:52: [ 1439.844473]  [<ffffffffa05210ac>] ? ceph_osdc_sync+0x1d/0xc1 [ceph]
Jul 24 09:35:52: [ 1439.844479]  [<ffffffffa050931f>] ? ceph_syncfs+0x2a/0x2e [ceph]
Jul 24 09:35:52: [ 1439.844485]  [<ffffffff8110b065>] ? __sync_filesystem+0x5f/0x70
Jul 24 09:35:52: [ 1439.844489]  [<ffffffff8110b1de>] ? sync_filesystem+0x2e/0x44
Jul 24 09:35:52: [ 1439.844494]  [<ffffffff810efdfa>] ? generic_shutdown_super+0x21/0xfa
Jul 24 09:35:52: [ 1439.844498]  [<ffffffff810eff16>] ? kill_anon_super+0x9/0x40
Jul 24 09:35:52: [ 1439.844505]  [<ffffffffa05082ab>] ? ceph_kill_sb+0x24/0x47 [ceph]
Jul 24 09:35:52: [ 1439.844509]  [<ffffffff810f05c5>] ? deactivate_super+0x60/0x77
Jul 24 09:35:52: [ 1439.844514]  [<ffffffff81102da3>] ? sys_umount+0x2c3/0x2f2
Jul 24 09:35:52: [ 1439.844521]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Jul 24 09:37:06: [ 1514.085107] ceph: mds0 hung
Jul 24 09:37:52: [ 1559.774508] umount        D 0000000000000000     0  2819   2788 0x00000000
Jul 24 09:37:52: [ 1559.774528] Call Trace:
(... same backtrace as above, repeating forever ...)

The box now has to be hard powered off, and an fsck will possibly
follow the restart...

I'm not saying that this situation is unexpected when testing a system
that isn't production-ready; I'm just trying to emphasize that client
safety may actually be the blocking point that keeps more people from
giving it a try.

Hope this clarifies,
Sebastien


2010/7/23 Sage Weil <sage@xxxxxxxxxxxx>:
> On Fri, 23 Jul 2010, Sébastien Paolacci wrote:
>> Hello Sage,
>>
>> I would like to emphasize that this issue is somewhat annoying, even
>> for experimental purposes: I definitely expect my test server to
>> misbehave, crash, burn or whatever, but having a client-side impact
>> as deep as needing a (hard) reboot to resolve a hung ceph really
>> prevents me from testing with real-life payloads.
>
> Maybe you can clarify for me exactly where the problem is.  'umount -f'
> should work.  'umount -l' should do a lazy unmount (detach from
> namespace), but the actual unmount code may currently hang.  It's
> debatable how that can/should be solved, since it's the 'sync' stage that
> hangs, and it's not clear we should ever 'give up' on that without an
> administrator telling us to (*).
>
> What problem do you actually see, though?  Why does it matter, or why do
> you care, if the 'umount -l' leaves some kernel threads trying to umount?
> Is it just annoying because it Shouldn't Do That, or does it actually
> cause a problem for you?
>
> It may be that if you try to remount the same fs, the old superblock gets
> reused, and the mount fails somehow... I haven't tried that.  That would
> be an easy fix, though.
>
> Any clarification would be helpful!  Thanks-
> sage
>
>
> * Maybe a hook like /sys/kernel/debug/ceph/.../abort_sync that you can
> echo 1 to would be sufficient to make it give up on a sync (in the umount
> -l case, the sync prior to the actual unmount).
>
>
>>
>> I understand that it's not an easy point, but a lot of my colleagues
>> are not really willing to sacrifice even their dev workstations to
>> play during spare time... sad world ;)
>>
>> Sebastien
>>
>> On Wed, 16 Jun 2010, Peter Niemayer wrote:
>> > Hi,
>> >
>> > trying to "umount" a formerly mounted ceph filesystem that has become
>> > unavailable (osd crashed, then msd/mon were shut down using /etc/init.d/ceph
>> > stop) results in "umount" hanging forever in
>> > "D" state.
>> >
>> > Strangely, "umount -f" started from another terminal reports
>> > the ceph filesystem as not being mounted anymore, which is consistent
>> > with what the mount-table says.
>> >
>> > The kernel keeps emitting the following messages from time to time:
>> > > Jun 16 17:25:29 gitega kernel: ceph:  tid 211912 timed out on osd0, will
>> > > reset osd
>> > > Jun 16 17:25:35 gitega kernel: ceph: mon0 10.166.166.1:6789 connection
>> > > failed
>> > > Jun 16 17:26:15 gitega last message repeated 4 times
>> >
>> > I would have expected the "umount" to terminate at least after some generous
>> > timeout.
>> >
>> > Ceph should probably support something like the "soft,intr" options
>> > of NFS, because if the only supported way of mounting is one where
>> > a client is more or less stuck-until-reboot when the service fails,
>> > many potential test-configurations involving Ceph are way too dangerous
>> > to try...
>>
>> Yeah, being able to force it to shut down when servers are unresponsive is
>> definitely the intent.  'umount -f' should work.  It sounds like the
>> problem is related to the initial 'umount' (which doesn't time out)
>> followed by 'umount -f'.
>>
>> I'm hesitant to add a blanket umount timeout, as that could prevent proper
>> writeout of cached data/metadata in some cases.  So I think the goal
>> should be that if a normal umount hangs for some reason, you should be
>> able to intervene to add the 'force' if things don't go well.
>>
>> sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

