On Tue, 27 Jul 2010, Anton VG wrote:
> Sage, it looks logical that if the user issues "umount -l" the code
> should give up syncing and clear the state. Or possibly there should
> be a /proc/...whatever or /sys/...whatever setting to define a default
> timeout to give up syncing.

Yeah, I suspect blanket timeouts are going to be the only way to really
resolve this.  I played around with it a bit yesterday, and the problem
is that even if I make the ceph sync_fs hooks time out (or make them
killable via SIGKILL), a 'sync' still hangs in the generic VFS code when
it tries to write out dirty inodes.

I think a 'soft' mount option that allows any server operation to time
out is the way to go.  Currently we behave like nfs's 'hard':

    soft   If an NFS file operation has a major timeout then report an
           I/O error to the calling program.  The default is to continue
           retrying NFS file operations indefinitely.

    hard   If an NFS file operation has a major timeout then report
           "server not responding" on the console and continue retrying
           indefinitely.

This is http://tracker.newdream.net/issues/206.

sage
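A minimal sketch of the 'soft' idea, bounding the kclient's sync wait
with wait_event_timeout(); the sync_soft_timeout knob, the sync_wq wait
queue, and the sync_caught_up() helper are hypothetical names rather
than actual kclient code, and as noted above the generic VFS writeback
path would still need its own handling:

    #include <linux/wait.h>
    #include <linux/jiffies.h>
    #include <linux/errno.h>
    #include <linux/printk.h>

    /* hypothetical knob: seconds before a 'soft' mount gives up on sync */
    static unsigned int sync_soft_timeout = 60;

    /* hypothetical wait queue, woken as sync requests complete */
    static DECLARE_WAIT_QUEUE_HEAD(sync_wq);

    /* hypothetical: true once all requests up to want_tid have committed */
    static bool sync_caught_up(struct ceph_mds_client *mdsc, u64 want_tid);

    static int ceph_sync_wait(struct ceph_mds_client *mdsc, u64 want_tid,
                              bool soft)
    {
            long left;

            if (!soft) {
                    /* 'hard' behavior (current): wait indefinitely */
                    wait_event(sync_wq, sync_caught_up(mdsc, want_tid));
                    return 0;
            }

            /* 'soft' behavior: give up after the configured timeout */
            left = wait_event_timeout(sync_wq,
                                      sync_caught_up(mdsc, want_tid),
                                      sync_soft_timeout * HZ);
            if (left == 0) {
                    pr_warn("ceph: sync timed out, returning EIO (soft)\n");
                    return -EIO;
            }
            return 0;
    }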
> 2010/7/24 Sébastien Paolacci <sebastien.paolacci@xxxxxxxxx>:
> > Hello Sage,
> >
> > I was just trying to revive an old thread, but I definitely agree
> > that I didn't make my point clear enough, sorry for that.
> >
> > The global idea is that whatever happens server-side, the client
> > should be able to be left in a clean state. By clean I mean that,
> > except for data explicitly pushed to (or pulled from) the tested ceph
> > share, no other side effect from the test session should be visible.
> >
> > The real issue with hung unmounts is obviously not the console being
> > frozen but all the subsequent syncs that are going to follow the same
> > path (and syncs do happen in real-life scenarios, e.g. when softly
> > halting/restarting a box).
> >
> > Explicitly aborting the sync (whatever the way) is indeed a seductive
> > option that would almost solve the point without straying too far
> > from a sync's decent, safe behavior.
> >
> > As a matter of convenience, should I have a few hundred nodes to
> > restart, I would however expect the sync to abort automatically once
> > a delay I take responsibility for has expired while the kclient is
> > still deeply confident despite ceph's tragic death.
> >
> > So let's go back to a concrete failure case that can bother a client box ;) :
> > - a fresh new and just-formatted ceph instance is started.
> > - the share is mounted on a separate box and one single file is
> >   created (touch /mnt/test).
> > - the ceph daemons are hard-killed (pkill -9 on cosd, cmds, cmon) and
> >   the share is unmounted.
> >
> > The umount hangs "as expected", but if I wait long enough I'll
> > eventually get:
> >
> > Jul 24 09:31:16: [ 1163.642060] ceph: loaded (mon/mds/osd proto 15/32/24, osdmap 5/5 5/5)
> > Jul 24 09:31:16: [ 1163.646098] ceph: client4099 fsid b003239e-a249-7c47-f7ca-a9b75da2a445
> > Jul 24 09:31:16: [ 1163.646353] ceph: mon0 192.168.0.3:6789 session established
> > Jul 24 09:32:05: [ 1213.290150] ceph: mon0 192.168.0.3:6789 session lost, hunting for new mon
> > Jul 24 09:33:01: [ 1269.227827] ceph: mds0 caps stale
> > Jul 24 09:33:16: [ 1284.219034] ceph: mds0 caps stale
> > Jul 24 09:35:52: [ 1439.844419] umount        D 0000000000000000     0  2819   2788 0x00000000
> > Jul 24 09:35:52: [ 1439.844425]  ffff880127a5b880 0000000000000086 0000000000000000 0000000000015640
> > Jul 24 09:35:52: [ 1439.844430]  0000000000015640 0000000000015640 000000000000f8a0 ffff880124ef1fd8
> > Jul 24 09:35:52: [ 1439.844435]  0000000000015640 0000000000015640 ffff880086c8b170 ffff880086c8b468
> > Jul 24 09:35:52: [ 1439.844439] Call Trace:
> > Jul 24 09:35:52: [ 1439.844455]  [<ffffffffa051b740>] ? ceph_mdsc_sync+0x1be/0x1da [ceph]
> > Jul 24 09:35:52: [ 1439.844462]  [<ffffffff81064afa>] ? autoremove_wake_function+0x0/0x2e
> > Jul 24 09:35:52: [ 1439.844473]  [<ffffffffa05210ac>] ? ceph_osdc_sync+0x1d/0xc1 [ceph]
> > Jul 24 09:35:52: [ 1439.844479]  [<ffffffffa050931f>] ? ceph_syncfs+0x2a/0x2e [ceph]
> > Jul 24 09:35:52: [ 1439.844485]  [<ffffffff8110b065>] ? __sync_filesystem+0x5f/0x70
> > Jul 24 09:35:52: [ 1439.844489]  [<ffffffff8110b1de>] ? sync_filesystem+0x2e/0x44
> > Jul 24 09:35:52: [ 1439.844494]  [<ffffffff810efdfa>] ? generic_shutdown_super+0x21/0xfa
> > Jul 24 09:35:52: [ 1439.844498]  [<ffffffff810eff16>] ? kill_anon_super+0x9/0x40
> > Jul 24 09:35:52: [ 1439.844505]  [<ffffffffa05082ab>] ? ceph_kill_sb+0x24/0x47 [ceph]
> > Jul 24 09:35:52: [ 1439.844509]  [<ffffffff810f05c5>] ? deactivate_super+0x60/0x77
> > Jul 24 09:35:52: [ 1439.844514]  [<ffffffff81102da3>] ? sys_umount+0x2c3/0x2f2
> > Jul 24 09:35:52: [ 1439.844521]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
> > Jul 24 09:37:06: [ 1514.085107] ceph: mds0 hung
> > Jul 24 09:37:52: [ 1559.774508] umount        D 0000000000000000     0  2819   2788 0x00000000
> > (... same call trace, repeating forever ...)
> >
> > The box now has to be hard powered off, and an fsck will possibly
> > follow the restart...
> >
> > I'm not saying that this situation is not to be expected when testing
> > a system that isn't production-ready; I'm just trying to emphasize
> > that client safety may actually be the blocking point that keeps more
> > people from giving it a try.
> >
> > Hope this clarifies,
> > Sebastien
> > 2010/7/23 Sage Weil <sage@xxxxxxxxxxxx>:
> >> On Fri, 23 Jul 2010, Sébastien Paolacci wrote:
> >>> Hello Sage,
> >>>
> >>> I would like to emphasize that this issue is somewhat annoying, even
> >>> for experimental purposes: I definitely expect my test server to not
> >>> behave safely, to crash, burn or whatever, but having a client-side
> >>> impact deep enough that a (hard) reboot is needed to resolve a hung
> >>> ceph really prevents me from testing with real-life payloads.
> >>
> >> Maybe you can clarify for me exactly where the problem is.  'umount -f'
> >> should work.  'umount -l' should do a lazy unmount (detach from the
> >> namespace), but the actual unmount code may currently hang.  It's
> >> debatable how that can/should be solved, since it's the 'sync' stage
> >> that hangs, and it's not clear we should ever 'give up' on that
> >> without an administrator telling us to (*).
> >>
> >> What problem do you actually see, though?  Why does it matter, or why
> >> do you care, if the 'umount -l' leaves some kernel threads trying to
> >> umount?  Is it just annoying because it Shouldn't Do That, or does it
> >> actually cause a problem for you?
> >>
> >> It may be that if you try to remount the same fs, the old superblock
> >> gets reused, and the mount fails somehow...  I haven't tried that.
> >> That would be an easy fix, though.
> >>
> >> Any clarification would be helpful!  Thanks-
> >> sage
> >>
> >> * Maybe a hook like /sys/kernel/debug/ceph/.../abort_sync that you
> >> can echo 1 to would be sufficient to make it give up on a sync (in
> >> the umount -l case, the sync prior to the actual unmount).
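A sketch of what that debugfs hook could look like; the abort_sync flag,
the sync_wq wait queue, and the helper names are hypothetical, not the
actual kclient code:

    #include <linux/debugfs.h>
    #include <linux/fs.h>
    #include <linux/module.h>
    #include <linux/uaccess.h>
    #include <linux/wait.h>

    static bool abort_sync;                  /* hypothetical per-client flag */
    static DECLARE_WAIT_QUEUE_HEAD(sync_wq); /* hypothetical sync wait queue */

    static ssize_t abort_sync_write(struct file *file, const char __user *buf,
                                    size_t count, loff_t *ppos)
    {
            char c;

            if (!count)
                    return 0;
            if (get_user(c, buf))
                    return -EFAULT;
            if (c == '1') {
                    abort_sync = true;
                    wake_up_all(&sync_wq); /* unblock the hung sync/umount */
            }
            return count;
    }

    static const struct file_operations abort_sync_fops = {
            .owner = THIS_MODULE,
            .write = abort_sync_write,
    };

    /* called from the client's debugfs setup, under its per-fsid dir */
    static void ceph_debugfs_add_abort_sync(struct dentry *parent)
    {
            debugfs_create_file("abort_sync", 0200, parent, NULL,
                                &abort_sync_fops);
    }

The sync wait itself would then have to include the flag in its
condition, e.g. wait_event(sync_wq, caught_up || abort_sync), so that
the write actually unsticks the waiter.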
> >>> I understand that it's not an easy point, but a lot of my
> >>> colleagues are not really willing to sacrifice even their dev
> >>> workstations to play during spare time...  sad world ;)
> >>>
> >>> Sebastien
> >>>
> >>> On Wed, 16 Jun 2010, Peter Niemayer wrote:
> >>> > Hi,
> >>> >
> >>> > trying to "umount" a formerly mounted ceph filesystem that has
> >>> > become unavailable (osd crashed, then mds/mon were shut down using
> >>> > /etc/init.d/ceph stop) results in "umount" hanging forever in
> >>> > "D" state.
> >>> >
> >>> > Strangely, "umount -f" started from another terminal reports
> >>> > the ceph filesystem as not being mounted anymore, which is
> >>> > consistent with what the mount table says.
> >>> >
> >>> > The kernel keeps emitting the following messages from time to time:
> >>> > > Jun 16 17:25:29 gitega kernel: ceph: tid 211912 timed out on
> >>> > > osd0, will reset osd
> >>> > > Jun 16 17:25:35 gitega kernel: ceph: mon0 10.166.166.1:6789
> >>> > > connection failed
> >>> > > Jun 16 17:26:15 gitega last message repeated 4 times
> >>> >
> >>> > I would have expected the "umount" to terminate at least after
> >>> > some generous timeout.
> >>> >
> >>> > Ceph should probably support something like the "soft,intr"
> >>> > options of NFS, because if the only supported way of mounting is
> >>> > one where a client is more or less stuck-until-reboot when the
> >>> > service fails, many potential test configurations involving Ceph
> >>> > are way too dangerous to try...
> >>>
> >>> Yeah, being able to force it to shut down when servers are
> >>> unresponsive is definitely the intent.  'umount -f' should work.
> >>> It sounds like the problem is related to the initial 'umount'
> >>> (which doesn't time out) followed by 'umount -f'.
> >>>
> >>> I'm hesitant to add a blanket umount timeout, as that could prevent
> >>> proper writeout of cached data/metadata in some cases.  So I think
> >>> the goal should be that if a normal umount hangs for some reason,
> >>> you should be able to intervene to add the 'force' if things don't
> >>> go well.
> >>>
> >>> sage
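For reference on the 'force' side: 'umount -f' (MNT_FORCE) reaches the
filesystem through the ->umount_begin() super_operations hook, which is
the natural place to fail outstanding requests so the subsequent
shutdown can't block on a dead server.  A sketch, with the two abort
helpers and the ceph_client layout being assumptions rather than the
actual code:

    #include <linux/fs.h>

    static void ceph_umount_begin(struct super_block *sb)
    {
            struct ceph_client *client = sb->s_fs_info; /* layout assumed */

            /*
             * Hypothetical helpers: complete all queued mds/osd requests
             * with an error and wake anything waiting on them, so sync,
             * writeback and the unmount itself stop blocking.
             */
            ceph_mdsc_abort_requests(client->mdsc);
            ceph_osdc_abort_requests(client->osdc);
    }

    static const struct super_operations ceph_sops = {
            /* ... existing ops ... */
            .umount_begin = ceph_umount_begin,
    };

This matches the intent above: a plain umount keeps retrying, and it is
the administrator's explicit 'umount -f' that authorizes giving up on
unwritten data.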