On Fri, 23 Jul 2010, Sébastien Paolacci wrote:
> Hello Sage,
>
> I would like to emphasize that this issue is somewhat annoying, even
> for experimental purposes: I definitely expect my test server not to
> behave safely, to crash, burn or whatever, but a client-side impact so
> deep that it takes a (hard) reboot to recover from a hung ceph mount
> really prevents me from testing with real-life payloads.

Maybe you can clarify for me exactly where the problem is. 'umount -f'
should work. 'umount -l' should do a lazy unmount (detach from
namespace), but the actual unmount code may currently hang. It's
debatable how that can/should be solved, since it's the 'sync' stage
that hangs, and it's not clear we should ever 'give up' on that without
an administrator telling us to (*).

What problem do you actually see, though? Why does it matter, or why do
you care, if the 'umount -l' leaves some kernel threads trying to
umount? Is it just annoying because it Shouldn't Do That, or does it
actually cause a problem for you?

It may be that if you try to remount the same fs, the old superblock
gets reused, and the mount fails somehow... I haven't tried that. That
would be an easy fix, though.

Any clarification would be helpful! Thanks-
sage

* Maybe a hook like /sys/kernel/debug/ceph/.../abort_sync that you can
  echo 1 to would be sufficient to make it give up on a sync (in the
  umount -l case, the sync prior to the actual unmount).

>
> I understand that it's not an easy point, but a lot of my colleagues
> are not really willing to sacrifice even their dev workstation to play
> during spare time... sad world ;)
>
> Sebastien
>
> On Wed, 16 Jun 2010, Peter Niemayer wrote:
> > Hi,
> >
> > trying to "umount" a formerly mounted ceph filesystem that has become
> > unavailable (osd crashed, then mds/mon were shut down using /etc/init.d/ceph
> > stop) results in "umount" hanging forever in
> > "D" state.
> >
> > Strangely, "umount -f" started from another terminal reports
> > the ceph filesystem as not being mounted anymore, which is consistent
> > with what the mount-table says.
> >
> > The kernel keeps emitting the following messages from time to time:
> > > Jun 16 17:25:29 gitega kernel: ceph: tid 211912 timed out on osd0, will
> > > reset osd
> > > Jun 16 17:25:35 gitega kernel: ceph: mon0 10.166.166.1:6789 connection
> > > failed
> > > Jun 16 17:26:15 gitega last message repeated 4 times
> >
> > I would have expected the "umount" to terminate at least after some generous
> > timeout.
> >
> > Ceph should probably support something like the "soft,intr" options
> > of NFS, because if the only supported way of mounting is one where
> > a client is more or less stuck-until-reboot when the service fails,
> > many potential test-configurations involving Ceph are way too dangerous
> > to try...
>
> Yeah, being able to force it to shut down when servers are unresponsive is
> definitely the intent. 'umount -f' should work. It sounds like the
> problem is related to the initial 'umount' (which doesn't time out)
> followed by 'umount -f'.
>
> I'm hesitant to add a blanket umount timeout, as that could prevent proper
> writeout of cached data/metadata in some cases. So I think the goal
> should be that if a normal umount hangs for some reason, you should be
> able to intervene to add the 'force' if things don't go well.
>
> sage
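
For illustration only, the "let a normal umount run, then intervene with
the force" flow described above could be scripted roughly as follows.
This is a minimal Python sketch, not anything that ships with ceph. The
function name, the timeout default and the umount_cmd parameter are
made up for the example; only 'umount' and 'umount -f' come from the
thread. It is untested against a hung ceph mount, and the original
report suggests that a 'umount -f' issued after a hung plain 'umount'
may simply report the filesystem as not mounted.

```python
import subprocess


def unmount_with_escalation(mountpoint, timeout_s=30.0, umount_cmd=("umount",)):
    """Run a plain umount; if it has not finished after timeout_s, run umount -f.

    Returns ("clean", rc) when the plain umount finished in time, otherwise
    ("forced", rc) with the exit status of the forced umount.
    umount_cmd is a hypothetical hook so the command can be swapped out.
    """
    # Start the normal umount so cached data/metadata gets a chance to be
    # written out before anything is forced.
    plain = subprocess.Popen([*umount_cmd, mountpoint])
    try:
        return ("clean", plain.wait(timeout=timeout_s))
    except subprocess.TimeoutExpired:
        # The plain umount is still running (possibly stuck in "D" state,
        # where it cannot be killed), so leave it alone and escalate.
        forced = subprocess.run([*umount_cmd, "-f", mountpoint])
        return ("forced", forced.returncode)
```

Run as root, a call might look like
unmount_with_escalation("/mnt/ceph", timeout_s=60), where /mnt/ceph is a
placeholder mount point. The timeout only decides when to escalate; it
does not abandon the original umount, which keeps the decision to give
up on the sync with the administrator.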