On Thursday 15 December 2011, Gregory Farnum wrote:
> On Thu, Dec 8, 2011 at 5:55 AM, Amon Ott <a.ott@xxxxxxxxxxxx> wrote:
> > Hi folks,
> >
> > if file access through the Ceph kernel client cannot continue, e.g.
> > because there is no MDS available, it hangs forever.
> >
> > I would prefer if, after a timeout, the application got an error
> > code, e.g. the -ESTALE that NFS and Gluster return if something
> > goes wrong. This would allow the application to handle the error
> > instead of blocking forever without a chance to recover.
>
> This is interesting to me — Ceph works very hard to provide POSIX
> semantics, so philosophically the introduction of ESTALE returns is
> not a natural thing for us.
> That doesn't necessarily make it the wrong choice, but since Ceph's
> systems are designed to be self-repairing, the expectation is that
> any outage is a temporary situation that will resolve itself pretty
> quickly. And unlike NFS, which often returns ESTALE when other file
> accesses might succeed, if Ceph fails on an MDS request that's
> pretty much the ballgame. So returning ESTALE seems like a cop-out,
> losing data and behaving unexpectedly without actually doing
> anything to resolve the issues or giving other data a chance to get
> saved — i.e., it's not something we want to do automatically. I
> believe we already honor interrupts, so that you can do things like
> Ctrl-C an application waiting for IO and cancel the operations.

The assumption that any outage resolves itself quickly is unfortunately
wrong. During my tests over the last few months, I have had several
cases where the Ceph services on all nodes crashed one after another
while trying to recover from some error. Each time this resulted in
infinite hangs for every single process that tried to access anything
on Ceph FS, until I killed it manually. For a big cluster this is not
acceptable.

The Ceph design is supposed to handle crashes of single service
instances, but a bug in a service daemon that makes the same kind of
service crash on all nodes is always possible, because they all work on
the same data. There needs to be some reliable way for the client side
to find out that access is unlikely to recover any time soon.

> Can you describe why this behavior interests you (and manual
> interruption is insufficient)? I discussed with a few people the
> possibility of making it an off-by-default mount option (though I'm
> unclear on the technical difficulty involved; I'm not big on our
> kernel stuff); presumably that would be enough for your purposes?

Our server clusters have quite a few cron jobs as well as Nagios health
checks that also access the common data area on Ceph FS for
configuration and status storage. If these jobs hang forever because of
a blocked access, they cannot finish their other tasks - even if that
access is not vital for those tasks. In particular, they can never
return a result. You cannot even shut down the system cleanly if umount
blocks forever.

The systems cannot help themselves unless we add our own timeouts to
every such access and kill the processes once the timeout expires. A
meaningful error code instead of an unconditional kill would be very
helpful, whatever code that is - we could simply handle it in the
scripts.

Please note that I am not talking about rather short-term blocks of 10
seconds or even a minute while the system recovers from a service
crash. After 15 minutes or so I would rather get a meaningful alarm
from an error code than a hanging check that tells me nothing.
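To illustrate the kind of workaround we currently need, here is a rough
sketch (not our actual code) of a Nagios-style check that wraps a
Ceph FS access in an external timeout and kills the child when it
hangs; the mount path /mnt/ceph/status, the 15 minute limit and the
exit codes are only examples. If the kernel client returned an error
code such as -ESTALE instead of blocking, the same handling could be
done without the kill:

#!/usr/bin/env python3
# Rough sketch only: run a Ceph FS access in a child process, kill it
# after a timeout and map the outcome to a Nagios-style exit code.
# Path, timeout and exit codes are examples, not taken from our setup.
import subprocess
import sys

CEPH_PATH = "/mnt/ceph/status"   # example location on the Ceph FS mount
TIMEOUT = 15 * 60                # "after 15 minutes or so"

def check_cephfs():
    try:
        # Any access that can hang when no MDS is available, e.g. a
        # directory listing of the shared status area.
        subprocess.run(["ls", CEPH_PATH],
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL,
                       timeout=TIMEOUT,
                       check=True)
        return 0                                      # OK
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child on timeout; the Ceph kernel
        # client honors the fatal signal, so the check itself returns.
        print("CRITICAL: Ceph FS access hung for %d s, child killed"
              % TIMEOUT)
        return 2
    except subprocess.CalledProcessError:
        # If the kernel client returned an error code such as -ESTALE
        # instead of blocking, ls would fail immediately and we would
        # end up here - no external timeout or kill needed.
        print("CRITICAL: Ceph FS access failed")
        return 2

if __name__ == "__main__":
    sys.exit(check_cephfs())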
An error return would additionally remove the need to check for
previous check runs that are still hanging and to kill them before they
use up too many system resources.

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH           Tel: +49 30 24342334
Am Köllnischen Park 1    Fax: +49 30 24342336
10179 Berlin             http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946
Geschäftsführer: Dipl.-Kfm. Holger Maczkowsky, Roman Maczkowsky
GnuPG-Key-ID: 0x2DD3A649