On Thursday 15 December 2011, Gregory Farnum wrote:
> On Thu, Dec 8, 2011 at 5:55 AM, Amon Ott <a.ott@xxxxxxxxxxxx> wrote:
> > Hi folks,
> >
> > if file access through the Ceph kernel client cannot continue, e.g.
> > because there is no MDS available, it hangs forever.
> >
> > I would prefer if, after a timeout, the application got an error
> > code, e.g. the -ESTALE that NFS and Gluster return if something
> > goes wrong. This would allow the application to handle the error
> > instead of blocking forever without a chance to recover.
>
> This is interesting to me — Ceph works very hard to provide POSIX
> semantics, so philosophically the introduction of ESTALE returns is
> not a natural thing for us.
> That doesn't necessarily make it the wrong choice, but since Ceph's
> systems are designed to be self-repairing, the expectation is that
> any outage is a temporary situation that will resolve itself pretty
> quickly. And unlike NFS, which often returns ESTALE when other file
> accesses might succeed, if Ceph fails on an MDS request that's
> pretty much the ballgame. So returning ESTALE seems like a cop-out,
> losing data and behaving unexpectedly without actually doing
> anything to resolve the issues or giving other data a chance to get
> saved — i.e., it's not something we want to do automatically. I
> believe we already honor interrupts, so that you can do things like
> Ctrl-C an application waiting for IO and cancel the operations.

The assumption that any outage resolves itself quickly is unfortunately
wrong. During my tests over the last few months, I have had several
cases where the Ceph services on all nodes crashed one after another
while trying to recover from some error. Each time this resulted in
infinite hangs for every single process that tried to access anything
on Ceph FS, until I killed it manually. For a big cluster this is not
acceptable.

The Ceph design is supposed to handle crashes of single service
instances, but a bug in a service daemon that makes the same kind of
service crash on all nodes is always possible, because they all work on
the same data. There needs to be some reliable way for the client side
to find out that access is unlikely to recover any time soon.

> Can you describe why this behavior interests you (and manual
> interruption is insufficient)? I discussed with a few people the
> possibility of making it an off-by-default mount option (though I'm
> unclear on the technical difficulty involved; I'm not big on our
> kernel stuff); presumably that would be enough for your purposes?

Our server clusters have quite a few cron jobs as well as Nagios health
checks that also access the common data area on Ceph FS for
configuration and status storage. If these jobs hang forever because of
a blocked access, they cannot finish their other tasks - even if that
access is not vital for those tasks. In particular, they can never
return a result. You cannot even shut down the system cleanly if umount
blocks forever.

The systems cannot help themselves unless we add our own timeouts to
every such access and kill the processes once the timeout expires. A
meaningful error code instead of an unconditional kill would be very
helpful, whatever code that is - we could simply handle it in the
scripts.

Please note that I am not talking about rather short-term blocks of 10
seconds or even a minute while the system recovers from a service
crash. After 15 minutes or so I would rather get a meaningful alarm
from an error code than a hanging check that tells me nothing.
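To illustrate the kind of workaround we currently need, here is a rough
sketch (not our actual code) of a Nagios-style check that wraps a
Ceph FS access in an external timeout and kills the child when it
hangs; the mount path /mnt/ceph/status, the 15 minute limit and the
exit codes are only examples. If the kernel client returned an error
code such as -ESTALE instead of blocking, the same handling could be
done without the kill:

#!/usr/bin/env python3
# Rough sketch only: run a Ceph FS access in a child process, kill it
# after a timeout and map the outcome to a Nagios-style exit code.
# Path, timeout and exit codes are examples, not taken from our setup.
import subprocess
import sys

CEPH_PATH = "/mnt/ceph/status"   # example location on the Ceph FS mount
TIMEOUT = 15 * 60                # "after 15 minutes or so"

def check_cephfs():
    try:
        # Any access that can hang when no MDS is available, e.g. a
        # directory listing of the shared status area.
        subprocess.run(["ls", CEPH_PATH],
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL,
                       timeout=TIMEOUT,
                       check=True)
        return 0                                      # OK
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child on timeout; the Ceph kernel
        # client honors the fatal signal, so the check itself returns.
        print("CRITICAL: Ceph FS access hung for %d s, child killed"
              % TIMEOUT)
        return 2
    except subprocess.CalledProcessError:
        # If the kernel client returned an error code such as -ESTALE
        # instead of blocking, ls would fail immediately and we would
        # end up here - no external timeout or kill needed.
        print("CRITICAL: Ceph FS access failed")
        return 2

if __name__ == "__main__":
    sys.exit(check_cephfs())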
An error return would additionally remove the need to check for
previous check runs that are still hanging and to kill them before they
use up too many system resources.

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH           Tel: +49 30 24342334
Am Köllnischen Park 1    Fax: +49 30 24342336
10179 Berlin             http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946
Geschäftsführer: Dipl.-Kfm. Holger Maczkowsky, Roman Maczkowsky
GnuPG-Key-ID: 0x2DD3A649