Hi Amon,

There is a long-standing item in the issue tracker to add a 'soft' mode to the kernel client, analogous to the NFS mount option; I think this is essentially what you're asking for. The general problem is that timeouts lead to timing-dependent behavior and non-deterministic results. Most people would prefer 'correct' application behavior that waits for a long time over a shorter delay followed by spurious errors that the application may or may not handle properly. So ESTALE will never be the default, but it is something we can make optional.

If it is purely a matter of raising an appropriate alarm without accumulating blocked Nagios checks, that is something we can solve more easily (e.g., by checking client health in debugfs).

FWIW, here is what the nfs(5) man page says about it:

    NB: A so-called "soft" timeout can cause silent data corruption in
    certain cases. As such, use the soft option only when client
    responsiveness is more important than data integrity. Using NFS over
    TCP or increasing the value of the retrans option may mitigate some
    of the risks of using the soft option.

sage

On Fri, 16 Dec 2011, Amon Ott wrote:
> On Thursday 15 December 2011 wrote Gregory Farnum:
> > On Thu, Dec 8, 2011 at 5:55 AM, Amon Ott <a.ott@xxxxxxxxxxxx> wrote:
> > > Hi folks,
> > >
> > > if file access through the Ceph kernel client cannot continue, e.g. because
> > > there is no MDS available, it hangs forever.
> > >
> > > I would prefer if, after a timeout, the application would get an error
> > > code, e.g. the -ESTALE that NFS and Gluster return if something goes
> > > wrong. This would allow the application to handle the error instead
> > > of blocking forever without a chance to recover.
> >
> > This is interesting to me -- Ceph works very hard to provide POSIX
> > semantics, and so philosophically the introduction of ESTALE returns is
> > not a natural thing for us.
> > That doesn't necessarily make it the wrong choice, but since Ceph's
> > systems are designed to be self-repairing, the expectation is that any
> > outage is a temporary situation that will resolve itself pretty
> > quickly. And unlike NFS, which often returns ESTALE when other file
> > accesses might succeed, if Ceph fails on an MDS request that's pretty
> > much the ballgame. So returning ESTALE seems like a cop-out,
> > losing data and behaving unexpectedly without actually doing anything
> > to resolve the issues or giving other data a chance to get saved -- i.e.,
> > it's not something we want to do automatically. I believe we already
> > honor interrupts, so you can do things like Ctrl-C an application
> > waiting for IO and cancel the operations.
>
> The assumption that any outage resolves itself quickly is unfortunately wrong.
> During my tests over some months now, I have had several cases where Ceph
> services on all nodes crashed one after another while trying to recover from
> some error. This always resulted in infinite hangs for every single process
> that tried to access something on Ceph FS, until I sent a kill manually. For
> a big cluster this is not acceptable.
>
> The Ceph design is supposed to handle crashes of single service instances, but
> a bug in a service daemon that makes the same kind of service crash on all
> nodes is always possible, because they all have the same data. There needs to
> be some reliable way to find out on the client side that access is not likely
> to recover any time soon.
>
> > Can you describe why this behavior interests you (and why manual
> > interruption is insufficient)? I discussed with a few people the
> > possibility of making it an off-by-default mount option (though I'm
> > unclear on the technical difficulty involved; I'm not big on our
> > kernel stuff); presumably that would be enough for your purposes?
> Our server clusters have quite a few cron jobs, as well as Nagios health
> checks, that also access the common data area on Ceph FS for configuration and
> status storage. If these jobs hang forever because of a blocked access, they
> cannot finish their other tasks -- even if that access is not vital for those
> other tasks. In particular, they can never return a result. You cannot even
> shut down the system cleanly if umount blocks forever.
>
> The systems cannot help themselves unless we add our own timeouts to any such
> access and kill the processes after the timeout. A meaningful error code
> instead of an unconditional kill would be very helpful, whatever code that is
> -- we could simply handle it in the scripts.
>
> Please note that I am not talking about rather short-term blocks of 10s or
> even a minute while the system recovers from some service crash. After 15
> minutes or so I would rather get a meaningful alarm from some error than a
> hanging check that tells me nothing. Additionally, this removes the need to
> check for hanging previous checks before they use up too many system
> resources.
>
> Amon Ott
> --
> Dr. Amon Ott
> m-privacy GmbH          Tel: +49 30 24342334
> Am Köllnischen Park 1   Fax: +49 30 24342336
> 10179 Berlin            http://www.m-privacy.de
>
> Amtsgericht Charlottenburg, HRB 84946
>
> Geschäftsführer:
> Dipl.-Kfm. Holger Maczkowsky,
> Roman Maczkowsky
>
> GnuPG-Key-ID: 0x2DD3A649
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
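[Editor's note: until a 'soft' mount mode exists, the client-side workaround Amon describes -- wrapping each check in its own timeout and killing it if the mount hangs -- can be sketched as below. This is an illustrative sketch, not part of Ceph: the mount path, timeout value, and choice of error codes are all assumptions, and it relies on the point above that the Ceph kernel client honors signals, so the killed child is normally reaped.]

```python
import errno
import subprocess

# Hypothetical mount point and timeout -- adjust for your site.
CEPH_PATH = "/mnt/ceph/status"
TIMEOUT_S = 900  # ~15 minutes, per the discussion above


def check_with_timeout(path, timeout):
    """Probe a possibly hung mount from a child process.

    Running stat(1) in a child means the monitoring script itself never
    blocks: if the kernel client hangs, subprocess.run() kills the child
    when the timeout expires and we map the hang to a distinct error
    code instead of accumulating blocked Nagios checks.

    Caveat: if the child is stuck in uninterruptible sleep, even SIGKILL
    may not reap it immediately; per the thread above, the Ceph client
    honors interrupts, so a kill should normally work.
    """
    try:
        subprocess.run(
            ["stat", path],
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
        )
        return 0  # mount responded; check passed
    except subprocess.TimeoutExpired:
        return errno.ESTALE  # treat a hang as the 'stale' case
    except subprocess.CalledProcessError:
        return errno.EIO  # stat failed outright (e.g. path missing)


if __name__ == "__main__":
    rc = check_with_timeout(CEPH_PATH, TIMEOUT_S)
    print("check result:", rc)
```

A Nagios plugin wrapper could then translate the return value into the usual OK/CRITICAL exit statuses, which is essentially the "meaningful error code we could simply handle in the scripts" that Amon asks for.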