Re: [PATCH 2/3] ceph: add method that forces client to reconnect using new entity addr

On Wed, 2019-06-05 at 16:18 -0700, Patrick Donnelly wrote:
> Apologies for having this discussion in two threads...
> 
> On Wed, Jun 5, 2019 at 3:26 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > On Wed, 2019-06-05 at 14:57 -0700, Patrick Donnelly wrote:
> > > On Tue, Jun 4, 2019 at 3:51 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > On Tue, 2019-06-04 at 11:37 +0200, Ilya Dryomov wrote:
> > > > > On Mon, Jun 3, 2019 at 11:05 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> > > > > > On Mon, Jun 3, 2019 at 1:24 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > > > > > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > > > > > > > Can we also discuss how useful it is to allow a mount to recover after it
> > > > > > > > has been blacklisted?  After we fail everything with EIO and throw out
> > > > > > > > all dirty state, how many applications would continue working without
> > > > > > > > some kind of restart?  And if you are restarting your application, why
> > > > > > > > not get a new mount?
> > > > > > > > 
> > > > > > > > IOW what is the use case for introducing a new debugfs knob that isn't
> > > > > > > > that much different from umount+mount?
> > > > > > > 
> > > > > > > People don't like it when their filesystem refuses to umount, which is
> > > > > > > what happens when the kernel client can't reconnect to the MDS right
> > > > > > > now. I'm not sure there's a practical way to deal with that besides
> > > > > > > some kind of computer admin intervention.
> > > > > > 
> > > > > > Furthermore, there are often many applications using the mount (even
> > > > > > with containers) and it's not a sustainable position that any
> > > > > > network/client/cephfs hiccup requires a remount. Also, an application
> > > > > 
> > > > > Well, it's not just any hiccup.  It's one that led to blacklisting...
> > > > > 
> > > > > > that fails because of EIO is easy to deal with a layer above, but a
> > > > > > remount usually requires grumpy admin intervention.
> > > > > 
> > > > > I feel like I'm missing something here.  Would figuring out $ID,
> > > > > obtaining root and echoing to /sys/kernel/debug/$ID/control make the
> > > > > admin less grumpy, especially when containers are involved?
> > > > > 
> > > > > Doing the force_reconnect thing would retain the mount point, but how
> > > > > much use would it be?  Would using existing (i.e. pre-blacklist) file
> > > > > descriptors be allowed?  I assumed it wouldn't be (permanent EIO or
> > > > > something of that sort), so maybe that is the piece I'm missing...
> > > > > 
> > > > 
> > > > I agree with Ilya here. I don't see how applications can just pick up
> > > > where they left off after being blacklisted. Remounting in some fashion
> > > > is really the only recourse here.
> > > > 
> > > > To be clear, what happens to stateful objects (open files, byte-range
> > > > locks, etc.) in this scenario? Were you planning to just re-open files
> > > > and re-request locks that you held before being blacklisted? If so, that
> > > > sounds like a great way to cause some silent data corruption...
> > > 
> > > The plan is:
> > > 
> > > - files open for reading re-obtain caps and may continue to be used
> > > - files open for writing discard all dirty file blocks and return -EIO
> > > on further use (this could be configurable via a mount_option like
> > > with the ceph-fuse client)
> > > 
> > 
> > That sounds fairly reasonable.
> > 
> > > Not sure how best to handle locks and I'm open to suggestions. We
> > > could raise SIGLOST on those processes?
> > > 
> > 
> > Unfortunately, SIGLOST has never really been a thing on Linux. There was
> > an attempt by Anna Schumaker a few years ago to implement it for use
> > with NFS, but it never went in.
> 
> Is there another signal we could reasonably use?
> 

Not really. The problem is that SIGLOST is not even defined. In
fact, if you look at the asm-generic/signal.h header:

#define SIGIO           29
#define SIGPOLL         SIGIO
/*
#define SIGLOST         29
*/

So, there it is, commented out, and it shares a value with SIGIO. We
could pick another value for it, of course, but then you'd have to get
it into userland headers too. All of that sounds like a giant PITA.

> > We ended up with this patch, IIRC:
> > 
> >     https://patchwork.kernel.org/patch/10108419/
> > 
> > "The current practice is to set NFS_LOCK_LOST so that read/write returns
> >  EIO when a lock is lost. So, change these comments into code that sets
> > NFS_LOCK_LOST."
> > 
> > Maybe we should aim for similar behavior in this situation. It's a
> > little trickier here since we don't really have an analogue to a lock
> > stateid in ceph, so we'd need to implement this in some other way.
> 
> So effectively blacklist the process so all I/O is blocked on the
> mount? Do I understand correctly?
> 

No. I think in practice what we'd want to do is "invalidate" any file
descriptions that were open before the blacklisting where locks were
lost. Attempts to do reads or writes against those fd's would get back
an error (EIO, most likely).

File descriptions that didn't have any lost locks could carry on working
as normal after reacquiring caps. We could also consider a module
parameter or something to allow reclaim of lost locks too (in violation
of continuity rules), like the recover_lost_locks parameter in nfs.ko.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>
