Apologies for having this discussion in two threads...

On Wed, Jun 5, 2019 at 3:26 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Wed, 2019-06-05 at 14:57 -0700, Patrick Donnelly wrote:
> > On Tue, Jun 4, 2019 at 3:51 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > On Tue, 2019-06-04 at 11:37 +0200, Ilya Dryomov wrote:
> > > > On Mon, Jun 3, 2019 at 11:05 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> > > > > On Mon, Jun 3, 2019 at 1:24 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > > > > > On Mon, Jun 3, 2019 at 1:07 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > > > > > > Can we also discuss how useful it is to allow recovering a mount after it
> > > > > > > has been blacklisted? After we fail everything with EIO and throw out
> > > > > > > all dirty state, how many applications would continue working without
> > > > > > > some kind of restart? And if you are restarting your application, why
> > > > > > > not get a new mount?
> > > > > > >
> > > > > > > IOW, what is the use case for introducing a new debugfs knob that isn't
> > > > > > > that much different from umount+mount?
> > > > > >
> > > > > > People don't like it when their filesystem refuses to umount, which is
> > > > > > what happens when the kernel client can't reconnect to the MDS right
> > > > > > now. I'm not sure there's a practical way to deal with that besides
> > > > > > some kind of computer admin intervention.
> > > > >
> > > > > Furthermore, there are often many applications using the mount (even
> > > > > with containers) and it's not a sustainable position that any
> > > > > network/client/cephfs hiccup requires a remount. Also, an application
> > > >
> > > > Well, it's not just any hiccup. It's one that led to blacklisting...
> > > >
> > > > > that fails because of EIO is easy to deal with a layer above, but a
> > > > > remount usually requires grumpy admin intervention.
> > > >
> > > > I feel like I'm missing something here. Would figuring out $ID,
> > > > obtaining root, and echoing to /sys/kernel/debug/$ID/control make the
> > > > admin less grumpy, especially when containers are involved?
> > > >
> > > > Doing the force_reconnect thing would retain the mount point, but how
> > > > much use would it be? Would using existing (i.e. pre-blacklist) file
> > > > descriptors be allowed? I assumed it wouldn't be (permanent EIO or
> > > > something of that sort), so maybe that is the piece I'm missing...
> > >
> > > I agree with Ilya here. I don't see how applications can just pick up
> > > where they left off after being blacklisted. Remounting in some fashion
> > > is really the only recourse here.
> > >
> > > To be clear, what happens to stateful objects (open files, byte-range
> > > locks, etc.) in this scenario? Were you planning to just re-open files
> > > and re-request locks that you held before being blacklisted? If so, that
> > > sounds like a great way to cause some silent data corruption...
> >
> > The plan is:
> >
> > - files open for reading re-obtain caps and may continue to be used
> > - files open for writing discard all dirty file blocks and return -EIO
> >   on further use (this could be configurable via a mount_option like
> >   with the ceph-fuse client)
>
> That sounds fairly reasonable.
>
> > Not sure how best to handle locks, and I'm open to suggestions. We
> > could raise SIGLOST on those processes?
>
> Unfortunately, SIGLOST has never really been a thing on Linux. There was
> an attempt by Anna Schumaker a few years ago to implement it for use
> with NFS, but it never went in.
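(A brief aside before the signal question below.) To make the plan quoted
above, and the "easy to deal with a layer above" argument, concrete, here is
a minimal, hypothetical application-side sketch. It assumes only the proposed
semantics: after the client is blacklisted and then recovers, reads on existing
descriptors keep working once caps are re-obtained, while writes on descriptors
that had dirty data fail with EIO until the file is reopened. The helper name
and the reopen/redo policy are illustrative, not part of any existing interface.

    /*
     * Hypothetical application-side handling of the proposed semantics:
     * after blacklisting + recovery, a write on an fd that had dirty data
     * returns EIO; the only safe recovery is to reopen and redo the write.
     * (Sketch only -- write_retry() and the O_APPEND policy are illustrative.)
     */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static ssize_t write_retry(int *fd, const char *path,
                               const void *buf, size_t len)
    {
            ssize_t ret = write(*fd, buf, len);

            if (ret < 0 && errno == EIO) {
                    /* Dirty data was discarded when the session was lost. */
                    fprintf(stderr, "lost session, reopening %s: %s\n",
                            path, strerror(errno));
                    close(*fd);
                    *fd = open(path, O_WRONLY | O_APPEND);
                    if (*fd < 0)
                            return -1;
                    ret = write(*fd, buf, len); /* redo the failed write */
            }
            return ret;
    }

A reader-only application would need no changes at all under this plan; only
writers (and lock holders, discussed next) have to cope with the EIO.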
Is there another signal we could reasonably use?

> We ended up with this patch, IIRC:
>
> https://patchwork.kernel.org/patch/10108419/
>
> "The current practice is to set NFS_LOCK_LOST so that read/write returns
> EIO when a lock is lost. So, change these comments to code when sets
> NFS_LOCK_LOST."
>
> Maybe we should aim for similar behavior in this situation. It's a
> little trickier here since we don't really have an analogue to a lock
> stateid in ceph, so we'd need to implement this in some other way.

So effectively blacklist the process so all I/O is blocked on the mount?
Do I understand correctly?

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
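For reference, here is a rough sketch of what the NFS-style "lost lock means
read/write returns EIO" behavior discussed above could look like from an
application's point of view, assuming CephFS adopted something similar.
Nothing below is an existing CephFS or kernel interface; relock_and_retry()
and its policy are purely illustrative. The point is that, even without
SIGLOST, an application holding a POSIX lock can detect the loss via EIO,
re-acquire the lock, and redo whatever update was in flight.

    /*
     * Hypothetical: if a POSIX lock is lost while the client is blacklisted,
     * subsequent read()/write() on that file return EIO (as NFS does via
     * NFS_LOCK_LOST). The application re-acquires the lock and redoes the
     * interrupted update rather than trusting any partially written state.
     */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int relock_and_retry(int fd, const void *buf, size_t len)
    {
            struct flock fl = {
                    .l_type   = F_WRLCK,
                    .l_whence = SEEK_SET,
                    .l_start  = 0,
                    .l_len    = 0,  /* whole file */
            };

            if (write(fd, buf, len) >= 0)
                    return 0;
            if (errno != EIO)
                    return -1;

            /* Lock was lost; take it again and assume another writer may
             * have run in the meantime, so the update must be redone. */
            if (fcntl(fd, F_SETLKW, &fl) < 0)
                    return -1;
            fprintf(stderr, "re-acquired lock after session loss; redoing update\n");
            return write(fd, buf, len) < 0 ? -1 : 0;
    }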