On Thu, Feb 20, 2020 at 12:49 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Wed, 2020-02-19 at 14:49 -0800, Patrick Donnelly wrote:
> > Responding to you and Ilya both:
> >
> > On Wed, Feb 19, 2020 at 1:21 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > On Wed, 2020-02-19 at 21:42 +0100, Ilya Dryomov wrote:
> > > > On Wed, Feb 19, 2020 at 8:22 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> > > > > On Tue, Feb 18, 2020 at 6:59 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > > > > > > Yeah, I've mostly done this using DROP rules when I needed to test
> > > > > > > things. But, I think I was probably just guilty of speculating out
> > > > > > > loud here.
> > > > > >
> > > > > > I'm not sure what exactly Xiubo meant by "fulfilling" iptables rules
> > > > > > in libceph, but I will say that any kind of iptables manipulation from
> > > > > > within libceph is probably out of the question.
> > > > >
> > > > > I think we're getting confused about two thoughts on iptables: (1) to
> > > > > use iptables to effectively partition the mount instead of this new
> > > > > halt option; (2) use iptables in concert with halt to prevent FIN
> > > > > packets from being sent when the sockets are closed. I think we all
> > > > > agree (2) is not going to happen.
> > > >
> > > > Right.
> > > >
> > > > > > > I think doing this by just closing down the sockets is probably fine.
> > > > > > > I wouldn't pursue anything relating to iptables here, unless we have
> > > > > > > some larger reason to go that route.
> > > > > >
> > > > > > IMO investing into a set of iptables and tc helpers for teuthology
> > > > > > makes a _lot_ of sense. It isn't exactly the same as a cable pull,
> > > > > > but it's probably the next best thing. First, it will be external to
> > > > > > the system under test. Second, it can be made selective -- you can
> > > > > > cut a single session or all of them, simulate packet loss and latency
> > > > > > issues, etc. Third, it can be used for recovery and failover/fencing
> > > > > > testing -- what happens when these packets get delivered two minutes
> > > > > > later? None of this is possible with something that just attempts to
> > > > > > wedge the mount and acts as a point of no return.
> > > > >
> > > > > This sounds attractive but it does require each mount to have its own
> > > > > IP address? Or are there options? Maybe the kernel driver could mark
> > > > > the connection with a mount ID we could do filtering on? From a
> > > > > quick Google, maybe [1] could be used for this purpose. I wonder
> > > > > however if the kernel driver would have to do that marking of the
> > > > > connection... and then we have iptables dependencies in the driver
> > > > > again which we don't want to do.
> > > >
> > > > As I said yesterday, I think it should be doable with no kernel
> > > > changes -- either with IP aliases or with the help of some virtual
> > > > interface. Exactly how, I'm not sure because I use VMs for my tests
> > > > and haven't had to touch iptables in a while, but I would be surprised
> > > > to learn otherwise given the myriad of options out there.
> > >
> > > ...and really, doing this sort of testing with the kernel client outside
> > > of a vm is sort of a mess anyway, IMO.
> >
> > Testing often involves making a mess :) I disagree in principle that
> > having a mechanism for stopping a netfs mount without pulling the plug
> > (virtually or otherwise) is unnecessary.
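
To make the tc/iptables helper idea a bit more concrete, here is roughly
what I have in mind -- an untested sketch, and the interface name, MDS
address and port below are just placeholders:

  # inject latency and packet loss on the interface the client uses
  tc qdisc add dev eth0 root netem delay 200ms loss 1%

  # cut the sessions to one MDS (note: affects every mount on this host)
  iptables -A OUTPUT -p tcp -d 192.168.1.100 --dport 6800 -j DROP

  # restore
  iptables -D OUTPUT -p tcp -d 192.168.1.100 --dport 6800 -j DROP
  tc qdisc del dev eth0 root netem

Since all of this lives outside the client, the rules can also be removed
at any point to see how recovery behaves.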
>
> Ok, here are some more concerns:
>
> I'm not clear on what value this new mount option really adds. Once you
> do this, the client is hosed, so this is really only useful for testing
> the MDS. If your goal is to test the MDS with dying clients, then why
> not use a synthetic userland client to take state and do whatever you
> want?
>
> It could be I'm missing some value in using a kclient for this. If you
> did want to do this after all, then why are you keeping the mount around
> at all? It's useless after the remount, so you might as well just umount
> it.
>
> If you really want to make it just shut down the sockets, then you could
> add a new flag to umount2/sys_umount (UMOUNT_KILL or something) that
> would kill off the mount w/o talking to the MDS. That seems like a much
> cleaner interface than doing this.
>
> > > That said, I think we might need a way to match up a superblock with the
> > > sockets associated with it -- so mon, osd and mds socket info,
> > > basically. That could be a very simple thing in debugfs though, in the
> > > existing directory hierarchy there. With that info, you could reasonably
> > > do something with iptables like we're suggesting.
> >
> > That's certainly useful information to expose but I don't see how that
> > would help with constructing iptables rules. The kernel may reconnect
> > to any Ceph service at any time, especially during potential network
> > disruption (like an iptables rule dropping packets). Any rules you
> > construct for those connections would no longer apply. You cannot
> > construct rules that broadly apply to e.g. the entire ceph cluster as
> > a destination because it would interfere with other kernel client
> > mounts. I believe this is why Ilya is suggesting the use of virtual IP
> > addresses as a unique source address for each mount.
>
> Sorry, braino -- sunrpc clients keep their source ports in most cases
> (for legacy reasons). I don't think libceph msgr does though. You're
> right that a debugfs info file won't really help.
>
> You could roll some sort of deep packet inspection to discern this but
> that's more difficult. I wonder if you could do it with BPF these days
> though...
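
On the question of tying rules to a particular mount: I don't think deep
packet inspection should be necessary. If I remember correctly, libceph
still accepts an ip= option to pick the local address it binds to, so an
IP alias per mount would be enough to key the rules on the source
address. Roughly (untested; the names and addresses below are made up,
auth options omitted):

  # one alias per mount
  ip addr add 192.168.1.201/24 dev eth0
  mount -t ceph 192.168.1.10:6789:/ /mnt/test -o name=admin,ip=192.168.1.201

  # rules matching that source address then affect only this mount
  iptables -A OUTPUT -p tcp -s 192.168.1.201 -j DROP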
>
> > > > > From my perspective, this halt patch looks pretty simple and doesn't
> > > > > appear to be a huge maintenance burden. Is it really so objectionable?
> > > >
> > > > Well, this patch is simple only because it isn't even remotely
> > > > equivalent to a cable pull. I mean, it aborts in-flight requests
> > > > with EIO, closes sockets, etc. Has it been tested against the test
> > > > cases that currently cold reset the node through the BMC?
> >
> > Of course not, this is the initial work soliciting feedback on the concept.
>
> Yep. Don't get discouraged, I think we can do something to better
> accommodate testing, but I don't think this is the correct direction for
> it.
>
> > > > If it has been tested and the current semantics are sufficient,
> > > > are you sure they will remain so in the future? What happens when
> > > > a new test gets added that needs a harder shutdown? We won't be
> > > > able to reuse existing "umount -f" infrastructure anymore... What
> > > > if a new test needs to _actually_ kill the client?
> > > >
> > > > And then a debugging knob that permanently wedges the client sure
> > > > can't be a mount option for all the obvious reasons. This bit is easy
> > > > to fix, but the fact that it is submitted as a mount option makes me
> > > > suspect that the whole thing hasn't been thought through very well.
> >
> > Or, Xiubo needs advice on a better way to do it. In the tracker ticket
> > I suggested a sysfs control file. Would that be appropriate?
>
> I'm not a fan of adding fault injection code to the client. I'd prefer
> doing this via some other mechanism. If you really do want something
> like this in the kernel, then you may want to consider something like
> BPF.
>
> > > Agreed on all points. This sort of fault injection is really best done
> > > via other means. Otherwise, it's really hard to know whether it'll
> > > behave the way you expect in other situations.
> > >
> > > I'll add too that I think experience shows that these sorts of
> > > interfaces end up bitrotted because they're too specialized to use
> > > outside of anything but very specific environments. We need to think
> > > larger than just teuthology's needs here.
> >
> > I doubt they'd become bitrotted with regular use in teuthology.
>
> Well, certainly some uses of them might not, but interfaces like this
> need to be generically useful across a range of environments. I'm not
> terribly interested in plumbing something in that is _only_ used for
> teuthology, as important as that use-case is.
>
> > I get that you both see that VMs or virtual interfaces would obviate this
> > PR. VMs are not an option in teuthology. We can try to spend some time
> > on seeing if something like a bridged virtual network will work. Will
> > the kernel driver operate in the network namespace of the container
> > that mounts the volume?
>
> That, I'm not sure about. I'm not sure if the sockets end up inheriting
> the net namespace of the mounting process. It'd be good to investigate
> this. You may be able to just get crafty with the unshare command to
> test it out.

It will -- I added that several years ago when docker started gaining
popularity. This is why I keep mentioning virtual interfaces. One thing
that will most likely work without any hiccups is a veth pair with one
interface in the namespace and one in the host, plus a simple iptables
masquerading rule to NAT between the veth network and the world. For
cutting all sessions, you won't even need to touch iptables any further:
just down either end of the veth pair.

Doing it from the container would obviously work too, but further
iptables manipulation might be trickier because of more parts involved:
additional interfaces, a bridge, iptables rules installed by the
container runtime, etc.

Thanks,

                Ilya
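
P.S. In case it helps, the veth setup I'm describing would be roughly
this -- an untested sketch, with the namespace, interface names and
addresses all being placeholders:

  # veth pair: one end in a test namespace, one end in the host
  ip netns add ceph-test
  ip link add veth-host type veth peer name veth-ns
  ip link set veth-ns netns ceph-test
  ip addr add 10.90.90.1/24 dev veth-host
  ip link set veth-host up
  ip netns exec ceph-test ip addr add 10.90.90.2/24 dev veth-ns
  ip netns exec ceph-test ip link set veth-ns up
  ip netns exec ceph-test ip route add default via 10.90.90.1

  # masquerade between the veth network and the world
  sysctl -w net.ipv4.ip_forward=1
  iptables -t nat -A POSTROUTING -s 10.90.90.0/24 -o eth0 -j MASQUERADE

  # mount from inside the namespace; cutting all of its sessions is then just
  ip link set veth-host down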