Re: [PATCH] ceph: add halt mount option support

On Wed, 2020-02-19 at 14:49 -0800, Patrick Donnelly wrote:
> Responding to you and Ilya both:
> 
> On Wed, Feb 19, 2020 at 1:21 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > On Wed, 2020-02-19 at 21:42 +0100, Ilya Dryomov wrote:
> > > On Wed, Feb 19, 2020 at 8:22 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> > > > On Tue, Feb 18, 2020 at 6:59 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > > > > > Yeah, I've mostly done this using DROP rules when I needed to test things.
> > > > > > But, I think I was probably just guilty of speculating out loud here.
> > > > > 
> > > > > I'm not sure what exactly Xiubo meant by "fulfilling" iptables rules
> > > > > in libceph, but I will say that any kind of iptables manipulation from
> > > > > within libceph is probably out of the question.
> > > > 
> > > > I think we're getting confused about two thoughts on iptables: (1)
> > > > using iptables to effectively partition the mount instead of this new
> > > > halt option; (2) using iptables in concert with halt to prevent FIN
> > > > packets from being sent when the sockets are closed. I think we all
> > > > agree (2) is not going to happen.
> > > 
> > > Right.
> > > 
> > > > > > I think doing this by just closing down the sockets is probably fine. I
> > > > > > wouldn't pursue anything relating to iptables here, unless we have
> > > > > > some larger reason to go that route.
> > > > > 
> > > > > IMO investing into a set of iptables and tc helpers for teuthology
> > > > > makes a _lot_ of sense.  It isn't exactly the same as a cable pull,
> > > > > but it's probably the next best thing.  First, it will be external to
> > > > > the system under test.  Second, it can be made selective -- you can
> > > > > cut a single session or all of them, simulate packet loss and latency
> > > > > issues, etc.  Third, it can be used for recovery and failover/fencing
> > > > > testing -- what happens when these packets get delivered two minutes
> > > > > later?  None of this is possible with something that just attempts to
> > > > > wedge the mount and acts as a point of no return.
> > > > 
> > > > This sounds attractive, but doesn't it require each mount to have
> > > > its own IP address? Or are there other options? Maybe the kernel
> > > > driver could mark the connection with a mount ID that we could
> > > > filter on? From a quick Google, maybe [1] could be used for this
> > > > purpose. I wonder, however, whether the kernel driver would have to
> > > > do that marking of the connection... and then we have iptables
> > > > dependencies in the driver again, which we don't want.
> > > 
> > > As I said yesterday, I think it should be doable with no kernel
> > > changes -- either with IP aliases or with the help of some virtual
> > > interface.  Exactly how, I'm not sure because I use VMs for my tests
> > > and haven't had to touch iptables in a while, but I would be surprised
> > > to learn otherwise given the myriad of options out there.
> > > 
> > 
> > ...and really, doing this sort of testing with the kernel client outside
> > of a vm is sort of a mess anyway, IMO.
> 
> Testing often involves making a mess :) I disagree in principle that
> having a mechanism for stopping a netfs mount without pulling the plug
> (virtually or otherwise) is unnecessary.
> 

Ok, here are some more concerns:

I'm not clear on what value this new mount option really adds. Once you
halt the mount, the client is hosed, so this is really only useful for
testing the MDS. If your goal is to test the MDS with dying clients,
then why not use a synthetic userland client to take state and do
whatever you want?
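
To be concrete -- and this is just a rough, untested sketch against
libcephfs, with a placeholder client id and path -- something like the
following would mount, grab some MDS state by opening a file, and then
die without any session teardown:

/*
 * Rough sketch only: a synthetic client that takes state and dies.
 * Build with something like: cc -o deadclient deadclient.c -lcephfs
 * The client id and file path are placeholders.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cephfs/libcephfs.h>

int main(void)
{
	struct ceph_mount_info *cmount;
	int fd;

	if (ceph_create(&cmount, "admin") < 0) {
		fprintf(stderr, "ceph_create failed\n");
		return 1;
	}
	/* NULL means "search the default ceph.conf locations" */
	ceph_conf_read_file(cmount, NULL);

	if (ceph_mount(cmount, "/") < 0) {
		fprintf(stderr, "ceph_mount failed\n");
		return 1;
	}

	/* Take some state: open a file for write so we hold caps on it. */
	fd = ceph_open(cmount, "/deadclient-testfile", O_CREAT | O_WRONLY, 0644);
	if (fd < 0) {
		fprintf(stderr, "ceph_open failed\n");
		return 1;
	}

	/* Die abruptly -- no close, no unmount, no clean session teardown. */
	_exit(0);
}

From there you can make the client misbehave in whatever way a given
test needs.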

It could be that I'm missing some value in using a kclient for this, but
if you did want to do this after all, then why keep the mount around at
all? It's useless after the remount, so you might as well just umount
it.

If you really want to make it just shut down the sockets, then you could
add a new flag to umount2/sys_umount (UMOUNT_KILL or something) that
would kill off the mount w/o talking to the MDS. That seems like a much
cleaner interface than doing this.
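
From userland that might look something like this. To be clear,
UMOUNT_KILL is entirely hypothetical -- the value below is made up and
a current kernel would just hand you back EINVAL:

#define _GNU_SOURCE
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/mount.h>

/* Hypothetical flag -- does not exist in any kernel today. */
#ifndef UMOUNT_KILL
#define UMOUNT_KILL	0x00000010
#endif

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
		return 1;
	}

	/* Tear down the mount without talking to the MDS at all. */
	if (umount2(argv[1], UMOUNT_KILL) < 0) {
		fprintf(stderr, "umount2: %s\n", strerror(errno));
		return 1;
	}
	return 0;
}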

> > That said, I think we might need a way to match up a superblock with the
> > sockets associated with it -- so mon, osd and mds socket info,
> > basically. That could be a very simple thing in debugfs though, in the
> > existing directory hierarchy there. With that info, you could reasonably
> > do something with iptables like we're suggesting.
> 
> That's certainly useful information to expose but I don't see how that
> would help with constructing iptables rules. The kernel may reconnect
> to any Ceph service at any time, especially during potential network
> disruption (like an iptables rule dropping packets). Any rules you
> construct for those connections would no longer apply. You cannot
> construct rules that broadly apply to e.g. the entire ceph cluster as
> a destination because it would interfere with other kernel client
> mounts. I believe this is why Ilya is suggesting the use of virtual ip
> addresses as a unique source address for each mount.
> 

Sorry, braino -- sunrpc clients keep their source ports in most cases
(for legacy reasons). I don't think libceph msgr does though. You're
right that a debugfs info file won't really help.

You could roll some sort of deep packet inspection to discern which
mount a connection belongs to, but that's more difficult. I wonder if
you could do it with BPF these days though...

> > > > From my perspective, this halt patch looks pretty simple and doesn't
> > > > appear to be a huge maintenance burden. Is it really so objectionable?
> > > 
> > > Well, this patch is simple only because it isn't even remotely
> > > equivalent to a cable pull.  I mean, it aborts in-flight requests
> > > with EIO, closes sockets, etc.  Has it been tested against the test
> > > cases that currently cold reset the node through the BMC?
> 
> Of course not, this is the initial work soliciting feedback on the concept.
> 

Yep. Don't get discouraged; I think we can do something to better
accommodate testing, but I don't think this is the correct direction for
it.

> > > If it has been tested and the current semantics are sufficient,
> > > are you sure they will remain so in the future?  What happens when
> > > a new test gets added that needs a harder shutdown?  We won't be
> > > able to reuse existing "umount -f" infrastructure anymore...  What
> > > if a new test needs to _actually_ kill the client?
> > > 
> > > And then a debugging knob that permanently wedges the client sure
> > > can't be a mount option for all the obvious reasons.  This bit is easy
> > > to fix, but the fact that it is submitted as a mount option makes me
> > > suspect that the whole thing hasn't been thought through very well.
> 
> Or, Xiubo needs advice on a better way to do it. In the tracker ticket
> I suggested a sysfs control file. Would that be appropriate?
> 

I'm not a fan of adding fault injection code to the client. I'd prefer
doing this via some other mechanism. If you really do want something
like this in the kernel, then you may want to consider something like
BPF.

> > Agreed on all points. This sort of fault injection is really best done
> > via other means. Otherwise, it's really hard to know whether it'll
> > behave the way you expect in other situations.
> > 
> > I'll add too that I think experience shows that these sorts of
> > interfaces end up bitrotted because they're too specialized to be
> > useful outside of very specific environments. We need to think
> > larger than just teuthology's needs here.
> 
> I doubt they'd become bitrotted with regular use in teuthology.
> 

Well, certainly some uses of them might not, but interfaces like this
need to be generically useful across a range of environments. I'm not
terribly interested in plumbing in something that is _only_ used for
teuthology, as important as that use-case is.

> I get that you both think VMs or virtual interfaces would obviate this
> PR. VMs are not an option in teuthology. We can try to spend some time
> seeing whether something like a bridged virtual network will work. Will
> the kernel driver operate in the network namespace of the container
> that mounts the volume?
> 

That, I'm not sure about -- I don't know whether the sockets end up
inheriting the net namespace of the mounting process. It'd be good to
investigate. You may be able to just get crafty with the unshare command
to test it out; something like the sketch below might do.
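
Something like this (untested, and basically just what "unshare -n sh"
gives you) would drop you into a shell in a fresh netns; from there you
can set up networking, do the mount by hand, and then see which
namespace's /proc/net/tcp the ceph sockets show up in:

/* Needs root (CAP_SYS_ADMIN). Roughly equivalent to "unshare -n sh". */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	if (unshare(CLONE_NEWNET) < 0) {
		perror("unshare");
		return 1;
	}

	/*
	 * From this shell: bring up networking (veth/bridge or whatever),
	 * mount the cephfs volume, and then compare /proc/net/tcp here
	 * vs. in the init netns to see where the libceph sockets landed.
	 */
	execlp("sh", "sh", (char *)NULL);
	perror("execlp");
	return 1;
}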

In any case, I'd rather see work on adding support for namespace
awareness to enable stuff like this than adding fault injection knobs. I
think that's something that could be generically useful.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>



