Re: [PATCH] ceph: add halt mount option support


 



On 2020/2/20 11:43, Ilya Dryomov wrote:
On Thu, Feb 20, 2020 at 12:49 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
On Wed, 2020-02-19 at 14:49 -0800, Patrick Donnelly wrote:
Responding to you and Ilya both:

On Wed, Feb 19, 2020 at 1:21 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
On Wed, 2020-02-19 at 21:42 +0100, Ilya Dryomov wrote:
On Wed, Feb 19, 2020 at 8:22 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
On Tue, Feb 18, 2020 at 6:59 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
Yeah, I've mostly done this using DROP rules when I needed to test things.
But, I think I was probably just guilty of speculating out loud here.
I'm not sure what exactly Xiubo meant by "fulfilling" iptables rules
in libceph, but I will say that any kind of iptables manipulation from
within libceph is probably out of the question.
I think we're getting confused about two thoughts on iptables: (1) to
use iptables to effectively partition the mount instead of this new
halt option; (2) use iptables in concert with halt to prevent FIN
packets from being sent when the sockets are closed. I think we all
agree (2) is not going to happen.
Right.

I think doing this by just closing down the sockets is probably fine. I
wouldn't pursue anything relating to iptables here, unless we have
some larger reason to go that route.
IMO investing in a set of iptables and tc helpers for teuthology
makes a _lot_ of sense.  It isn't exactly the same as a cable pull,
but it's probably the next best thing.  First, it will be external to
the system under test.  Second, it can be made selective -- you can
cut a single session or all of them, simulate packet loss and latency
issues, etc.  Third, it can be used for recovery and failover/fencing
testing -- what happens when these packets get delivered two minutes
later?  None of this is possible with something that just attempts to
wedge the mount and acts as a point of no return.
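
(For concreteness, a rough sketch of what such helpers might look
like -- hypothetical names, Python shelling out to tc/netem and
iptables on the test node, with the default mon port 6789 used purely
as an example:)

import subprocess

def sh(cmd):
    # run a command on the test node; fail loudly if it doesn't apply
    subprocess.run(cmd, shell=True, check=True)

def cut_session(dst_ip, dst_port=6789):
    # drop traffic to a single Ceph daemon (one mon/MDS/OSD),
    # external to the client under test
    sh(f"iptables -A OUTPUT -p tcp -d {dst_ip} --dport {dst_port} -j DROP")

def restore_session(dst_ip, dst_port=6789):
    sh(f"iptables -D OUTPUT -p tcp -d {dst_ip} --dport {dst_port} -j DROP")

def degrade_link(dev="eth0", delay="2000ms", loss="10%"):
    # simulate latency and packet loss with netem
    sh(f"tc qdisc add dev {dev} root netem delay {delay} loss {loss}")

def restore_link(dev="eth0"):
    sh(f"tc qdisc del dev {dev} root netem")

Because the rules live outside the client, the "what happens when
these packets get delivered two minutes later" case would be a large
netem delay rather than a DROP rule.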
This sounds attractive, but doesn't it require each mount to have its
own IP address? Or are there other options? Maybe the kernel driver
could mark the connection with a mount ID so we could do filtering on
it? From a quick Google, maybe [1] could be used for this purpose. I
wonder, however, whether the kernel driver would have to do that
marking of the connection itself... and then we have iptables
dependencies in the driver again, which we don't want.
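
(For what it's worth, the userspace analogue of that marking idea is
SO_MARK plus an iptables mark match -- a minimal sketch with a made-up
mark value; the catch, as noted, is that the kernel client itself
would have to set the equivalent mark on its messenger sockets:)

import socket
import subprocess

SO_MARK = getattr(socket, "SO_MARK", 36)   # 36 on Linux
MOUNT_MARK = 0x2a                          # made-up per-mount ID

# Userspace analogue only: tag a socket with SO_MARK (needs
# CAP_NET_ADMIN); the kernel client would have to do the equivalent
# on its own sockets.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, SO_MARK, MOUNT_MARK)

# A single rule then cuts everything carrying that mark, no matter
# which daemon the client reconnects to:
subprocess.run(["iptables", "-A", "OUTPUT", "-m", "mark",
                "--mark", hex(MOUNT_MARK), "-j", "DROP"], check=True)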
As I said yesterday, I think it should be doable with no kernel
changes -- either with IP aliases or with the help of some virtual
interface.  Exactly how, I'm not sure because I use VMs for my tests
and haven't had to touch iptables in a while, but I would be surprised
to learn otherwise given the myriad of options out there.

...and really, doing this sort of testing with the kernel client outside
of a vm is sort of a mess anyway, IMO.
Testing often involves making a mess :) And I disagree, in principle,
with the idea that a mechanism for stopping a netfs mount without
pulling the plug (virtually or otherwise) is unnecessary.

Ok, here are some more concerns:

I'm not clear on what value this new mount option really adds. Once you
do this, the client is hosed, so this is really only useful for testing
the MDS. If your goal is to test the MDS with dying clients, then why
not use a synthetic userland client to take state and do whatever you
want?

It could be I'm missing some value in using a kclient for this. If you
did want to do this after all, then why are you keeping the mount around
at all? It's useless after the remount, so you might as well just umount
it.

If you really want to make it just shut down the sockets, then you could
add a new flag to umount2/sys_umount (UMOUNT_KILL or something) that
would kill off the mount w/o talking to the MDS. That seems like a much
cleaner interface than doing this.
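
(Roughly, from userspace it would sit next to the flags that already
exist -- a sketch only; UMOUNT_KILL and its value are entirely made up
here:)

import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

MNT_FORCE   = 0x1   # existing: abort in-flight requests ("umount -f")
MNT_DETACH  = 0x2   # existing: lazy unmount
UMOUNT_KILL = 0x10  # hypothetical flag from this thread, not in any kernel

def umount2(path, flags):
    if libc.umount2(path.encode(), flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"umount2({path}, {flags:#x}) failed")

# today:    umount2("/mnt/ceph", MNT_FORCE)
# proposed: umount2("/mnt/ceph", UMOUNT_KILL)  # tear down sockets, skip the MDS goodbye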

That said, I think we might need a way to match up a superblock with the
sockets associated with it -- so mon, osd and mds socket info,
basically. That could be a very simple thing in debugfs though, in the
existing directory hierarchy there. With that info, you could reasonably
do something with iptables like we're suggesting.
That's certainly useful information to expose but I don't see how that
would help with constructing iptables rules. The kernel may reconnect
to any Ceph service at any time, especially during potential network
disruption (like an iptables rule dropping packets). Any rules you
construct for those connections would no longer apply. You cannot
construct rules that broadly apply to e.g. the entire ceph cluster as
a destination because it would interfere with other kernel client
mounts. I believe this is why Ilya is suggesting the use of virtual IP
addresses as a unique source address for each mount.

Sorry, braino -- sunrpc clients keep their source ports in most cases
(for legacy reasons). I don't think libceph msgr does though. You're
right that a debugfs info file won't really help.

You could roll some sort of deep packet inspection to discern this but
that's more difficult. I wonder if you could do it with BPF these days
though...

From my perspective, this halt patch looks pretty simple and doesn't
appear to be a huge maintenance burden. Is it really so objectionable?
Well, this patch is simple only because it isn't even remotely
equivalent to a cable pull.  I mean, it aborts in-flight requests
with EIO, closes sockets, etc.  Has it been tested against the test
cases that currently cold reset the node through the BMC?
Of course not, this is the initial work soliciting feedback on the concept.

Yep. Don't get discouraged, I think we can do something to better
accommodate testing, but I don't think this is the correct direction for
it.

If it has been tested and the current semantics are sufficient,
are you sure they will remain so in the future?  What happens when
a new test gets added that needs a harder shutdown?  We won't be
able to reuse existing "umount -f" infrastructure anymore...  What
if a new test needs to _actually_ kill the client?

And then a debugging knob that permanently wedges the client sure
can't be a mount option for all the obvious reasons.  This bit is easy
to fix, but the fact that it is submitted as a mount option makes me
suspect that the whole thing hasn't been thought through very well.
Or, Xiubo needs advice on a better way to do it. In the tracker ticket
I suggested a sysfs control file. Would that be appropriate?

I'm not a fan of adding fault injection code to the client. I'd prefer
doing this via some other mechanism. If you really do want something
like this in the kernel, then you may want to consider something like
BPF.

Agreed on all points. This sort of fault injection is really best done
via other means. Otherwise, it's really hard to know whether it'll
behave the way you expect in other situations.

I'll add too that I think experience shows that these sorts of
interfaces end up bitrotted because they're too specialized to use
outside of very specific environments. We need to think
larger than just teuthology's needs here.
I doubt they'd become bitrotted with regular use in teuthology.

Well, certainly some uses of them might not, but interfaces like this
need to be generically useful across a range of environments. I'm not
terribly interested in plumbing something in that is _only_ used for
teuthology, even as important as that use-case is.

I get that you both think VMs or virtual interfaces would obviate this
PR. VMs are not an option in teuthology. We can spend some time seeing
whether something like a bridged virtual network will work. Will
the kernel driver operate in the network namespace of the container
that mounts the volume?

That, I'm not sure about. I'm not sure if the sockets end up inheriting
the net namespace of the mounting process. It'd be good to investigate
this. You may be able to just get crafty with the unshare command to
test it out.
It will -- I added that several years ago when docker started gaining
popularity.

This is why I keep mentioning virtual interfaces.  One thing that
will most likely work without any hiccups is a veth pair with one
interface in the namespace and one in the host plus a simple iptables
masquerading rule to NAT between the veth network and the world.
For cutting all sessions, you won't even need to touch iptables any
further: just down either end of the veth pair.
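
(Untested, but roughly along these lines -- names, addresses, and the
MON_ADDR/SECRET placeholders are all made up:)

import subprocess

def sh(cmd):
    subprocess.run(cmd, shell=True, check=True)

# veth pair: one end in the namespace, one in the host
sh("ip netns add cephtest")
sh("ip link add veth-host type veth peer name veth-ns")
sh("ip link set veth-ns netns cephtest")
sh("ip addr add 10.99.0.1/24 dev veth-host")
sh("ip link set veth-host up")
sh("ip netns exec cephtest ip addr add 10.99.0.2/24 dev veth-ns")
sh("ip netns exec cephtest ip link set veth-ns up")
sh("ip netns exec cephtest ip link set lo up")
sh("ip netns exec cephtest ip route add default via 10.99.0.1")

# simple masquerading rule to NAT between the veth network and the world
sh("sysctl -w net.ipv4.ip_forward=1")
sh("iptables -t nat -A POSTROUTING -s 10.99.0.0/24 -j MASQUERADE")

# mount inside the namespace so the client's sockets live there
sh("ip netns exec cephtest mount -t ceph MON_ADDR:/ /mnt/ceph "
   "-o name=admin,secret=SECRET")

# "cable pull": down either end of the veth pair
sh("ip link set veth-host down")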

Doing it from the container would obviously work too, but further
iptables manipulation might be trickier because of more parts involved:
additional interfaces, bridge, iptables rules installed by the
container runtime, etc.

Hi Ilya, Jeff, Patrick

Thanks for your advice and the great ideas here.

I have started with ceph-fuse, and the patch is ready; please see https://github.com/ceph/ceph/pull/33576.

Thanks
BRs
Xiubo


Thanks,

                 Ilya




