Re: [RFC v4 2/2] file, epoll: Implement do_replace() and eventpoll_replace()

Tycho Andersen <tycho@tycho.pizza> · Fri, 24 Mar 2023 07:43:13 -0600

On Fri, Mar 24, 2023 at 09:23:44AM +0100, Christian Brauner wrote:
> On Thu, Mar 23, 2023 at 10:23:56PM +0000, Alok Tiagi wrote:
> > On Mon, Mar 20, 2023 at 03:51:03PM +0100, Christian Brauner wrote:
> > > On Sat, Mar 18, 2023 at 06:02:48AM +0000, aloktiagi wrote:
> > > > Introduce a mechanism to replace a file linked in the epoll interface or a file
> > > > that has been dup'ed by a file received via the replace_fd() interface.
> > > > 
> > > > eventpoll_replace() is called from do_replace() and finds all instances of the
> > > > file to be replaced and replaces them with the new file.
> > > > 
> > > > do_replace() also replaces the file in the file descriptor table for all fd
> > > > numbers referencing it with the new file.
> > > > 
> > > > We have a use case where multiple IPv6 only network namespaces can use a single
> > > > IPv4 network namespace for IPv4 only egress connectivity by switching their
> > > > sockets from IPv6 to IPv4 network namespace. This allows for migration of
> > > > systems to IPv6 only while keeping their connectivity to IPv4 only destinations
> > > > intact.
> > > > 
> > > > Today, we achieve this by setting up seccomp filter to intercept network system
> > > > calls like connect() from a container in a container manager which runs in an
> > > > IPv4 only network namespace. The container manager creates a new IPv4 connection
> > > > and injects the new file descriptor through SECCOMP_NOTIFY_IOCTL_ADDFD replacing
> > > > the original file descriptor from the connect() call. This does not work for
> > > > cases where the original file descriptor is handed off to a system like epoll
> > > > before the connect() call. After a new file descriptor is injected the original
> > > > file descriptor being referenced by the epoll fd is not longer valid leading to
> > > > failures. As a workaround the container manager when intercepting connect()
> > > > loops through all open socket file descriptors to check if they are referencing
> > > > the socket attempting the connect() and replace the reference with the to be
> > > > injected file descriptor. This workaround is cumbersome and makes the solution
> > > > prone to similar yet to be discovered issues.
> > > > 
> > > > The above change will enable us remove the workaround in the container manager
> > > > and let the kernel handle the replacement correctly.
> > > > 
> > > > Signed-off-by: aloktiagi <aloktiagi@xxxxxxxxx>
> > > > ---
> > > >  fs/eventpoll.c                                | 38 ++++++++
> > > >  fs/file.c                                     | 54 +++++++++++
> > > >  include/linux/eventpoll.h                     | 18 ++++
> > > >  include/linux/file.h                          |  1 +
> > > >  tools/testing/selftests/seccomp/seccomp_bpf.c | 97 +++++++++++++++++++
> > > >  5 files changed, 208 insertions(+)
> > > > 
> > > > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > > > index 64659b110973..958ad995fd45 100644
> > > > --- a/fs/eventpoll.c
> > > > +++ b/fs/eventpoll.c
> > > > @@ -935,6 +935,44 @@ void eventpoll_release_file(struct file *file)
> > > >  	mutex_unlock(&epmutex);
> > > >  }
> > > >  
> > > > +static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
> > > > +			struct file *tfile, int fd, int full_check);
> > > > +
> > > > +/*
> > > > + * This is called from eventpoll_replace() to replace a linked file in the epoll
> > > > + * interface with a new file received from another process. This is useful in
> > > > + * cases where a process is trying to install a new file for an existing one
> > > > + * that is linked in the epoll interface
> > > > + */
> > > > +void eventpoll_replace_file(struct file *toreplace, struct file *file)
> > > > +{
> > > > +	int fd;
> > > > +	struct eventpoll *ep;
> > > > +	struct epitem *epi;
> > > > +	struct hlist_node *next;
> > > > +	struct epoll_event event;
> > > > +
> > > > +	if (!file_can_poll(file))
> > > > +		return;
> > > > +
> > > > +	mutex_lock(&epmutex);
> > > > +	if (unlikely(!toreplace->f_ep)) {
> > > > +		mutex_unlock(&epmutex);
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	hlist_for_each_entry_safe(epi, next, toreplace->f_ep, fllink) {
> > > > +		ep = epi->ep;
> > > > +		mutex_lock(&ep->mtx);
> > > > +		fd = epi->ffd.fd;
> > > > +		event = epi->event;
> > > > +		ep_remove(ep, epi);
> > > > +		ep_insert(ep, &event, file, fd, 1);
> > > 
> > > So, ep_remove() can't fail but ep_insert() can. Maybe that doesn't
> > > matter...
> > > 
> > > > +		mutex_unlock(&ep->mtx);
> > > > +	}
> > > > +	mutex_unlock(&epmutex);
> > > > +}
> > > 
> > > Andrew carries a patchset that may impact the locking here:
> > > 
> > > https://lore.kernel.org/linux-fsdevel/323de732635cc3513c1837c6cbb98f012174f994.1678312201.git.pabeni@xxxxxxxxxx
> > > 
> > > I have to say that this whole thing has a very unpleasant taste to it.
> > > Replacing a single fd from seccomp is fine, wading through the fdtable
> > > to replace all fds referencing the same file is pretty nasty. Especially
> > > with the global epoll mutex involved in all of this.
> > > 
> > > And what limits this to epoll. I'm seriously asking aren't there
> > > potentially issues for fds somehow referenced in io_uring instances as
> > > well?
> > > 
> > > I'm not convinced this belongs here yet...
> > 
> > Thank you for reviewing and proving a link to Andrew's patch.
> > 
> > I think just replacing a single fd from seccomp leaves this feature in an
> > incomplete state. As a user of this feature, it means I need to figure out what
> > all file descriptors are referencing this file eg. epoll, dup'ed fds etc. This
> > patch is an attempt to complete this seccomp feature and also move the logic of
> > figuring out the references to the kernel.
> 
> I'm still not convinced.
> 
> You're changing the semantics of the replace file feature in seccomp
> drastically. Whereas now it means replace the file a single fd refers to
> you're now letting it replace multiple fds.

Yes; the crux of the patch is really the epoll part, not the multiple
replace part. IMO, we should drop the multiple replace bit, as I agree
it is a pretty big change.

The change in semantics w.r.t. epoll() (and eventually others),
though, is important. The way it currently works is not really
helpful.

Perhaps we could add a flag that people could set from SECCOMP_ADDFD
asking for this extra behavior?

> > 
> > The epmutex is taken when the file is replaced in the epoll interface. This is
> > similar to what would happen when eventpoll_release would be called for this
> > same file when it is ultimately released from __fput().
> > 
> > This is indeed not limited to epoll and the file descriptor table, but this
> > current patch addresses is limited to these interfaces. We can create a separate
> > one for io_uring.
> 
> The criteria for when it's sensible to update an fd to refer to the new
> file and when not are murky here and tailored to this very specific
> use-case. We shouldn't be involved in decisions like that. Userspace is
> in a much better position to know when it's sensible to replace an fd.
> 
> The fdtable is no place to get involved in a game of "if the fd is in a
> epoll instance, update the epoll instance, if it's in an io_uring
> instance, update the io_uring instance, ...". That's a complete
> layerying violation imho.
> 
> And even if you'd need to get sign off on this from epoll and io_uring
> folks as well before we can just reach into other subsytems from the
> fdtable.

Yep, agreed.

> I'm sorry but this all sounds messy. You can do this just fine in
> userspace, so please do it in userspace. This all sound very NAKable
> right now.

We have added lots of APIs for things that are possible from userspace
already that are made easier with a nice API. The seccomp forwarding
functionality itself, pidfd_getfd(), etc. I don't see this particular
bit as a strong argument against.

Tycho