Re: [PATCH v3 0/3] introduce a uid/gid shifting bind mount

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Wed, 19 Feb 2020 14:26:45 -0800

On Wed, 2020-02-19 at 14:26 +0100, Christian Brauner wrote:
> On Tue, Feb 18, 2020 at 03:43:18PM -0800, James Bottomley wrote:
> > On Tue, 2020-02-18 at 21:03 +0100, Christian Brauner wrote:
> > > On Tue, Feb 18, 2020 at 11:05:48AM -0800, James Bottomley wrote:
> > > > On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote:
> > 
> > [...]
> > > > > But way more important: what Amir got right is that your
> > > > > approach and fsid mappings don't stand in each others way at
> > > > > all. Shiftfed bind-mounts can be implemented completely
> > > > > independent of fsid mappings after the fact on top of it.
> > > > > 
> > > > > Your example, does this:
> > > > > 
> > > > > nsfd = open("/proc/567/ns/user", O_RDONLY);  /* note: not
> > > > > O_PATH
> > > > > */
> > > > > configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd);
> > > > > 
> > > > > as the ultimate step. Essentially marking a mountpoint as
> > > > > shifted relative to that user namespace. Once fsid mappings
> > > > > are in all that you need to do is replace your
> > > > > make_kuid()/from_kuid()/from_kuid_munged() calls and so on in
> > > > > your patchset with
> > > > > make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and you're
> > > > > done. So I honestly don't currently see any need to block the
> > > > > patchsets on each other. 
> > > > 
> > > > Can I repeat: there's no rush to get upstream on this.  Let's
> > > > pause to get the kernel implementation (the thing we have to
> > > > maintain) right.  I realise we could each work around the other
> > > > and get our implementations bent around each other so they all
> > > > work independently thus making our disjoint user cabals happy
> > > > but I don't think that would lead to the best outcome for
> > > > kernel maintainability.
> > > 
> > > We have had the discussion with all major stakeholders in a
> > > single room on what we need at LPC 2019.
> > 
> > Well, you didn't invite me, so I think "stakeholders" means people
> > we
> 
> I'm confused as you were in the room with everyone. It's even on the
> schedule under your name:
> https://linuxplumbersconf.org/event/4/contributions/474/
> 
> > selected because we like their use case.  More importantly:
> > "stakeholders" traditionally means not only people who want to
> > consume the feature but also people who have to maintain it ... how
> > many VFS stakeholders were present?
> 
> Again, I'm confused you were in the room with at least David and
> Eric.

OK, I think we both got different things out of this, but I've now put
the videos for this on the LPC site so others can judge for themselves.
 The one for this session is here:

https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=2h12m52s

Although it does make more sense if you watch the Dave Howells session
on the new mount API first:

https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=1h45m54s

There is lots of discussion of the shared image use case, but nowhere
is there any discussion of separating uid from fsuid in the user
namespace, unless I missed it again in my second viewing?

> In any case, the patches are on the relevant lists. They are actively
> being discussed and visible to everyone and we'll have time for
> proper review which is already happening.

The big issue, though, is that while you've produced a patch set that
covers your use case (shared unprivileged images) it doesn't cover mine
(privileged images).  However, the patch set I produced does cover both
use cases.  Safely handling privileged images is a much bigger security
problem than unprivileged ones, which is why I have more VFS code than
you do.  While I could, in theory, remove your use case from my patch,
doing so would really only reduce it by 38 lines (this is the diffstat
going from v2 to v3).  However, I still need all the code in the v2
patch to handle the backshifts needed for safe privileged images and I
pretty much don't use any of the fsid code you add because I need an
unprivileged fsid and uid, so the shifts are always identical for both
... which is the default before your patch.

This is what I mean about us getting the VFS implementation correct:  I
think I can sweep up s_user_ns into my patch (unproven because I've yet
to write the patch), effectively making it use the mnt_userns I
introduced and eliminating the superblock additions and, for 38
additional VFS lines, I also pick up your use case ... although there
may be subtleties I've missed.

I also think the way I coded it is much easier to use for an
orchestration system.  Basically you create a user_ns for the unpacker,
rendering it unprivileged, and it unpacks your images into the
filesystem so they also become unprivileged and globally shifted.  Then
whenever you use these images for a container with a separate uid
shift, you pass the unpacker userns with the "ns" argument and bind
mount the root into the container's userns: the container gets fake
root and any writes by fake root are shifted correctly into the
unprivileged image.  Everything just works and there's no need to
bother with fsid's.

However, what's still not completely done by any of us is convincing
the VFS community that we've got the correct and maintainable way of
adding all this to their VFS layer.

James