Re: [PATCH v2 RFC] userns: Convert xfs to use kuid/kgid where appropriate

Dwight Engen <dwight.engen@xxxxxxxxxx> · Wed, 26 Jun 2013 17:30:17 -0400

On Wed, 26 Jun 2013 12:09:24 +1000
Dave Chinner <david@xxxxxxxxxxxxx> wrote:

> On Mon, Jun 24, 2013 at 09:10:35AM -0400, Dwight Engen wrote:
> > Hi Dave, all. Here is a v2 patch that I believe addresses the
> > previous comments, but I expect there to be more :) I think there
> > are a few more issues to sort out before this is ready, and I want
> > to add some tests to xfstests also.
> > 
> > I added permission checks for eofblocks in the ioctl code, but
> > I don't think they are enough. Just because an unprivileged
> > caller is in a group doesn't mean he can write to a file of that
> > group, and I don't know how we can check for that till we get the
> > inode in hand. Brian, if you or anyone else could comment on how
> > this should work for the regular user and write() ENOSPC cases
> > that'd be great.
> > 
> > The xfs code now uses inode->i_uid where possible instead of di_uid.
> > The remaining uses of di_uid are where the inode is being setup,
> > conversion to/from disk endianess, in dealing with quotas, and
> > bulkstat.
> 
> Hi Dwight,
> 
> I haven't looked at the code in detail, I'll just mention that I
> agree with Brain that we need to split this up into more patches.
> The conversions are subtle and easy to misunderstand, so small
> patches are good.  I'd say at minimum separate patches are needed
> for:
> 
> 	- permissions check on eof blocks
> 	- splitting eof blocks structure (I made comments on that to
> 	  Brain's posted patch)
> 	- introduce XFS conversion helpers
> 	- implement VFS-XFS boundary conversion using helpers
> 	- XFS ioctl conversions (maybe multiple patches)
> 	- final patch to modify kconfig once everything is done

Sure, I'll split it up and integrate the comments from the other
(eof blocks) thread as well.

> > We do need to decide on the di_uid that comes back from bulkstat.
> > Right now it is returning on disk (== init_user_ns) uids. It looks
> > to me like xfsrestore is using the normal vfs routines (chown,
> > fchown, lchown) when restoring so that won't line up if the
> > xfsrestore is run in !init_user_ns.
> 
> Right. This is a major problem because there's no guarantee that you
> are running restore in the same namespace as dump was run in. If you
> take the dump in the init_user_ns, then you can't restore it from
> inside the namespace container, and vice versa.

Yep. In thinking just a bit about this, assuming we did convert the
ids that came back from bulkstat, how is this much different than today
where you take a backup on one machine and try to restore it on
another? It seems like today the onus is on the user to ensure the uids
will align correctly. AFAICS tar --numeric-owner would have the same
issue, and it looks like man xfsrestore is getting at a similar
thing in the Quotas section.

> > We could possibly convert to userns values
> > before returning them from the kernel, but I doubt that will work
> > well with the xfs quotas.
> 
> Well, quotas are already converted to/from userns specific values
> before they get to XFS. Including project quotas, which I think at
> this point is wrong. We have no test cases for it, though, so I
> can't validate that it actually works as it is supposed to and that
> we don't break it inadvertantly in the future.
> 
> [ I'm still waiting on Eric to provide any information or scripts
> for how he tested this all worked when he did the conversions.... ]
> 
> > Should we just require that callers of bulkstat
> > be in init_user_ns? Thoughts?
> 
> This is one of the reasons why I want Eric to give us some idea of
> how this is supposed to work - exactly how is backup and restore
> supposed to be managed on a shared filesystem that is segmented up
> into multiple namespace containers? We can talk about the
> implementation all we like, but none of us have a clue to the policy
> decisions that users will make that we need to support. Until we
> have a clear idea on what policies we are supposed to be supporting,
> the implementation will be ambiguous and compromised.
> 
> e.g. If users are responsible for it, then bulkstat needs to filter
> based on the current namespace. If management is responsible (i.e.
> init_user_ns does backup/restore of ns-specific subtrees), then
> bulkstat cannot filter and needs to reject calls from outside the
> init_user_ns().

Maybe we can have bulkstat always filter based on if the caller
kuid_has_mapping(current_user_ns(), inode->i_uid)? That way a caller
from init_user_ns can see them all, but callers from inside a userns
will get a subset of inodes returned?

This doesn't solve the save from one uesrns, restore from a different
one use case, not sure if that was the scenario you were getting at. To
allow for this use case I guess we could have an "id offset" argument
for xfsrestore that gets applied before chown() but that seems icky.

> But if we have to allow both configurations - which I think we have to
> because both cases are valid choices for a hosting provider to give
> users - then how are we supposed to implement/handle this?
> 
> The same goes for the file handle interfaces - it's perfectly valid
> for a user to run a userspace NFS server (e.g. ganesha) inside a
> namespace container, but that will allow that process to provide
> unchecked remote access to the entire underlying filesystem, not
> just the namespace being exported. i.e. fs/export.c needs to be made
> namespace aware in some way so that there is a kernel wide policy
> for handle to file translations in namespace containers....

I haven't looked at the handle stuff, I'll take a look and get
familiarized.

> ---
> 
> FWIW, one comment on the wrappers now that I've quickly browsed the
> code:
> 
> > @@ -68,14 +68,15 @@ xfs_acl_from_disk(
> >  
> >  		switch (acl_e->e_tag) {
> >  		case ACL_USER:
> > +			acl_e->e_uid =
> > xfs_kuid_from_disk(be32_to_cpu(ace->ae_id));
> > +			break;
> >  		case ACL_GROUP:
> > -			acl_e->e_id = be32_to_cpu(ace->ae_id);
> > +			acl_e->e_gid =
> > xfs_kgid_from_disk(be32_to_cpu(ace->ae_id));
> 
> I know I suggested it, but I have to say, that does look a little
> weird. Normally the to/from disk routines do endian conversion
> internally, so perhaps the conversion routines coul dbe better
> named.
> 
> These:
> 
> 	acl_e->e_uid = xfs_uid_to_kuid(be32_to_cpu(ace->ae_id));
> 	acl_e->e_gid = xfs_gid_to_kgid(be32_to_cpu(ace->ae_id));
> 
> I think read a whole lot better, and the endian conversion now makes
> sense as that's converting the on-disk value first...
> 
> > @@ -101,7 +102,18 @@ xfs_acl_to_disk(struct xfs_acl *aclp, const
> > struct posix_acl *acl) acl_e = &acl->a_entries[i];
> >  
> >  		ace->ae_tag = cpu_to_be32(acl_e->e_tag);
> > -		ace->ae_id = cpu_to_be32(acl_e->e_id);
> > +		switch(acl_e->e_tag) {
> > +		case ACL_USER:
> > +			ace->ae_id =
> > cpu_to_be32(xfs_kuid_to_disk(acl_e->e_uid));
> > +			break;
> > +		case ACL_GROUP:
> > +			ace->ae_id =
> > cpu_to_be32(xfs_kgid_to_disk(acl_e->e_gid));
> 
> and:
> 			ace->ae_id =
> cpu_to_be32(xfs_kuid_to_uid(acl_e->e_uid)); ace->ae_id =
> cpu_to_be32(xfs_kgid_to_gid(acl_e->e_uid));
> 
> Sorry for asking you to redo this - sometimes an idea I have doesn't
> quite work out the first time :/

Heh, no problem, I agree these names are even better. Makes me wonder
if the return type of xfs_k[ug]id_to_[ug]id should be [ug]id_t instead
of __uint32_t? I'll use these new names in a split out v3 series to
follow. Thanks.

> Cheers,
> 
> Dave.

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs