Re: [PATCH RFC 09/12] userns: Convert ocfs2 to use kuid and kgid where appropriate

ebiederm@xxxxxxxxxxxx (Eric W. Biederman) · Wed, 13 Feb 2013 09:41:04 -0800

Joel Becker <jlbec@xxxxxxxxxxxx> writes:

> On Tue, Nov 20, 2012 at 04:43:37AM -0800, Eric W. Biederman wrote:
>> diff --git a/fs/ocfs2/acl.c b/fs/ocfs2/acl.c
>> index 260b162..8a40457 100644
>> --- a/fs/ocfs2/acl.c
>> +++ b/fs/ocfs2/acl.c
>> @@ -65,7 +65,20 @@ static struct posix_acl *ocfs2_acl_from_xattr(const void *value, size_t size)
>>  
>>  		acl->a_entries[n].e_tag  = le16_to_cpu(entry->e_tag);
>>  		acl->a_entries[n].e_perm = le16_to_cpu(entry->e_perm);
>> -		acl->a_entries[n].e_id   = le32_to_cpu(entry->e_id);
>> +		switch(acl->a_entries[n].e_tag) {
>> +		case ACL_USER:
>> +			acl->a_entries[n].e_uid =
>> +				make_kuid(&init_user_ns,
>> +					  le32_to_cpu(entry->e_id));
>> +			break;
>
> Stupid question: do you consider disjoint namespaces on multiple
> machines to be a problem?  Remember that ocfs2 is a cluster filesystem.
> If I have uid 100 on machine A in the default namespace, and then I
> mount the filesystem on machine B with uid 100 in a different namespace,
> what happens?  I presume that both can access as the same nominal uid,
> and configuring this correctly is left as an exercise to the namespace
> administrator?

Yep.  That is the way it has been since nfs first gave us that
challenge.  Sane administrators of shared filesystems use the same
uids for the same functions across all machines that use that
filesystem.

That said, it is possible (but not implemented in these patches) to have
a notion of a filesystem that lives in a user namespace other than the
initial user namespace, essentially by capturing the user namespace
at mount time and storing it on the super block.  For the generic case
that requires a little bit of infrastructure work for quotas.

At this point my goal is to get all of the conversions into all of the
right places, and then let the people who care do the work to allow
mounting their filesystem in another user namespace.

It is a very practical problem that user namespace support cannot be
enabled when filesystems that have not had kuid/kgid support pushed down
into them are enabled in the kernel.  So I am working hard to push down
kuids and kgids and find all of the places that need conversions, so
enabling user namespaces will not cause incorrect kernel behavior
because the wrong types were used somewhere.

>> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
>> index 4f7795f..f99af1c 100644
>> --- a/fs/ocfs2/dlmglue.c
>> +++ b/fs/ocfs2/dlmglue.c
>> @@ -2045,8 +2045,8 @@ static void __ocfs2_stuff_meta_lvb(struct inode *inode)
>>  	lvb->lvb_version   = OCFS2_LVB_VERSION;
>>  	lvb->lvb_isize	   = cpu_to_be64(i_size_read(inode));
>>  	lvb->lvb_iclusters = cpu_to_be32(oi->ip_clusters);
>> -	lvb->lvb_iuid      = cpu_to_be32(inode->i_uid);
>> -	lvb->lvb_igid      = cpu_to_be32(inode->i_gid);
>> +	lvb->lvb_iuid      = cpu_to_be32(i_uid_read(inode));
>> +	lvb->lvb_igid      = cpu_to_be32(i_gid_read(inode));
>
> 	I have the reverse question here.  Are we guaranteed that the
> on-disk uid/gid will not change regardless of the namespace?  That is,
> if I create a file on machine A in init_user_ns as uid 100, then access
> it over on machine B in some other namespace with a user-visible uid of
> 100, will the wire be passing 100 in both directions?  This absolutely
> must be true for the cluster communication to work.

The model I am working with is that for every filesystem there is
exactly one user namespace it stores the uids and gids in.  That user
namespace does not have to be the initial user namespace, but there
is exactly one.

A user running in a user namespace different from the user namespace of
the filesystem will first have their uids and gids mapped to kuids and
kgids and then those kuids and kgids will be mapped to the on disk
representation.
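
The two-step translation described above can be sketched as a small
userspace model.  This is an illustration only, not kernel code: the
names struct ns_model, model_make_kuid(), model_from_kuid() and
model_uid_to_disk() are hypothetical, and each namespace is assumed to
have a single mapping extent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define MODEL_INVALID_ID UINT32_MAX

/*
 * One mapping extent: ids [first, first + count) inside the namespace
 * correspond to kernel-wide ids [lower_first, lower_first + count).
 */
struct ns_model {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Namespace-local id -> kernel-wide id (the role make_kuid() plays). */
static uint32_t model_make_kuid(const struct ns_model *ns, uint32_t id)
{
	assert(ns != NULL);
	if (id < ns->first || id - ns->first >= ns->count)
		return MODEL_INVALID_ID;
	return ns->lower_first + (id - ns->first);
}

/* Kernel-wide id -> namespace-local id (the role from_kuid() plays). */
static uint32_t model_from_kuid(const struct ns_model *ns, uint32_t kuid)
{
	assert(ns != NULL);
	if (kuid < ns->lower_first || kuid - ns->lower_first >= ns->count)
		return MODEL_INVALID_ID;
	return ns->first + (kuid - ns->lower_first);
}

/*
 * Step 1: map the caller's uid to a kernel-wide id.
 * Step 2: map that id into the one namespace the filesystem stores in.
 */
static uint32_t model_uid_to_disk(const struct ns_model *user_ns,
				  const struct ns_model *fs_ns,
				  uint32_t uid)
{
	uint32_t kuid = model_make_kuid(user_ns, uid);

	if (kuid == MODEL_INVALID_ID)
		return MODEL_INVALID_ID;
	return model_from_kuid(fs_ns, kuid);
}
```

With an assumed user namespace that maps ids 0..65535 onto 100000..165535
and a filesystem stored in the initial namespace (identity mapping), uid
100 in that namespace comes out as 100100 in the stored representation.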

Except for the odd ioctl or quota callback the vfs handles all of the
translation of uids and gids from user space to kuids and kgids.  Which
means the filesystems don't need to deal with what users are thinking,
and I don't need to teach filesystems to store an extended attribute
with user namespace information.

That reduces the problem for filesystems to dealing with kuids and
kgids coming from the vfs and translating those into the numbers you
want to store on disk.  Currently all filesystems store their ids on
disk in the initial user namespace of the kernel.  So all of the
conversions into on disk structures are to the initial user namespace.
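
As a rough userspace sketch of that conversion: the in-memory id is kept
in a distinct wrapper type (in the kernel, kuid_t is likewise a struct
wrapper, so passing it where a plain integer is expected fails to
compile), and it is turned into a plain 32-bit number only at the point
where the on disk field is written.  The names model_kuid_t,
model_from_kuid_init(), model_write_owner() and model_read_owner() are
hypothetical, and the little-endian layout is an assumption chosen to
mirror the le32 field in the acl.c hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper type: cannot be mixed with plain integers. */
typedef struct {
	uint32_t val;
} model_kuid_t;

/* The initial namespace maps ids one to one, so this is the identity. */
static uint32_t model_from_kuid_init(model_kuid_t kuid)
{
	return kuid.val;
}

/* Write the owner into a 4-byte little-endian on disk field. */
static void model_write_owner(uint8_t disk[4], model_kuid_t owner)
{
	uint32_t raw = model_from_kuid_init(owner);

	assert(disk != NULL);
	disk[0] = (uint8_t)(raw & 0xff);
	disk[1] = (uint8_t)((raw >> 8) & 0xff);
	disk[2] = (uint8_t)((raw >> 16) & 0xff);
	disk[3] = (uint8_t)((raw >> 24) & 0xff);
}

/* Read the field back and wrap it again before handing it to callers. */
static model_kuid_t model_read_owner(const uint8_t disk[4])
{
	model_kuid_t owner;

	assert(disk != NULL);
	owner.val = (uint32_t)disk[0] |
		    ((uint32_t)disk[1] << 8) |
		    ((uint32_t)disk[2] << 16) |
		    ((uint32_t)disk[3] << 24);
	return owner;
}
```

The point of the wrapper is that every place an id crosses into or out
of an on disk structure has to go through an explicit conversion, which
is what makes missed conversions show up at compile time.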

For network protocols there is the added challenge that you want to make
as certain as you can that all of the parties are talking about uids and
gids in the same user namespace.  In general I make the assumption that
the filesystem's uids and gids are stored in the user namespace of the
process that mounts the filesystem, and that those user space processes
take care of connecting you to other folks speaking of uids and gids in
the same user namespace.

In a few of my patches there are places where I can prevent a mismatch,
and so I check that the userspace process is in the initial user
namespace and fail otherwise.

Until someone does the work to deal with something other than the
initial user namespace in a filesystem and sets the FS_USERNS_MOUNT flag
in .fs_flags of struct file_system_type, a filesystem is guaranteed to
always be mounted in the initial user namespace.  So while there may be
users in other user namespaces, the filesystem can just worry about
getting kuids and kgids and storing and communicating uids and gids in
the initial user namespace with the same logic it has always used.
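
The opt-in can be pictured with a small userspace model.  This is only a
sketch of the rule as stated above: struct fs_type_model, struct
mounter_model, MODEL_FS_USERNS_MOUNT and model_may_mount() are
hypothetical names, the flag value is arbitrary, and the capability
checks a real kernel performs are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Arbitrary value for this model; stands in for FS_USERNS_MOUNT. */
#define MODEL_FS_USERNS_MOUNT 0x8

/* Stand-in for the filesystem type and its .fs_flags field. */
struct fs_type_model {
	const char *name;
	int fs_flags;
};

/* Stand-in for the user namespace of the mounting process. */
struct mounter_model {
	bool in_initial_ns;
};

/*
 * A filesystem that has not opted in is only ever mounted from the
 * initial user namespace; one that has opted in may be mounted from
 * other user namespaces as well.
 */
static bool model_may_mount(const struct fs_type_model *type,
			    const struct mounter_model *mounter)
{
	assert(type != NULL && mounter != NULL);
	if (mounter->in_initial_ns)
		return true;
	return (type->fs_flags & MODEL_FS_USERNS_MOUNT) != 0;
}
```

A filesystem type with the flag clear therefore never sees a mount from
a non-initial namespace in this model, which is why it can keep storing
ids in the initial user namespace unchanged.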

Which is all a long way of saying that if a user in another user
namespace with uid 100, which maps to uid 100100 in the initial user
namespace, writes to a file, filesystems are expected to treat that as
a write from uid 100100.  If that isn't what the user who set up the
other user namespace wants to see in the on disk structures, they
should have used a different mapping when setting up the user namespace.

And of course every uid that is mapped in any user namespace has a
lossless mapping to and from the initial user namespace.

Hopefully that helps to clear up some confusion, if the intervening
time hadn't already cleared it up.

Eric

_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/containers