On Mon, Feb 13, 2017 at 11:15 AM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> writes:
>
>> On Thu, 2017-02-09 at 02:36 -0800, Josh Triplett wrote:
>>> On Wed, Feb 08, 2017 at 07:22:45AM -0800, James Bottomley wrote:
>>> > On Tue, 2017-02-07 at 17:54 -0800, Josh Triplett wrote:
>>> > > On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig
>>> > > wrote:
>>> > > > On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley
>>> > > > wrote:
>>> > > > > > Another option would be to require something like a
>>> > > > > > project as used for project quotas as the root. This would
>>> > > > > > also be convenient as it could store the used remapping
>>> > > > > > tables.
>>> > > > >
>>> > > > > So this would be like the current project quota except set on
>>> > > > > a subtree? I could see it being done that way, but I don't
>>> > > > > see what advantage it has over using flags in the subtree
>>> > > > > itself (the mapping is known based on the mount namespace, so
>>> > > > > there's really only a single bit of information to store).
>>> > > >
>>> > > > projects (which are the underlying concept for project quotas)
>>> > > > are per-subtree in practice - the flag is set on an inode and
>>> > > > then all directories and files underneath inherit the project
>>> > > > ID; hardlinking outside a project is prohibited.
>>> > >
>>> > > I'm interested in having a VFS-level way to do more than just a
>>> > > shift; I'd like to be able to arbitrarily remap IDs between
>>> > > what's on disk and the system IDs.
>>> >
>>> > OK, so the shift is effectively an arbitrary remap because it
>>> > allows multiple ranges to be mapped (although the userns currently
>>> > imposes a maximum of five extents; that limit is somewhat
>>> > arbitrary, chosen just to bound the amount of space the
>>> > parametrisation takes). See
>>> > kernel/user_namespace.c:map_id_up/down().
>>> >
>>> > > If we're talking about developing a VFS-level solution for
>>> > > this, I'd like to avoid limiting it to just a shift. (A
>>> > > shift/range would definitely be the simplest solution for many
>>> > > common container cases, but not all.)
>>> >
>>> > I assume the above satisfies you on this point, but it raises the
>>> > question: do you want an arbitrary shift not parametrised by a user
>>> > namespace? If so, how many such shifts do you want ... giving some
>>> > details of the use case would be helpful.
>>>
>>> The limit of five extents means this may not work in the most general
>>> case, no.
>>
>> That's not an API limit, so it can be changed if there's a need. The
>> problem was merely how to parametrise a mapping without taking too much
>> space.
>>
>>> One use case: given an on-disk filesystem, its name-to-number
>>> mapping, and your host name-to-number mapping, mount the filesystem
>>> with all the UIDs bidirectionally mapped to those on your host
>>> system.
>>
>> This is pretty much what the s_user_ns does.
>>
>>> Another use case: given an on-disk filesystem with potentially
>>> arbitrary UIDs (not necessarily in a clean contiguous block), and a
>>> pile of unprivileged UIDs, mount the filesystem such that every
>>> on-disk UID gets a unique unprivileged UID.
>>
>> So is this. Basically, anything that begins by mounting gets a
>> superblock and can use the s_user_ns to map from the filesystem view
>> to the kernel view of ids. Apart from greater sophistication in the
>> parametrisation, it sounds like we have all the machinery you need.
>> I'm sure the containers people will consider reasonable patches to
>> change this.
>
> Yes.
>
> And to be clear, we have all of that merged now and mostly present and
> hooked up in all filesystems, without any shiftfs-like changes needed.
>
> To use this with a filesystem, a final pass is needed to verify that
> the cases where something does not map are handled cleanly.

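(For anyone following along: below is a minimal, self-contained sketch
of the extent walk James points at in
kernel/user_namespace.c:map_id_up/down(). The struct layout and names
are simplified for illustration, so treat it as a userspace model
rather than the kernel's exact code.)

    /* Minimal userspace model of the extent walk in
     * kernel/user_namespace.c:map_id_down(); the limit mirrors the
     * kernel's UID_GID_MAP_MAX_EXTENTS of 5.
     */
    #include <stdint.h>

    #define MAX_EXTENTS 5

    struct extent {
            uint32_t first;       /* first id on the mapped side */
            uint32_t lower_first; /* first id on the backing side */
            uint32_t count;       /* number of ids in this range */
    };

    struct id_map {
            unsigned int nr_extents;
            struct extent extent[MAX_EXTENTS];
    };

    /* Translate id through the map; (uint32_t)-1 means "does not map". */
    static uint32_t map_id_down(const struct id_map *map, uint32_t id)
    {
            unsigned int idx;

            for (idx = 0; idx < map->nr_extents; idx++) {
                    uint32_t first = map->extent[idx].first;
                    uint32_t last  = first + map->extent[idx].count - 1;

                    if (id >= first && id <= last)
                            return (id - first) + map->extent[idx].lower_first;
            }
            return (uint32_t)-1;
    }

Since the whole parametrisation is just an array of (first,
lower_first, count) triples, lifting the five-extent limit is a
question of sizing rather than API, which is James's point above.
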
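(Similarly, a rough sketch of the per-filesystem pass Eric describes.
example_read_owner() is a hypothetical helper, not an existing kernel
function; the real points are that make_kuid() translates through the
superblock's s_user_ns, returns INVALID_UID for an id with no mapping,
and that the filesystem has to handle that case cleanly.)

    /* Hypothetical helper (illustration only): how a filesystem might
     * translate a raw on-disk uid through its superblock's s_user_ns
     * when initialising an inode.
     */
    #include <linux/fs.h>
    #include <linux/printk.h>
    #include <linux/uidgid.h>

    static void example_read_owner(struct inode *inode, u32 raw_uid)
    {
            /* make_kuid() consults the mount's user namespace and
             * returns INVALID_UID when raw_uid has no mapping there. */
            kuid_t kuid = make_kuid(inode->i_sb->s_user_ns, raw_uid);

            if (!uid_valid(kuid))
                    /* the "does not map" case: fail cleanly (deny
                     * access, refuse owner writes) instead of guessing */
                    pr_warn("on-disk uid %u unmapped in this mount\n",
                            raw_uid);

            inode->i_uid = kuid;
    }
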
Still, this does not answer the question of how to dynamically
*attach/share* data or read-only volumes, as defined by
orchestration/container tools, into several containers. Am I missing
something, or is the plan to have a per-superblock mount for each one?

--
tixxdz