Re: Controlling devices and device namespaces

"Serge E. Hallyn" <serge@xxxxxxxxxx> · Sat, 15 Sep 2012 22:05:20 +0000

Quoting Eric W. Biederman (ebiederm@xxxxxxxxxxxx):
> "Serge E. Hallyn" <serge@xxxxxxxxxx> writes:
> 
> > Quoting Aristeu Rozanski (aris@xxxxxxxxx):
> >> Tejun,
> >> On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote:
> >> >   memcg can be handled by memcg people and I can handle cgroup_freezer
> >> >   and others with help from the authors.  The problematic one is
> >> >   blkio.  If anyone is interested in working on blkio, please be my
> >> >   guest.  Vivek?  Glauber?
> >> 
> >> if Serge is not planning to do it already, I can take a look in device_cgroup.
> >
> > That's fine with me, thanks.
> >
> >> also, heard about the desire of having a device namespace instead with
> >> support for translation ("sda" -> "sdf"). If anyone sees an immediate use for
> >> this please let me know.
> >
> > Before going down this road, I'd like to discuss this with at least you,
> > me, and Eric Biederman (cc:d) as to how it relates to a device
> > namespace.
> 
> 
> The problem with devices.
> 
> - An unrestricted mknod gives you access to effectively any device in
>   the system.
> 
> - During process migration, if the device number changes, using
>   stat on file descriptors can fail, even on the same file descriptor.
> 
> - Devices coming from preexisting filesystems that we mount
>   as unprivileged users are as dangerous as mknod but show
>   that the problem is not limited to mknod.
> 
> - udev thinks mknod is a system call we can remove from the kernel.

Also,

 - udevadm trigger --action=add

causes all the devices known on the host to be re-sent to
everyone (all namespaces), which floods everyone and causes the
host to reset some devices.

> ---
> 
> The use cases seem comparatively simple to enumerate.
> 
> - Giving unfiltered access to a device to someone not root.
> 
> - Virtual devices that everyone uses and have no real privilege
>   requirements: /dev/null /dev/tty /dev/zero etc.
> 
> - Dynamically created devices /dev/loopN /dev/tun /dev/macvtapN,
>   nbd, iscsi, /dev/ptsN, etc

and

 - per-namespace uevent filtering.
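
To make the uevent filtering idea concrete, here is a rough userspace
sketch in Python.  It assumes the usual kernel uevent payload layout (an
"action@devpath" header followed by NUL-separated KEY=VALUE fields).  The
function names and the "allowed set of (major, minor) pairs" policy are
made up for illustration; nothing like this exists in the kernel today.

```python
# Sketch only: parse a kernel uevent payload and decide whether to relay it
# into a container.  parse_uevent/should_relay are hypothetical names.

def parse_uevent(payload: bytes) -> dict:
    """Split a raw uevent into its header and KEY=VALUE fields."""
    parts = [p.decode("utf-8", "replace") for p in payload.split(b"\0") if p]
    if not parts:
        return {}
    fields = {"HEADER": parts[0]}          # e.g. "add@/devices/..."
    for item in parts[1:]:
        key, sep, value = item.partition("=")
        if sep:                            # ignore anything that is not KEY=VALUE
            fields[key] = value
    return fields


def should_relay(fields: dict, allowed: set) -> bool:
    """Relay only events for (major, minor) pairs the container may see."""
    try:
        dev = (int(fields["MAJOR"]), int(fields["MINOR"]))
    except (KeyError, ValueError):
        return False                       # no device number: drop in this sketch
    return dev in allowed
```

Events without MAJOR/MINOR are dropped here purely to keep the sketch
short; a real filter would need a policy for those as well.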

> ---
> 
> There are a couple of solutions to these problems.
> 
> - The classic solution of creating a /dev for a container
>   before starting it.
> 
> - The devpts filesystem.  This works well for unprivileged access
>   to ptys.  Except for the /dev/ptmx silliness I very much like how
>   things are handled today with devpts.
> 
> - Device control groups.  I am not quite certain what to make
>   of them.  The only case I see where they are better than
>   a prebuilt static dev is if there is a hotplugged device
>   that I want to push into my container.
> 
>   I think the only problem with device control groups and
>   hierarchies is that removing a device from a whitelist
>   does not recurse down the hierarchy.

That's going to be fixed soon thanks to Aristeu  :)

>   Can a process inside of a device control group create
>   a child group that has access to a subset of its
>   devices?  The actual checks don't need to be hierarchical
>   but the presence of device nodes should be.

If I understand your question right, yes.
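
As an illustration of what "a subset of its devices" means in practice,
here is a small Python sketch that checks whether every whitelist entry
of a child group is covered by some entry of its parent.  It uses the
"type major:minor access" entry format of the device cgroup (for
example "c 1:3 rwm", with "*" as a wildcard).  The helper names are
hypothetical, and this is not how the kernel implements the check.

```python
# Sketch only: is a child's device whitelist a subset of its parent's?
from typing import NamedTuple


class Rule(NamedTuple):
    dev_type: str       # 'a' (all), 'b' (block) or 'c' (char)
    major: str          # a number or '*'
    minor: str          # a number or '*'
    access: frozenset   # some of 'r', 'w', 'm' (mknod)


def parse_rule(text: str) -> Rule:
    """Parse an entry such as 'c 1:3 rwm'."""
    dev_type, numbers, access = text.split()
    major, minor = numbers.split(":")
    return Rule(dev_type, major, minor, frozenset(access))


def covers(parent: Rule, child: Rule) -> bool:
    """True if a single parent entry grants everything the child entry asks for."""
    return (parent.dev_type in ("a", child.dev_type)
            and parent.major in ("*", child.major)
            and parent.minor in ("*", child.minor)
            and child.access <= parent.access)


def is_subset(parent_rules, child_rules) -> bool:
    """True if every child entry is covered by at least one parent entry."""
    parents = [parse_rule(r) for r in parent_rules]
    return all(any(covers(p, parse_rule(c)) for p in parents)
               for c in child_rules)
```

The check is deliberately conservative: a child entry that is only
covered by the union of two parent entries is rejected.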

> ---
> 
> I see a couple of holes in the device control picture.
> 
> - How do we handle hotplug events?
> 
>   I think we can do this by relaying events through userspace,
>   updating the device control groups etc.
> 
> - Unprivileged processes interacting with all of this.
>   (possibly with privilege in their user namespace)
>   What I don't know is how to create a couple of different
>   subhierarchies each for different child processes.
> 
> - Dynamically created devices.
> 
>   My gut feel is that we should replicate the success of devpts
>   and give each type of dynamically created device its own
>   filesystem and mount point under /dev, and just bend
>   the handful of userspace users into that model.

Phew.  Maybe.  Had not considered that.  But seems daunting.

> - Sysfs
> 
>   My gut says for the container use case we should aim to
>   simply not have dynamically created devices in sysfs
>   and then we can simply not care.
> 
> - Migration
> 
>   Either we need block device numbers that can migrate with us,
>   (possibly a subset of the entire range ala devpts) or we need to send
>   hotplug events to userspace right after a migration so userspace
>   processes that care can invalidate their caches of stat data.
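
For the second option, a userspace process that caches stat data would
need some way to notice that the device number behind an open file
descriptor has changed.  A minimal Python sketch of such a check, using
only os.fstat and the major/minor helpers (device_key and
cache_is_stale are made-up names):

```python
# Sketch only: detect that a cached device number no longer matches an open fd.
import os
import stat


def device_key(st: os.stat_result):
    """Return ('b' or 'c', major, minor) for a device node, else None."""
    if stat.S_ISBLK(st.st_mode):
        kind = "b"
    elif stat.S_ISCHR(st.st_mode):
        kind = "c"
    else:
        return None
    return (kind, os.major(st.st_rdev), os.minor(st.st_rdev))


def cache_is_stale(fd: int, cached_key) -> bool:
    """True if the fd no longer refers to the device recorded in the cache."""
    return device_key(os.fstat(fd)) != cached_key
```

A process would record device_key() when it first opens the device and
call cache_is_stale() after being told (by a hotplug event, say) that a
migration happened.
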
> 
> ---
> 
> With the code in my userns development tree I can create a user
> namespace, create a new mount namespace, and then if I have
> access to any block devices, mount filesystems, all without
> needing to have any special privileges.  What I haven't
> figured out is what it would take to get the device
> control group into the middle of that.

I'm really not sure that's a question we want to ask.  The
device control group, like the ns cgroup, was meant as a
temporary workaround to not having user and device namespaces.

If we can come up with a device cgroup model that works to
fill all the requirements we would have for a devices ns, then
great.  But I don't want us to be constrained by that.

> It feels like it should be possible to get the checks straight
> and use the device control group hooks to control which devices
> are usable in a user namespace.  Unfortunately when I try to work
> it out, the independence of the user namespace and the device
> control group seems to make that impossible.
> 
> Shrug, there is most definitely something missing from our
> model on how to handle devices well.  I am hoping we can
> sprinkle some devpts derived pixie dust on the problem,
> migrate userspace to some new interfaces, and have life
> be good.
> 
> Eric

Me too!

I'm torn between suggesting that we have a session at UDS to
discuss this, and not wanting to, so that we can focus on the
remaining questions with the user namespace.

thanks,
-serge
_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/containers