On 01/24/2012 01:12 AM, Al Viro wrote:
> On Mon, Jan 23, 2012 at 08:56:08PM +0400, Glauber Costa wrote:
>> This patch creates a list of allowed filesystems per-namespace.
>> The goal is to prevent users inside a container, even root,
>> to mount filesystems that are not allowed by the main box admin.
>>
>> My main two motivators to pursue this are:
>> 1) We want to prevent a certain tailored view of some virtual
>> filesystems, for example, by bind-mounting files with userspace
>> generated data into /proc. The ability of mounting /proc inside
>> the container works against this effort, while disallowing it
>> via capabilities would have the effect of disallowing other
>> mounts as well.
>
> Translation, please.
>
>> 2) Some filesystems are known not to behave well under a container
>> environment. They require changes to work in a safe-way. We can
>> whitelist only the filesystems we want.
>
> So fix them.
>
>> This works as a whitelist. Only filesystems in the list are allowed
>> to be mounted. Doing a blacklist would create problems when, say,
>> a module is loaded. The whitelist is only checked if it is enabled first.
>> So any setup that was already working, will keep working. And whoever
>> is not interested in limiting filesystem mount, does not need
>> to bother about it.
>>
>> Please let me know what you guys think about it.
>
> NAKed-by: Al Viro<viro@xxxxxxxxxxxxxxxxxx>
> NAKed-because: too fucking ugly
>
> This is bloody ridiculous; if you want to prevent a luser adming playing with
> the set of mounts you've given it, the right way to go is not to mess with the
> "which fs types are allowed" but to add a per-namespace "immutable" flag.
> And add a new clone(2)/unshare(2) flag, used only along with the CLONE_NEWNS
> and setting the "immutable" on the copied namespace.

Okay, now that I have laid out the problem, I am happy to pursue
whatever solution we think is better. But let me develop it a bit more
first.

An immutable flag does not work, because I don't want to prevent a luser
(loved that) from messing with the mounts they are given. In general, it
is perfectly fine for them to mount things inside the container as time
goes on.

But some other mounts I don't consider fine. To elaborate on the /proc
example I gave: much of the information living in /proc is really
global, rather than per-container. The parts pertaining to the pid
namespace and other namespaces are already per-namespace, so they are
fine. But there is more: some of the things /proc tracks, like cpu usage,
memory, and the like, are resource-constrained by other entities, for
instance, cgroups. In some cases, like /proc/stat, the information
exists in cgroups, but comes from more than one cgroup. All of them are
independent in nature, making it hard to come up with a coherent view.

Furthermore, there is no connection between namespaces and cgroups, so
it is not obvious at all (there were discussions before) which
information the process should see - unlike with namespaces, the mere
fact that a process lives in a cgroup does not really mean it is
isolated from the system in this sense.

One of the solutions is to do it all in userspace, from outside the
container, and bind mount the files inside the container's /proc. But it
only works if we can prevent the user from remounting the real /proc
somewhere else. Not because it would screw up his system, which I don't
care about, but because it would give him information about the global
state of the system.
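
To make the userspace approach a bit more concrete, here is a rough
sketch of what the container manager could do from the host side. The
function names and the container root path are made up for illustration;
only mount(2) with MS_BIND is the real interface. It assumes the
generated file already exists and that the caller has CAP_SYS_ADMIN.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Build "<root>/proc/<name>" into buf. "root" is the container's root
 * directory as seen from the host (hypothetical, chosen by the manager).
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
int proc_target(const char *root, const char *name, char *buf, size_t len)
{
    int n;

    assert(root != NULL && name != NULL && buf != NULL);
    n = snprintf(buf, len, "%s/proc/%s", root, name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Bind-mount a host-side file with userspace-generated data over one
 * entry of the container's /proc. Returns 0 on success, -1 on failure
 * (errno is set by mount(2) when the mount itself fails).
 */
int overlay_proc_file(const char *generated, const char *root,
                      const char *name)
{
    char target[4096];

    if (proc_target(root, name, target, sizeof(target)) < 0)
        return -1;
    return mount(generated, target, NULL, MS_BIND, NULL);
}
```

Nothing in this sketch stops root inside the container from mounting a
fresh procfs elsewhere and reading the global values there, which is
exactly the hole described above.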

An immutable flag fixes this, but then it prevents all further
legitimate mounts.
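
For comparison, the semantics I am after with the whitelist are roughly
the following. This is a standalone userspace sketch with hypothetical
names, not the code from the patch: the check is skipped entirely when
the list is not enabled, and otherwise only listed filesystem types may
be mounted, while other mounts keep working.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_ALLOWED_FS 16

/* Hypothetical per-namespace policy, for illustration only. */
struct fs_whitelist {
    bool enabled;                       /* false: no filtering at all */
    size_t count;                       /* number of valid entries in names[] */
    const char *names[MAX_ALLOWED_FS];  /* allowed types, e.g. "tmpfs" */
};

/* Returns true if mounting a filesystem of type fstype is permitted. */
bool fs_mount_allowed(const struct fs_whitelist *wl, const char *fstype)
{
    size_t i;

    assert(wl != NULL && fstype != NULL);
    if (!wl->enabled)
        return true;    /* setups that never enable the list keep working */
    for (i = 0; i < wl->count; i++)
        if (strcmp(wl->names[i], fstype) == 0)
            return true;
    return false;       /* unlisted types, e.g. from a new module, are refused */
}
```

With a list of { "tmpfs", "ext4" } enabled, a tmpfs mount inside the
container is still allowed and a procfs mount is refused. An immutable
namespace would refuse both.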