Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Fri, 19 Nov 2021 18:22:55 -0500

On Fri, 2021-11-19 at 19:14 +0200, Yordan Karadzhov wrote:
> On 19.11.21 г. 18:42 ч., James Bottomley wrote:
[...]
> > Can we back up and ask what problem you're trying to solve before
> > we start introducing new objects like namespace name?  The problem
> > statement just seems to be "Being able to see the structure of the
> > namespaces can be very useful in the context of the containerized
> > workloads."  which you later expanded on as "trying to add more
> > visibility into the working of things like kubernetes".  If you
> > just want to see the namespace "tree" you can script that (as root)
> > by matching the process tree and the /proc/<pid>/ns changes without
> > actually needing to construct it in the kernel.  This can also be
> > done without introducing the concept of a namespace name.  However,
> > there is a subtlety of doing this matching in the way I described
> > in that you don't get proper parenting to the user namespace
> > ownership ... but that seems to be something you don't want anyway?
> > 
> 
> The major motivation is to be able to hook tracing to individual
> containers. We want to be able to quickly discover the 
> PIDs of all containers running on a system. And when we say all, we
> mean not only Docker, but really all sorts of 
> containers that exist now or may exist in the future. We also
> considered the solution of brute-forcing all processes in 
> /proc/*/ns/ but we are afraid that such solution do not scale.

What do you mean does not scale?  ps and top use the /proc tree to
gather all the real time interface data for every process; do they not
"scale" as well and should therefore be done as in-kernel interfaces?

>  As I stated in the Cover letter, the problem was 
> discussed at Plumbers (links at the bottom of the Cover letter) and
> the conclusion was that the most distinct feature 
> that anything that can be called 'Container' must have is a separate
> PID namespace.

Unfortunately, I think I was fighting matrix fires at the time so
couldn't be there.  However, I'd have pushed back on the idea of
identifying containers by the pid namespace (mainly because most of the
unprivileged containers I set up don't have one).  Realistically, if
you're not a system container (need for pid 1) and don't have multiple
untrusted tenants (global process tree information leak), you likely
shouldn't be using the pid namespace either ... it just adds isolation
for no value.

>  This is why the PoC starts with the implementation of this
> namespace. You can see in the example script that discovering the
> name and all PIDs of all  containers gets quick and trivial with the
> help of this new filesystem. And you need to add just few more lines
> of code in order to make it start tracing a selected container.

But I could write a script or a tool to gather all the information
without this filesystem.  The namespace tree can be reconstructed by
anything that can view the process tree and the /proc/<pid>/ns
directory.

James