Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

Andrei Vagin <avagin@xxxxxxxxx> · Fri, 14 Aug 2020 12:21:02 -0700

On Fri, Aug 14, 2020 at 06:11:58PM +0300, Kirill Tkhai wrote:
> On 14.08.2020 04:16, Andrei Vagin wrote:
> > On Thu, Aug 13, 2020 at 11:12:45AM +0300, Kirill Tkhai wrote:
> >> On 12.08.2020 20:53, Andrei Vagin wrote:
> >>> On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote:
> >>>> On 10.08.2020 20:34, Andrei Vagin wrote:
> >>>>> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
> >>>>>> On 06.08.2020 11:05, Andrei Vagin wrote:
> >>>>>>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> >>>>>>>> On 31.07.2020 01:13, Eric W. Biederman wrote:
> >>>>>>>>> Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes:
> >>>>>>>>>
> >>>>>>>>>> On 30.07.2020 17:34, Eric W. Biederman wrote:
> >>>>>>>>>>> Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes:
> >>>>>>>>>>>
> >>>>>>>>>>>> Currently, there is no a way to list or iterate all or subset of namespaces
> >>>>>>>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> >>>>>>>>>>>> but some also may be as open files, which are not attached to a process.
> >>>>>>>>>>>> When a namespace open fd is sent over unix socket and then closed, it is
> >>>>>>>>>>>> impossible to know whether the namespace exists or not.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Also, even if namespace is exposed as attached to a process or as open file,
> >>>>>>>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> >>>>>>>>>>>> this multiplies at tasks and fds number.
> >>>>>>>>>>>
> >>>>>>>>>>> I am very dubious about this.
> >>>>>>>>>>>
> >>>>>>>>>>> I have been avoiding exactly this kind of interface because it can
> >>>>>>>>>>> create rather fundamental problems with checkpoint restart.
> >>>>>>>>>>
> >>>>>>>>>> restart/restore :)
> >>>>>>>>>>
> >>>>>>>>>>> You do have some filtering and the filtering is not based on current.
> >>>>>>>>>>> Which is good.
> >>>>>>>>>>>
> >>>>>>>>>>> A view that is relative to a user namespace might be ok.    It almost
> >>>>>>>>>>> certainly does better as it's own little filesystem than as an extension
> >>>>>>>>>>> to proc though.
> >>>>>>>>>>>
> >>>>>>>>>>> The big thing we want to ensure is that if you migrate you can restore
> >>>>>>>>>>> everything.  I don't see how you will be able to restore these files
> >>>>>>>>>>> after migration.  Anything like this without having a complete
> >>>>>>>>>>> checkpoint/restore story is a non-starter.
> >>>>>>>>>>
> >>>>>>>>>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
> >>>>>>>>>>
> >>>>>>>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
> >>>>>>>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
> >>>>>>>>>> problem here.
> >>>>>>>>>
> >>>>>>>>> An obvious diffference is that you are adding the inode to the inode to
> >>>>>>>>> the file name.  Which means that now you really do have to preserve the
> >>>>>>>>> inode numbers during process migration.
> >>>>>>>>>
> >>>>>>>>> Which means now we have to do all of the work to make inode number
> >>>>>>>>> restoration possible.  Which means now we need to have multiple
> >>>>>>>>> instances of nsfs so that we can restore inode numbers.
> >>>>>>>>>
> >>>>>>>>> I think this is still possible but we have been delaying figuring out
> >>>>>>>>> how to restore inode numbers long enough that may be actual technical
> >>>>>>>>> problems making it happen.
> >>>>>>>>
> >>>>>>>> Yeah, this matters. But it looks like here is not a dead end. We just need
> >>>>>>>> change the names the namespaces are exported to particular fs and to support
> >>>>>>>> rename().
> >>>>>>>>
> >>>>>>>> Before introduction a principally new filesystem type for this, can't
> >>>>>>>> this be solved in current /proc?
> >>>>>>>
> >>>>>>> do you mean to introduce names for namespaces which users will be able
> >>>>>>> to change? By default, this can be uuid.
> >>>>>>
> >>>>>> Yes, I mean this.
> >>>>>>
> >>>>>> Currently I won't give a final answer about UUID, but I planned to show some
> >>>>>> default names, which based on namespace type and inode num. Completely custom
> >>>>>> names for any /proc by default will waste too much memory.
> >>>>>>
> >>>>>> So, I think the good way will be:
> >>>>>>
> >>>>>> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and some static
> >>>>>> random seed, which is generated on boot;
> >>>>>>
> >>>>>> 2)Use the hash/uuid as default names in newly create /proc/namespaces: pid-{hash/uuid(ino, "pid")}
> >>>>>>
> >>>>>> 3)Allow rename, and allocate space only for renamed names.
> >>>>>>
> >>>>>> Maybe 2 and 3 will be implemented as shrinkable dentries and non-shrinkable.
> >>>>>>
> >>>>>>> And I have a suggestion about the structure of /proc/namespaces/.
> >>>>>>>
> >>>>>>> Each namespace is owned by one of user namespaces. Maybe it makes sense
> >>>>>>> to group namespaces by their user-namespaces?
> >>>>>>>
> >>>>>>> /proc/namespaces/
> >>>>>>>                  user
> >>>>>>>                  mnt-X
> >>>>>>>                  mnt-Y
> >>>>>>>                  pid-X
> >>>>>>>                  uts-Z
> >>>>>>>                  user-X/
> >>>>>>>                         user
> >>>>>>>                         mnt-A
> >>>>>>>                         mnt-B
> >>>>>>>                         user-C
> >>>>>>>                         user-C/
> >>>>>>>                                user
> >>>>>>>                  user-Y/
> >>>>>>>                         user
> >>>>>>
> >>>>>> Hm, I don't think that user namespace is a generic key value for everybody.
> >>>>>> For generic people tasks a user namespace is just a namespace among another
> >>>>>> namespace types. For me it will look a bit strage to iterate some user namespaces
> >>>>>> to build container net topology.
> >>>>>
> >>>>> I can’t agree with you that the user namespace is one of others. It is
> >>>>> the namespace for namespaces. It sets security boundaries in the system
> >>>>> and we need to know them to understand the whole system.
> >>>>>
> >>>>> If user namespaces are not used in the system or on a container, you
> >>>>> will see all namespaces in one directory. But if the system has a more
> >>>>> complicated structure, you will be able to build a full picture of it.
> >>>>>
> >>>>> You said that one of the users of this feature is CRIU (the tool to
> >>>>> checkpoint/restore containers)  and you said that it would be good if
> >>>>> CRIU will be able to collect all container namespaces before dumping
> >>>>> processes, sockets, files etc. But how will we be able to do this if we
> >>>>> will list all namespaces in one directory?
> >>>>
> >>>> There is no a problem, this looks rather simple. Two cases are possible:
> >>>>
> >>>> 1)a container has dedicated namespaces set, and CRIU just has to iterate
> >>>>   files in /proc/namespaces of root pid namespace of the container.
> >>>>   The relationships between parents and childs of pid and user namespaces
> >>>>   are founded via ioctl(NS_GET_PARENT).
> >>>>   
> >>>> 2)container has no dedicated namespaces set. Then CRIU just has to iterate
> >>>>   all host namespaces. There is no another way to do that, because container
> >>>>   may have any host namespaces, and hierarchy in /proc/namespaces won't
> >>>>   help you.
> >>>>
> >>>>> Here are my thoughts why we need to the suggested structure is better
> >>>>> than just a list of namespaces:
> >>>>>
> >>>>> * Users will be able to understand securies bondaries in the system.
> >>>>>   Each namespace in the system is owned by one of user namespace and we
> >>>>>   need to know these relationshipts to understand the whole system.
> >>>>
> >>>> Here are already NS_GET_PARENT and NS_GET_USERNS. What is the problem to use
> >>>> this interfaces?
> >>>
> >>> We can use these ioctl-s, but we will need to enumerate all namespaces in
> >>> the system to build a view of the namespace hierarchy. This will be very
> >>> expensive. The kernel can show this hierarchy without additional cost.
> >>
> >> No. We will have to iterate /proc/namespaces of a specific container to get
> >> its namespaces. It's a subset of all namespaces in system, and these all the
> >> namespaces, which are potentially allowed for the container.
> > 
> > """
> > Every /proc is related to a pid_namespace, and the pid_namespace
> > is related to a user_namespace. The items, we show in this
> > /proc/namespaces/ directory, are the namespaces,
> > whose user_namespaces are the same as /proc's user_namespace,
> > or their descendants.
> > """ // [PATCH 11/23] fs: Add /proc/namespaces/ directory
> > 
> > This means that if a user want to find out all container namespaces, it
> > has to have access to the container procfs and the container should
> > a separate pid namespace.
> > 
> > I would say these are two big limitations. The first one will not affect
> > CRIU and I agree CRIU can use this interface in its current form.
> > 
> > The second one will be still the issue for CRIU. And they both will
> > affect other users.
> > 
> > For end users, it will be a pain. They will need to create a pid
> > namespaces in a specified user-namespace, if a container doesn't have
> > its own. Then they will need to mount /proc from the container pid
> > namespace and only then they will be able to enumerate namespaces.
> 
> In case of a container does not have its own pid namespace, CRIU already
> sucks. Every file in /proc directory is not reliable after restore,
> so /proc/namespaces is just one of them. Container, who may access files
> in /proc, does have to have its own pid namespace.

Can you be more detailed here? What files are not reliable? And why we
don't need to think about this use-case? If we have any issues here,
maybe we need to think how to fix them instead of adding a new one.

> 
> Even if we imagine an unreal situation, when the rest of /proc files are reliable,
> sub-directories won't help in this case also. In case of we introduce user ns
> hierarchy, the namespaces names above container's user ns, will still
> be unchangeble:
> 
> /proc/namespaces/parent_user_ns/container_user_ns/...
> 
> Path to container_user_ns is fixed. If container accesses /proc/namespace/parent_user_ns
> file, it will suck a pow after restore again.

In case of user ns hierarchy, a container will see only its sub-tree and
it will not know a name of its root namespace. It will look like this:

>From host:
/proc/namespaces/user_ns_ct1/user1
                             user2

/proc/namespaces/user_ns_ct2/user1
                             user2

>From ct1:
/proc/namespaces/user1
                 user2

And now could you explain how you are going to solve this problem with
your interface?

> 
> So, the suggested sub-directories just don't work.

I am sure it will work.

> 
> > But to build a view of a hierarchy of these namespaces, they will need to
> > use a binary tool which will open each of these namespaces, call
> > NS_GET_PARENT and NS_GET_USERNS ioctl-s and build a tree.
> 
> Yes, it's the same way we have on a construction of tasks tree.
> 
> Linear /proc/namespaces is rather natural way. The sense is "all namespaces,
> which are available for tasks in this /proc directory".
> 
> Grouping by user ns directories looks odd. CRIU is only util, who needs
> such the grouping. But even for CRIU performance advantages look dubious.

I can't agree with you here. This isn't about CRIU. Grouping by user ns
doesn't look odd for me, because this is how namespaces are grouped in
the kernel.

> 
> For another utils, the preference of user ns grouping over another hierarchy
> namespaces looks just weirdy weird.
> 
> I can agree with an idea of separate top-level sub-directories for different
> namespaces types like:
> 
> /proc/namespaces/uts/
> /proc/namespaces/user/
> /proc/namespaces/pid/
> ...
> 
> But grouping of all another namespaces by user ns sub-directories absolutely
> does not look sane for me.

I think we are stuck here and we need to ask an opinion of someone else.

>  
> >>
> >>>>
> >>>>> * This is simplify collecting namespaces which belong to one container.
> >>>>>
> >>>>> For example, CRIU collects all namespaces before dumping file
> >>>>> descriptors. Then it collects all sockets with socket-diag in network
> >>>>> namespaces and collects mount points via /proc/pid/mountinfo in mount
> >>>>> namesapces. Then these information is used to dump socket file
> >>>>> descriptors and opened files.
> >>>>
> >>>> This is just the thing I say. This allows to avoid writing recursive dump.
> >>>
> >>> I don't understand this. How are you going to collect namespaces in CRIU
> >>> without knowing which are used by a dumped container?
> >>
> >> My patchset exports only the namespaces, which are allowed for a specific
> >> container, and no more above this. All exported namespaces are alive,
> >> so someone holds a reference on every of it. So they are used.
> >>
> >> It seems you haven't understood the way I suggested here. See patch [11/23]
> >> for the details. It's about permissions, and the subset of exported namespaces
> >> is formalized there.
> > 
> > Honestly, I have not read all patches in this series and you didn't
> > describe this behavior in the cover letter. Thank you for pointing out
> > to the 11 patch, but I still think it doesn't solve the problem
> > completely. More details is in the comment which is a few lines above
> > this one.
> > 
> >>
> >>>> But this has nothing about advantages of hierarchy in /proc/namespaces.
> > 
> > Yes, it has. For example, in cases when a container doesn't have its own
> > pid namespaces.
> > 
> >>>
> >>> Really? You said that you implemented this series to help CRIU dumping
> >>> namespaces. I think we need to implement the CRIU part to prove that
> >>> this interface is usable for this case. Right now, I have doubts about
> >>> this.
> >>
> >> Yes, really. See my comment above and patch [11/23].
> >>
> >>>>
> >>>>> * We are going to assign names to namespaces. But this means that we
> >>>>> need to guarantee that all names in one directory are unique. The
> >>>>> initial proposal was to enumerate all namespaces in one proc directory,
> >>>>> that means names of all namespaces have to be unique. This can be
> >>>>> problematic in some cases. For example, we may want to dump a container
> >>>>> and then restore it more than once on the same host. How are we going to
> >>>>> avoid namespace name conficts in such cases?
> >>>>
> >>>> Previous message I wrote about .rename of proc files, Alexey Dobriyan
> >>>> said this is not a taboo. Are there problem which doesn't cover the case
> >>>> you point?
> >>>
> >>> Yes, there is. Namespace names will be visible from a container, so they
> >>> have to be restored. But this means that two containers can't be
> >>> restored from the same snapshot due to namespace name conflicts.
> >>>
> >>> But if we will show namespaces how I suggest, each container will see
> >>> only its sub-tree of namespaces and we will be able to specify any name
> >>> for the container root user namespace.
> >>
> >> Now I'm sure you missed my idea. See proc_namespaces_readdir() in [11/23].
> >>
> >> I do export sub-tree.
> > 
> > I got your idea, but it is unclear how your are going to avoid name
> > conflicts.
> > 
> > In the root container, you will show all namespaces in the system. These
> > means that all namespaces have to have unique names. This means we will
> > not able to restore two containers from the same snapshot without
> > renaming namespaces. But we can't change namespace names, because they
> > are visible from containers and container processes can use them.
> 
> Grouping by user ns sub-directories does not solve a problem with names
> of containers w/o own pid ns. See above.

It solves, you just doesn't understand how it works. See above.

> 
> >>
> >>>>
> >>>>> If we will have per-user-namespace directories, we will need to
> >>>>> guarantee that names are unique only inside one user namespace.
> >>>>
> >>>> Unique names inside one user namespace won't introduce a new /proc
> >>>> mount. You can't pass a sub-directory of /proc/namespaces/ to a specific
> >>>> container. To give a virtualized name you have to have a dedicated pid ns.
> >>>>
> >>>> Let we have in one /proc mount:
> >>>>
> >>>> /mnt1/proc/namespaces/userns1/.../[namespaceX_name1 -- inode XXX]
> >>>>
> >>>> In another another /proc mount we have:
> >>>>
> >>>> /mnt2/proc/namespaces/userns1/.../[namespaceX_name1_synonym -- inode XXX]
> >>>>
> >>>> The virtualization is made per /proc (i.e., per pid ns). Container should
> >>>> receive either /mnt1/proc or /mnt2/proc on restore as it's /proc.
> >>>>
> >>>> There is no a sense of directory hierarchy for virtualization, since
> >>>> you can't use specific sub-directory as a root directory of /proc/namespaces
> >>>> to a container. You still have to introduce a new pid ns to have virtualized
> >>>> /proc.
> >>>
> >>> I think we can figure out how to implement this. As the first idea, we
> >>> can use the same way how /proc/net is implemented.
> >>>
> >>>>
> >>>>> * With the suggested structure, for each user namepsace, we will show
> >>>>>   only its subtree of namespaces. This looks more natural than
> >>>>>   filltering content of one directory.
> >>>>
> >>>> It's rather subjectively I think. /proc is related to pid ns, and user ns
> >>>> hierarchy does not look more natural for me.
> >>>
> >>> or /proc is wrong place for this
>