On 06.08.2020 11:05, Andrei Vagin wrote:
> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
>> On 31.07.2020 01:13, Eric W. Biederman wrote:
>>> Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes:
>>>
>>>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>>>> Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes:
>>>>>
>>>>>> Currently, there is no a way to list or iterate all or subset of namespaces
>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>>>>> but some also may be as open files, which are not attached to a process.
>>>>>> When a namespace open fd is sent over unix socket and then closed, it is
>>>>>> impossible to know whether the namespace exists or not.
>>>>>>
>>>>>> Also, even if namespace is exposed as attached to a process or as open file,
>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>>>> this multiplies at tasks and fds number.
>>>>>
>>>>> I am very dubious about this.
>>>>>
>>>>> I have been avoiding exactly this kind of interface because it can
>>>>> create rather fundamental problems with checkpoint restart.
>>>>
>>>> restart/restore :)
>>>>
>>>>> You do have some filtering and the filtering is not based on current.
>>>>> Which is good.
>>>>>
>>>>> A view that is relative to a user namespace might be ok. It almost
>>>>> certainly does better as it's own little filesystem than as an extension
>>>>> to proc though.
>>>>>
>>>>> The big thing we want to ensure is that if you migrate you can restore
>>>>> everything. I don't see how you will be able to restore these files
>>>>> after migration. Anything like this without having a complete
>>>>> checkpoint/restore story is a non-starter.
>>>>
>>>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
>>>>
>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
>>>> problem here.
>>>
>>> An obvious diffference is that you are adding the inode to the inode to
>>> the file name. Which means that now you really do have to preserve the
>>> inode numbers during process migration.
>>>
>>> Which means now we have to do all of the work to make inode number
>>> restoration possible. Which means now we need to have multiple
>>> instances of nsfs so that we can restore inode numbers.
>>>
>>> I think this is still possible but we have been delaying figuring out
>>> how to restore inode numbers long enough that may be actual technical
>>> problems making it happen.
>>
>> Yeah, this matters. But it looks like here is not a dead end. We just need
>> change the names the namespaces are exported to particular fs and to support
>> rename().
>>
>> Before introduction a principally new filesystem type for this, can't
>> this be solved in current /proc?
>
> do you mean to introduce names for namespaces which users will be able
> to change? By default, this can be uuid.

Yes, I mean this.

I can't give a final answer about UUID yet, but I planned to show some
default names, which are based on namespace type and inode number. Completely
custom names for any /proc by default would waste too much memory.

So, I think a good way would be:

1) Introduce a function that returns a hash/uuid based on ino, ns type and
   a static random seed, which is generated at boot;

2) Use the hash/uuid as the default names in the newly created /proc/namespaces:
   pid-{hash/uuid(ino, "pid")}

3) Allow rename, and allocate space only for renamed names.

Maybe 2 and 3 will be implemented as shrinkable and non-shrinkable dentries.

> And I have a suggestion about the structure of /proc/namespaces/.
>
> Each namespace is owned by one of user namespaces. Maybe it makes sense
> to group namespaces by their user-namespaces?
>
> /proc/namespaces/
>                  user
>                  mnt-X
>                  mnt-Y
>                  pid-X
>                  uts-Z
>                  user-X/
>                         user
>                         mnt-A
>                         mnt-B
>                         user-C
>                         user-C/
>                                user
>                  user-Y/
>                         user

Hm, I don't think that a user namespace is a generic key value for everybody.
For people's generic tasks a user namespace is just one namespace type among
the other namespace types. To me it would look a bit strange to iterate over
some user namespaces in order to build a container net topology.

> Do we try to invent cgroupfs for namespaces?

Could you clarify your thought?
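
For illustration only, the default-name scheme from steps 1-3 earlier in this
message could be sketched in userspace Python roughly as below. This is not
kernel code and not a proposed implementation: BLAKE2b is just a stand-in for
whatever keyed hash the kernel would use, and BOOT_SEED, default_ns_name() and
NamespaceNames are hypothetical names made up for the sketch.

```python
import hashlib
import os

# Hypothetical stand-in for the static random seed generated once at boot.
BOOT_SEED = os.urandom(16)


def default_ns_name(ns_type: str, ino: int, seed: bytes = BOOT_SEED) -> str:
    """Step 1 and 2: derive a default name like 'pid-<hash>' from
    (ino, ns type, seed) using a keyed hash (BLAKE2b is an assumption)."""
    h = hashlib.blake2b(key=seed, digest_size=8)
    h.update(ns_type.encode())
    h.update(ino.to_bytes(8, "little"))
    return f"{ns_type}-{h.hexdigest()}"


class NamespaceNames:
    """Step 3: memory is spent only on renamed entries; default names
    are recomputed on demand and never stored."""

    def __init__(self, seed: bytes = BOOT_SEED):
        self.seed = seed
        self.renamed = {}  # (ns_type, ino) -> custom name

    def name(self, ns_type: str, ino: int) -> str:
        custom = self.renamed.get((ns_type, ino))
        if custom is not None:
            return custom
        return default_ns_name(ns_type, ino, self.seed)

    def rename(self, ns_type: str, ino: int, new_name: str) -> None:
        self.renamed[(ns_type, ino)] = new_name
```

Since the default name depends on the per-boot seed and not on the inode number
alone, a restored namespace would not have to get the same inode number back;
it could simply be renamed to the name it had before migration.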