On 31.07.2020 01:13, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes:
>
>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>> Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes:
>>>
>>>> Currently, there is no way to list or iterate all or a subset of namespaces
>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>>> but some may also exist only as open files, which are not attached to a process.
>>>> When a namespace open fd is sent over a unix socket and then closed, it is
>>>> impossible to know whether the namespace exists or not.
>>>>
>>>> Also, even if a namespace is exposed as attached to a process or as an open file,
>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>> the cost multiplies with the number of tasks and fds.
>>>
>>> I am very dubious about this.
>>>
>>> I have been avoiding exactly this kind of interface because it can
>>> create rather fundamental problems with checkpoint restart.
>>
>> restart/restore :)
>>
>>> You do have some filtering and the filtering is not based on current.
>>> Which is good.
>>>
>>> A view that is relative to a user namespace might be ok. It almost
>>> certainly does better as its own little filesystem than as an extension
>>> to proc though.
>>>
>>> The big thing we want to ensure is that if you migrate you can restore
>>> everything. I don't see how you will be able to restore these files
>>> after migration. Anything like this without having a complete
>>> checkpoint/restore story is a non-starter.
>>
>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
>>
>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
>> problem here.
>
> An obvious difference is that you are adding the inode to
> the file name.
> Which means that now you really do have to preserve the
> inode numbers during process migration.
>
> Which means now we have to do all of the work to make inode number
> restoration possible. Which means now we need to have multiple
> instances of nsfs so that we can restore inode numbers.
>
> I think this is still possible but we have been delaying figuring out
> how to restore inode numbers long enough that there may be actual technical
> problems making it happen.

Yeah, this matters. But it does not look like a dead end. We just need to
change the names under which the namespaces are exported to a particular fs,
and to support rename().

Before introducing a fundamentally new filesystem type for this, can't this be
solved in the current /proc?

Alexey, is rename() prohibited for /proc fs?

> Now maybe CRIU can handle the names of the files changing during
> migration but you have just increased the level of difficulty for doing
> that.
>
>> If you have any specific worries, let's discuss them.
>
> I was asking and I am asking that it be described in the patch
> description how a container using this feature can be migrated
> from one machine to another. This code is so close to being problematic
> that we need to be very careful we don't fundamentally break CRIU while
> trying to make its job simpler and easier.
>
>> CC: Pavel Tikhomirov, CRIU maintainer, who knows everything about namespaces C/R.
>>
>>> Further by not going through the processes it looks like you are
>>> bypassing the existing permission checks. Which has the potential
>>> to allow someone to use a namespace who would not be able to otherwise.
>>
>> I agree, and I wrote to Christian, that permissions should be more strict.
>> This just should be formalized. Let's discuss this.
>>
>>> So I think this goes one step too far but I am willing to be persuaded
>>> otherwise.
>>>
>
> Eric
>
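
For reference, here is a minimal userspace sketch of the /proc/*/ns/* iteration that the patch description calls slow. It is only an illustration, not code from the patch: the function name list_task_namespaces and its proc_root parameter are hypothetical. It relies on the fact that each /proc/[pid]/ns/* entry is a symlink whose target has the form "type:[inode]".

```python
import os
import re

# Symlink targets under /proc/[pid]/ns/ look like "net:[4026531992]".
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def list_task_namespaces(proc_root="/proc"):
    """Collect distinct (type, inode) pairs seen via <proc_root>/<pid>/ns/*.

    Hypothetical helper for illustration. The work done is roughly
    (number of tasks) x (namespace types per task), which is the cost
    the patch description complains about.
    """
    seen = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-task entries such as "self" or "sys"
        ns_dir = os.path.join(proc_root, entry, "ns")
        try:
            names = os.listdir(ns_dir)
        except OSError:
            continue  # task exited meanwhile, or access was denied
        for name in names:
            try:
                target = os.readlink(os.path.join(ns_dir, name))
            except OSError:
                continue
            match = NS_LINK.match(target)
            if match:
                seen.add((match.group(1), int(match.group(2))))
    return seen
```

Note that a walk like this only finds namespaces that some task is currently attached to. A namespace kept alive only through an open fd does not show up, and scanning /proc/*/fd/* as well multiplies the work further. The inode numbers it returns are the same values whose preservation across migration is being discussed above.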