On 15/04/24, Eric W. Biederman wrote: > Richard Guy Briggs <rgb@xxxxxxxxxx> writes: > > On 15/04/22, Richard Guy Briggs wrote: > >> On 15/04/20, Eric W. Biederman wrote: > >> > Richard Guy Briggs <rgb@xxxxxxxxxx> writes: > >> > > >> > > The purpose is to track namespace instances in use by logged processes from the > >> > > perspective of init_*_ns by logging the namespace IDs (device ID and namespace > >> > > inode - offset). > >> > > >> > In broad strokes the user interface appears correct. > >> > > >> > Things that I see that concern me: > >> > > >> > - After Als most recent changes these inodes no longer live in the proc > >> > superblock so the device number reported in these patches is > >> > incorrect. > >> > >> Ok, found the patchset you're talking about: > >> 3d3d35b kill proc_ns completely > >> e149ed2 take the targets of /proc/*/ns/* symlinks to separate fs > >> f77c801 bury struct proc_ns in fs/proc > >> 33c4294 copy address of proc_ns_ops into ns_common > >> 6344c43 new helpers: ns_alloc_inum/ns_free_inum > >> 6496452 make proc_ns_operations work with struct ns_common * instead of void * > >> 3c04118 switch the rest of proc_ns_operations to working with &...->ns > >> ff24870 netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns > >> 58be2825 make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns > >> 435d5f4 common object embedded into various struct ....ns > >> > >> Ok, I've got some minor jigging to do to get inum too... > > > > Do I even need to report the device number anymore since I am concluding > > s_dev is never set (or always zero) in the nsfs filesystem by > > mount_pseudo() and isn't even mountable? > > We still need the dev. We do have a device number get_anon_bdev fills it in. Fine, it has a device number. There appears to be only one of these allocated per kernel. I can get it from &nsfs->fs_supers (and take the first instance given by hlist_for_each_entry and verify there are no others). Why do I need it, again? > > In fact, I never needed to > > report the device since proc ida/idr and inodes are kernel-global and > > namespace-oblivious. > > This is the bit I really want to keep to be forward looking. If we > every need to preserve the inode numbers across a migration we could > have different super blocks with different inode numbers for the same > namespace. I don't quite follow your argument here, but can accept that in the future we might add other namespace devices. I wonder if we might do that augmentation later and leave out the device number for now... > >> > - I am nervous about audit logs being flooded with users creating lots > >> > of namespaces. But that is more your lookout than mine. > >> > >> There was a thought to create a filter to en/disable this logging... > >> It is an auxiliary record to syscalls, so they can be ignored by userspace tools. > >> > >> > - unshare is not logging when it creates new namespaces. > >> > >> They are all covered: > >> sys_unshare > unshare_userns > create_user_ns > >> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_mnt_ns > >> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_utsname > clone_uts_ns > >> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_ipcs > get_ipc_ns > >> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_pid_ns > create_pid_namespace > >> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_net_ns > > Then why the special change to fork? That was not reflected on > the unshare path as far as I could see. Fork can specify more than one CLONE flag at once, so collecting them all in one statementn seemed helpful. setns can only set one at a time. > >> > As small numbers are nice and these inodes all live in their own > >> > superblock now we should be able to remove the games with > >> > PROC_DYNAMIC_FIRST and just use small numbers for these inodes > >> > everywhere. > >> > >> That is compelling if I can untangle the proc inode allocation code from the > >> ida/idr. Should be as easy as defining a new ns_alloc_inum (and ns_free_inum) > >> to use instead of proc_alloc_inum with its own ns_inum_ida and ns_inum_lock, > >> then defining a NS_DYNAMIC_FIRST and defining NS_{IPC,UTS,USER,PID}_INIT_INO in > >> the place of the existing PROC_*_INIT_INO. > > Something like that. Just a new ida/idr allocator specific to that > superblock. > > Yeah. It is somewhere on my todo, but I have been prioritizing getting > the bugs that look potentially expoloitable fixed in the mount > namespace. Al made things nice for one case but left a mess for a bunch > of others. > > >> > I honestly don't know how much we are going to care about namespace ids > >> > during migration. So far this is not a problem that has come up. > >> > >> Not for CRIU, but it will be an issue for a container auditor that aggregates > >> information from individually auditted hosts. > >> > >> > I don't think migration becomes a practical concern (other than > >> > interface wise) until achieve a non-init namespace auditd. The easy way > >> > to handle migration would be to log a setns of every process from their > >> > old namespaces to their new namespaces. As you appear to have a setns > >> > event defined. > >> > >> Again, this would be taken care of by a layer above that is container-aware > >> across multiple hosts. > > > >> > How to handle the more general case beyond audit remains unclear. I > >> > think it will be a little while yet before we start dealing with > >> > migrating applications that care. When we do we will either need to > >> > generate some kind of hot-plug event that userspace can respond to and > >> > discover all of the appropriate file-system nodes have changed, or we > >> > will need to build a mechanism in the kernel to preserve these numbers. > >> > >> I don't expect to need to preserve these numbers. The higher layer application > >> will be able to do that translation. > > We need to be very aware of what is happening. > > The situation I am concerned about looks something like. > > Program A: > fd1 = open(/proc/self/ns/net); > fstat(fd1, &stat1) > > ... later ... > > fd2 = open(/var/run/netns/johnny); > fstat(fd2, &stat2); > > if ((stat1.st_dev == stat2.st_dev) && > (stat1.st_ino == stat2.st_ino)) { > /* Same netns do something... */ > } > > > What happens when we migrate Program A with it's cached stat data of > of a network namespace file? > > This requires either a hotplug event that Program A listens to or that > the inode number and device number are preserved across migration. > > Exactly what we do depends on where we are when it comes up. But this > is not something some layer about the program can abstract it all out so > we don't need to worry about it. Ok, understood, we can't just punt this one to a higher layer... So this comes back to a question above, which is how do we determine which device it is from? Sounds like we need something added to ns_common or one of the 6 namespace types structs. > Eric - RGB -- Richard Guy Briggs <rbriggs@xxxxxxxxxx> Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat Remote, Ottawa, Canada Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545 _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers