On 15/04/22, Richard Guy Briggs wrote:
> On 15/04/20, Eric W. Biederman wrote:
> > Richard Guy Briggs <rgb@xxxxxxxxxx> writes:
> >
> > > The purpose is to track namespace instances in use by logged processes from the
> > > perspective of init_*_ns by logging the namespace IDs (device ID and namespace
> > > inode - offset).
> >
> > In broad strokes the user interface appears correct.
> >
> > Things that I see that concern me:
> >
> > - After Al's most recent changes these inodes no longer live in the proc
> >   superblock so the device number reported in these patches is
> >   incorrect.
>
> Ok, found the patchset you're talking about:
>     3d3d35b kill proc_ns completely
>     e149ed2 take the targets of /proc/*/ns/* symlinks to separate fs
>     f77c801 bury struct proc_ns in fs/proc
>     33c4294 copy address of proc_ns_ops into ns_common
>     6344c43 new helpers: ns_alloc_inum/ns_free_inum
>     6496452 make proc_ns_operations work with struct ns_common * instead of void *
>     3c04118 switch the rest of proc_ns_operations to working with &...->ns
>     ff24870 netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
>     58be2825 make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
>     435d5f4 common object embedded into various struct ....ns
>
> Ok, I've got some minor jigging to do to get inum too...

Do I even need to report the device number anymore? I am concluding that
s_dev is never set (or is always zero) in the nsfs filesystem by
mount_pseudo(), and that nsfs isn't even mountable. In fact, I never needed
to report the device, since the proc ida/idr and inodes are kernel-global and
namespace-oblivious.

> > - I am nervous about audit logs being flooded with users creating lots
> >   of namespaces. But that is more your lookout than mine.
>
> There was a thought to create a filter to en/disable this logging...
> It is an auxiliary record to syscalls, so they can be ignored by userspace tools.
>
> > - unshare is not logging when it creates new namespaces.
>
> They are all covered:
>     sys_unshare
>         unshare_userns
>             create_user_ns
>     sys_unshare
>         unshare_nsproxy_namespaces
>             create_new_namespaces
>                 copy_mnt_ns
>     sys_unshare
>         unshare_nsproxy_namespaces
>             create_new_namespaces
>                 copy_utsname
>                     clone_uts_ns
>     sys_unshare
>         unshare_nsproxy_namespaces
>             create_new_namespaces
>                 copy_ipcs
>                     get_ipc_ns
>     sys_unshare
>         unshare_nsproxy_namespaces
>             create_new_namespaces
>                 copy_pid_ns
>                     create_pid_namespace
>     sys_unshare
>         unshare_nsproxy_namespaces
>             create_new_namespaces
>                 copy_net_ns
>
> > As small numbers are nice and these inodes all live in their own
> > superblock now we should be able to remove the games with
> > PROC_DYNAMIC_FIRST and just use small numbers for these inodes
> > everywhere.
>
> That is compelling if I can untangle the proc inode allocation code from the
> ida/idr. Should be as easy as defining a new ns_alloc_inum (and ns_free_inum)
> to use instead of proc_alloc_inum with its own ns_inum_ida and ns_inum_lock,
> then defining a NS_DYNAMIC_FIRST and defining NS_{IPC,UTS,USER,PID}_INIT_INO in
> the place of the existing PROC_*_INIT_INO.
>
> > I have answered your comments below.
>
> More below...
>
> > > 1/10 exposes proc's ns entries structure which lists a number of useful
> > > operations per namespace type for other subsystems to use.
> > >
> > > 2/10 proc_ns: define PROC_*_INIT_INO in terms of PROC_DYNAMIC_FIRST
> > >
> > > 3/10 provides an example of usage for audit_log_task_info() which is used by
> > > syscall audits, among others. audit_log_task() and audit_common_recv_message()
> > > would be other potential use cases.
> > >
> > > Proposed output format:
> > > This differs slightly from Aristeu's patch because of the label conflict with
> > > "pid=" due to including it in existing records rather than it being a separate
> > > record. It has now returned to being a separate record.
> > > The proc device major/minor are listed in hexadecimal and namespace IDs are
> > > the proc inode minus the base offset.
> > > type=NS_INFO msg=audit(1408577535.306:82): dev=00:03 netns=3 utsns=-3 ipcns=-4 pidns=-1 userns=-2 mntns=0
> > >
> > > 4/10 changes audit startup from __initcall to subsys_initcall to get it started
> > > earlier to be able to receive initial namespace log messages.
> > >
> > > 5/10 tracks the creation and deletion of namespaces, listing the type of
> > > namespace instance, proc device ID, related namespace id if there is one and
> > > the newly minted namespace ID.
> > >
> > > Proposed output format for initial namespace creation:
> > > type=AUDIT_NS_INIT_UTS msg=audit(1408577534.868:5): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_utsns=(none) utsns=-3 res=1
> > > type=AUDIT_NS_INIT_USER msg=audit(1408577534.868:6): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_userns=(none) userns=-2 res=1
> > > type=AUDIT_NS_INIT_PID msg=audit(1408577534.868:7): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_pidns=(none) pidns=-1 res=1
> > > type=AUDIT_NS_INIT_MNT msg=audit(1408577534.868:8): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_mntns=(none) mntns=0 res=1
> > > type=AUDIT_NS_INIT_IPC msg=audit(1408577534.868:9): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_ipcns=(none) ipcns=-4 res=1
> > > type=AUDIT_NS_INIT_NET msg=audit(1408577533.500:10): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_netns=(none) netns=2 res=1
> > >
> > > And a CLONE action would result in:
> > > type=AUDIT_NS_INIT_NET msg=audit(1408577535.306:81): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 dev=00:03 old_netns=2 netns=3 res=1
> > >
> > > While deleting a namespace would result in:
> > > type=AUDIT_NS_DEL_MNT msg=audit(1408577552.221:85): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 dev=00:03 mntns=4 res=1
> > >
> > > 6/10 accepts a PID from userspace and requests logging an AUDIT_NS_INFO record
> > > type (CAP_AUDIT_CONTROL required).
> > >
> > > 7/10 is a macro for CLONE_NEW_* flags.
> > >
> > > 8/10 adds auditing on creation of namespace(s) in fork.
> > >
> > > 9/10 adds auditing a change of namespace on setns.
> > >
> > > 10/10 attaches an AUDIT_NS_INFO record to AUDIT_VIRT_CONTROL records
> > > (CAP_AUDIT_WRITE required).
> > >
> > >
> > > v5 -> v6:
> > >     Switch to using namespace ID based on namespace proc inode minus base offset
> > >     Added proc device ID to qualify proc inode reference
> > >     Eliminate exposed /proc interface
> > >
> > > v4 -> v5:
> > >     Clean up prototypes for dependencies on CONFIG_NAMESPACES.
> > >     Add AUDIT_NS_INFO record type to AUDIT_VIRT_CONTROL record.
> > >     Log AUDIT_NS_INFO with PID.
> > >     Move /proc/<pid>/ns_* patches to end of patchset to deprecate them.
> > >     Log on changing ns (setns).
> > >     Log on creating new namespaces when forking.
> > >     Added a macro for CLONE_NEW*.
> > >
> > > v3 -> v4:
> > >     Separate out the NS_INFO message from the SYSCALL message.
> > >     Moved audit_log_namespace_info() out of audit_log_task_info().
> > >     Use a separate message type per namespace type for each of INIT/DEL.
> > >     Make ns= easier to search across NS_INFO and NS_INIT/DEL_XXX msg types.
> > >     Add /proc/<pid>/ns/ documentation.
> > >     Fix dynamic initial ns logging.
> > >
> > > v2 -> v3:
> > >     Use atomic64_t in ns_serial to simplify it.
> > >     Avoid function duplication in proc, keying on dentry.
> > >     Squash down audit patch to avoid rcu sleep issues.
> > >     Add tracking for creation and deletion of namespace instances.
> > >
> > > v1 -> v2:
> > >     Avoid rollover by switching from an int to a long long.
> > >     Change rollover behaviour from simply avoiding zero to raising a BUG.
> > >     Expose serial numbers in /proc/<pid>/ns/*_snum.
> > >     Expose ns_entries and use it in audit.
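
For anyone wanting to consume these records from userspace, here is a rough
sketch (not part of the patchset) of splitting the proposed record format
quoted above into its header, device and namespace IDs. The function name
parse_ns_record is hypothetical, and the layout assumed is only what the
sample records show: a "type=... msg=audit(stamp:serial):" header followed by
space-separated key=value pairs, with "(none)" standing in for a missing old
namespace.

```python
import re

# Assumed record prefix, taken from the sample records in the cover letter:
#   type=<TYPE> msg=audit(<secs>.<msecs>:<serial>): key=value key=value ...
HEADER = re.compile(
    r"type=(?P<type>\S+)\s+msg=audit\((?P<stamp>\d+\.\d+):(?P<serial>\d+)\):\s*(?P<body>.*)"
)


def parse_ns_record(line):
    """Split one proposed NS_* audit record into header, dev and namespace IDs."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError("not an audit record: %r" % line)

    # The body is a flat run of space-separated key=value pairs.
    fields = dict(
        pair.split("=", 1) for pair in match.group("body").split() if "=" in pair
    )

    namespaces = {}
    for key, value in fields.items():
        # Namespace fields are the ones whose key ends in "ns"
        # (netns, utsns, old_netns, ...); pid, uid, dev, res and so on do not.
        if key.endswith("ns"):
            namespaces[key] = None if value == "(none)" else int(value)

    return {
        "type": match.group("type"),
        "timestamp": match.group("stamp"),
        "serial": int(match.group("serial")),
        "dev": fields.get("dev"),
        "namespaces": namespaces,
    }


# Sample NS_INFO record copied from the proposed output format.
sample = (
    "type=NS_INFO msg=audit(1408577535.306:82): dev=00:03 "
    "netns=3 utsns=-3 ipcns=-4 pidns=-1 userns=-2 mntns=0"
)
record = parse_ns_record(sample)
```

Keeping the old_* fields in the same dictionary makes it straightforward for
an aggregating tool to follow a namespace from its parent, which is the kind
of higher-layer translation discussed further down.
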
> > >
> > >
> > > Notes:
> > > As for CAP_AUDIT_READ, a patchset has been accepted upstream to check
> > > capabilities of userspace processes that try to join netlink broadcast groups.
> > >
> > > This set does not try to solve the non-init namespace audit messages and
> > > auditd problem yet. That will come later, likely with additional auditd
> > > instances running in another namespace with a limited ability to influence the
> > > master auditd. I echo Eric B's idea that messages destined for different
> > > namespaces would have to be tailored for that namespace with references that
> > > make sense (such as the right pid number reported to that pid namespace, and
> > > not leaking info about parents or peers).
> > >
> > > Questions:
> > > Is there a way to link serial numbers of namespaces involved in migration of a
> > > container to another kernel? It sounds like what is needed is a part of a
> > > management application that is able to pull the audit records from constituent
> > > hosts to build an audit trail of a container.
> >
> > I honestly don't know how much we are going to care about namespace ids
> > during migration. So far this is not a problem that has come up.
>
> Not for CRIU, but it will be an issue for a container auditor that aggregates
> information from individually audited hosts.
>
> > I don't think migration becomes a practical concern (other than
> > interface wise) until we achieve a non-init namespace auditd. The easy way
> > to handle migration would be to log a setns of every process from their
> > old namespaces to their new namespaces. As you appear to have a setns
> > event defined.
>
> Again, this would be taken care of by a layer above that is container-aware
> across multiple hosts.
>
> > How to handle the more general case beyond audit remains unclear. I
> > think it will be a little while yet before we start dealing with
> > migrating applications that care.
> > When we do we will either need to
> > generate some kind of hot-plug event that userspace can respond to and
> > discover all of the appropriate file-system nodes have changed, or we
> > will need to build a mechanism in the kernel to preserve these numbers.
>
> I don't expect to need to preserve these numbers. The higher layer application
> will be able to do that translation.
>
> > I really don't know which solution we will wind up with in the kernel at
> > this point.
> >
> > > What additional events should list this information?
> >
> > At least unshare.
>
> Already covered as noted above. If it is a brand new namespace, it will show
> the old one as "(none)" (or maybe zero now that we are looking at renumbering
> the NS inodes). If it is an unshared one, it will show the old one from which
> it was unshared.
>
> > > Does this present any problematic information leaks? Only CAP_AUDIT_CONTROL
> > > (and now CAP_AUDIT_READ) in init_user_ns can get to this information in
> > > the init namespace at the moment from audit.
> >
> > Good question. Today access to this information is generally guarded
> > with CAP_SYS_PTRACE.
> >
> > I suspect for some of audit's tracing features like this one we should
> > also use CAP_SYS_PTRACE so that we have a consistent set of checks for
> > getting information about applications.
>
> I assume CAP_SYS_PTRACE is orthogonal to CAP_AUDIT_{CONTROL,READ} and that
> CAP_SYS_PTRACE would need to be insufficient to get that information.
>
>
> Thanks for your thoughtful feedback, Eric.
>
> > Eric
> >
> > > Richard Guy Briggs (10):
> > >   namespaces: expose ns_entries
> > >   proc_ns: define PROC_*_INIT_INO in terms of PROC_DYNAMIC_FIRST
> > >   audit: log namespace ID numbers
> > >   audit: initialize at subsystem time rather than device time
> > >   audit: log creation and deletion of namespace instances
> > >   audit: dump namespace IDs for pid on receipt of AUDIT_NS_INFO
> > >   sched: add a macro to ref all CLONE_NEW* flags
> > >   fork: audit on creation of new namespace(s)
> > >   audit: log on switching namespace (setns)
> > >   audit: emit AUDIT_NS_INFO record with AUDIT_VIRT_CONTROL record
> > >
> > >  fs/namespace.c                   |   13 +++
> > >  fs/proc/generic.c                |    3 +-
> > >  fs/proc/namespaces.c             |    2 +-
> > >  include/linux/audit.h            |   20 +++++
> > >  include/linux/proc_ns.h          |   10 ++-
> > >  include/uapi/linux/audit.h       |   21 +++++
> > >  include/uapi/linux/sched.h       |    6 ++
> > >  ipc/namespace.c                  |   12 +++
> > >  kernel/audit.c                   |  169 +++++++++++++++++++++++++++++++++++++-
> > >  kernel/auditsc.c                 |    2 +
> > >  kernel/fork.c                    |    3 +
> > >  kernel/nsproxy.c                 |    4 +
> > >  kernel/pid_namespace.c           |   13 +++
> > >  kernel/user_namespace.c          |   13 +++
> > >  kernel/utsname.c                 |   12 +++
> > >  net/core/net_namespace.c         |   12 +++
> > >  security/integrity/ima/ima_api.c |    2 +
> > >  17 files changed, 309 insertions(+), 8 deletions(-)
>
> - RGB

- RGB

--
Richard Guy Briggs <rbriggs@xxxxxxxxxx>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545