>On 12/15/2017 08:56 AM, Stephen Smalley wrote:
>> On Fri, 2017-12-15 at 03:09 +0000, yangjihong wrote:
>>> On 12/15/2017 10:31 PM, yangjihong wrote:
>>>> On 12/14/2017 12:42 PM, Casey Schaufler wrote:
>>>>> On 12/14/2017 9:15 AM, Stephen Smalley wrote:
>>>>>> On Thu, 2017-12-14 at 09:00 -0800, Casey Schaufler wrote:
>>>>>>> On 12/14/2017 8:42 AM, Stephen Smalley wrote:
>>>>>>>> On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
>>>>>>>>> On 12/13/2017 7:18 AM, Stephen Smalley wrote:
>>>>>>>>>> On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I am doing stress testing on a 3.10 kernel (CentOS 7.4), constantly starting a number of docker containers with SELinux enabled, and after about 2 days the kernel panics with a softlockup:
>>>>>>>>>>>
>>>>>>>>>>> <IRQ>  [<ffffffff810bb778>] sched_show_task+0xb8/0x120
>>>>>>>>>>>  [<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
>>>>>>>>>>>  [<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
>>>>>>>>>>>  [<ffffffff811224d0>] ? watchdog_enable_all_cpus.part.4+0x40/0x40
>>>>>>>>>>>  [<ffffffff810abf82>] __hrtimer_run_queues+0xd2/0x260
>>>>>>>>>>>  [<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
>>>>>>>>>>>  [<ffffffff8104a477>] local_apic_timer_interrupt+0x37/0x60
>>>>>>>>>>>  [<ffffffff8166fd90>] smp_apic_timer_interrupt+0x50/0x140
>>>>>>>>>>>  [<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
>>>>>>>>>>> <EOI>  [<ffffffff812b4193>] ? sidtab_context_to_sid+0xb3/0x480
>>>>>>>>>>>  [<ffffffff812b41f0>] ? sidtab_context_to_sid+0x110/0x480
>>>>>>>>>>>  [<ffffffff812c0d15>] ? mls_setup_user_range+0x145/0x250
>>>>>>>>>>>  [<ffffffff812bd477>] security_get_user_sids+0x3f7/0x550
>>>>>>>>>>>  [<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
>>>>>>>>>>>  [<ffffffff812b1960>] ? sel_write_member+0x200/0x200
>>>>>>>>>>>  [<ffffffff812b01d8>] selinux_transaction_write+0x48/0x80
>>>>>>>>>>>  [<ffffffff811f444d>] vfs_write+0xbd/0x1e0
>>>>>>>>>>>  [<ffffffff811f4eef>] SyS_write+0x7f/0xe0
>>>>>>>>>>>  [<ffffffff8166d433>] system_call_fastpath+0x16/0x1b
>>>>>>>>>>>
>>>>>>>>>>> My opinion:
>>>>>>>>>>> When a docker container starts, it mounts overlay filesystems with different SELinux contexts, with mount points such as:
>>>>>>>>>>>
>>>>>>>>>>> overlay on /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADPARHHWY7:/var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/work)
>>>>>>>>>>> shm on /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c91327ca577b8f5d9d6a4adf218d4876/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",size=65536k)
>>>>>>>>>>> overlay on /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHPAVRCRSS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
>>>>>>>>>>> shm on /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bcedc1dcf05a65866458523ffd4a71614/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",size=65536k)
>>>>>>>>>>>
>>>>>>>>>>> sidtab_search_context checks whether the context is already in the sidtab list; if it is not found, a new node is generated and inserted into the list. As the number of containers grows, so does the number of context nodes. In our test the final number of nodes reached 300,000+, at which point sidtab_context_to_sid needs 100-200ms per call, which leads to the system softlockup.
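
(To make the cost concrete, below is a rough userspace model of the pattern I am describing. It is only an illustration with made-up names, not the actual code in security/selinux/ss/sidtab.c -- in the sketch a context is a string, whereas the kernel compares context structures. The point is that every context that has not been seen before forces a walk of the whole list before a new node is appended, so each lookup gets slower as more containers are started with fresh contexts.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

struct node {
        char *context;          /* security context string */
        unsigned int sid;       /* SID handed out for this context */
        struct node *next;
};

static struct node *head;
static unsigned int next_sid = 1;

/* Reverse lookup by linear scan: cost is O(number of nodes) for every miss. */
static unsigned int context_to_sid(const char *context)
{
        struct node *n, **tail = &head;

        for (n = head; n; tail = &n->next, n = n->next)
                if (!strcmp(n->context, context))
                        return n->sid;

        n = calloc(1, sizeof(*n));
        n->context = strdup(context);
        n->sid = next_sid++;
        *tail = n;              /* append: the list only ever grows */
        return n->sid;
}

int main(void)
{
        char ctx[128];
        unsigned int i;
        clock_t start = clock();

        /* Every "container" gets a fresh category pair, so every lookup
         * misses; the machine in the report above reached 300,000+ nodes. */
        for (i = 0; i < 40000; i++) {
                snprintf(ctx, sizeof(ctx),
                         "system_u:object_r:svirt_sandbox_file_t:s0:c%u,c%u",
                         i % 1024, i / 1024);
                context_to_sid(ctx);
                if ((i + 1) % 10000 == 0)
                        printf("%6u nodes, %.2fs elapsed\n", i + 1,
                               (double)(clock() - start) / CLOCKS_PER_SEC);
        }
        return 0;
}

At 300,000+ nodes every miss is 300,000+ comparisons, which is consistent with the 100-200ms per call observed above.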
>>>>>>>>>>> Is this a selinux bug? When a filesystem is unmounted, why is its context node not deleted? I cannot find the relevant function to delete the node in sidtab.c.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for reading and looking forward to your reply.

>>>>>>>>>> So, does docker just keep allocating a unique category set for every new container, never reusing them even if the container is destroyed? That would be a bug in docker IMHO. Or are you creating an unbounded number of containers and never destroying the older ones?

>>>>>>>>> You can't reuse the security context. A process in ContainerA sends a labeled packet to MachineB. ContainerA goes away and its context is recycled in ContainerC. MachineB responds some time later, again with a labeled packet. ContainerC gets information intended for ContainerA, and uses the information to take over the Elbonian government.

>>>>>>>> Docker isn't using labeled networking (nor is anything else by default; it is only enabled if explicitly configured).

>>>>>>> If labeled networking weren't an issue we'd have full security module stacking by now. Yes, it's an edge case. If you want to use labeled NFS or a local filesystem that gets mounted in each container (don't tell me that nobody would do that) you've got the same problem.

>>>>>> Even if someone were to configure labeled networking, Docker is not presently relying on that or SELinux network enforcement for any security properties, so it really doesn't matter.

>>>>> True enough. I can imagine a use case, but as you point out, it would be a very complex configuration and coordination exercise using SELinux.

>>>>>> And if they wanted to do that, they'd have to coordinate category assignments across all systems involved, for which no facility exists AFAIK. If you have two docker instances running on different hosts, I'd wager that they can hand out the same category sets today to different containers.
>>>>>>
>>>>>> With respect to labeled NFS, that's also not the default for nfs mounts, so again it is a custom configuration and Docker isn't relying on it for any guarantees today. For local filesystems, they would normally be context-mounted or using genfscon rather than xattrs in order to be accessible to the container, thus no persistent storage of the category sets.

>>>> Well Kubernetes and OpenShift do set the labels to be the same within a project, and they can manage across nodes. But yes, we are not using labeled networking at this point.

>>>>> I know that is the intended configuration, but I see people do all sorts of stoopid things for what they believe are good reasons. Unfortunately, lots of people count on containers to provide isolation, but create "solutions" for data sharing that defeat it.

>>>>>> Certainly docker could provide an option to not reuse category sets, but making that the default is not sane and just guarantees exhaustion of the SID and context space (just create and tear down lots of containers every day or more frequently).

>>>>> It seems that Docker might have a similar issue with UIDs, but it takes longer to run out of UIDs than sidtab entries.
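
(For a sense of scale, and assuming Docker's usual MCS scheme of drawing two categories from c0-c1023, as in the contexts quoted above: there are at most 1024 * 1023 / 2 = 523,776 distinct category pairs, so handing out a fresh pair per container and never reusing it would exhaust that space after roughly half a million container starts.)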
>>>>>>>>>> On the selinux userspace side, we'd also like to eliminate the use of /sys/fs/selinux/user (sel_write_user -> security_get_user_sids) entirely, which is what triggered this for you.
>>>>>>>>>>
>>>>>>>>>> We cannot currently delete a sidtab node because we have no way of knowing if there are any lingering references to the SID. Fixing that would require reference-counted SIDs, which goes beyond just SELinux since SIDs/secids are returned by LSM hooks and cached in other kernel data structures.

>>>>>>>>> You could delete a sidtab node. The code already deals with unfindable SIDs. The issue is that eventually you run out of SIDs. Then you are forced to recycle SIDs, which leads to the overthrow of the Elbonian government.

>>>>>>>> We don't know when we can safely delete a sidtab node since SIDs aren't reference counted and we can't know whether it is still in use somewhere in the kernel. Doing so prematurely would lead to the SID being remapped to the unlabeled context, and then likely to undesired denials.

>>>>>>> I would suggest that if you delete a sidtab node and someone comes along later and tries to use it, that denial is exactly what you would desire. I don't see any other rational action.

>>>>>> Yes, if we know that the SID wasn't in use at the time we tore it down. But if we're just randomly deleting sidtab entries based on age or something (since we have no reference count), we'll almost certainly encounter situations where a SID hasn't been accessed in a long time but is still being legitimately cached somewhere. Just a file that hasn't been accessed in a while might have that SID still cached in its inode security blob, or anywhere else.

>>>>>>>>>> sidtab_search_context() could no doubt be optimized for the negative case; there was an earlier optimization for the positive case by adding a cache to sidtab_context_to_sid() prior to calling it. It's a reverse lookup in the sidtab.

>>>>>>>>> This seems like a bad idea.

>>>>>>>> Not sure what you mean, but it can certainly be changed to at least use a hash table for these reverse lookups.

>>> Thanks for the reply and discussion.
>>> I think the docker container is only one case. Is it possible that something similar could be triggered by some means of attack, constantly growing the SID list and eventually leading to a system panic?
>>>
>>> I think the issue is that it takes too long to search for a SID node when the SID list is very large. If the node's data structure (e.g. a tree structure) or the search algorithm were optimized so that traversing all nodes stays fast even with many nodes, maybe that would solve the problem.
>>> Or sidtab.c could provide a "delete_sidtab_node" interface to delete the SID node when a filesystem is unmounted. Because the SID is useless once the fs is unmounted, deleting it would keep the size of the SID list under control.
>>>
>>> Thanks for reading and looking forward to your reply.

>> We cannot safely delete entries in the sidtab without first adding reference counting of SIDs, which goes beyond just SELinux since they are cached in other kernel data structures and returned by LSM hooks. That's a non-trivial undertaking.
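
(For my own understanding, the hash-table reverse lookup mentioned above would be something along the following lines. This is only a userspace sketch with invented names, not a patch against sidtab.c: the context-to-SID lookup is keyed by a hash of the context, so a miss walks one short bucket chain instead of every node.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HASH_BUCKETS 4096

struct entry {
        char *context;          /* security context string */
        unsigned int sid;
        struct entry *next;     /* chain within one bucket */
};

static struct entry *buckets[HASH_BUCKETS];
static unsigned int next_sid = 1;

/* Simple string hash (djb2); the kernel would hash the fields of the
 * context structure instead of a string. */
static unsigned int hash(const char *s)
{
        unsigned int h = 5381;

        while (*s)
                h = h * 33 + (unsigned char)*s++;
        return h % HASH_BUCKETS;
}

static unsigned int context_to_sid(const char *context)
{
        unsigned int b = hash(context);
        struct entry *e;

        for (e = buckets[b]; e; e = e->next)
                if (!strcmp(e->context, context))
                        return e->sid;

        /* Not found: allocate a SID and insert at the head of the bucket,
         * so a miss costs one short chain walk rather than a scan of all
         * existing nodes. */
        e = calloc(1, sizeof(*e));
        e->context = strdup(context);
        e->sid = next_sid++;
        e->next = buckets[b];
        buckets[b] = e;
        return e->sid;
}

int main(void)
{
        const char *ctx = "system_u:object_r:svirt_sandbox_file_t:s0:c414,c873";

        printf("first lookup:  sid=%u\n", context_to_sid(ctx));
        printf("second lookup: sid=%u\n", context_to_sid(ctx));
        return 0;
}

As long as the number of buckets stays reasonable relative to the number of nodes, the cost of a lookup stays roughly constant instead of growing with the total number of contexts.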
>>
>> Far more practical in the near term would be to introduce a hash table or other mechanism for efficient reverse lookups in the sidtab. Are you offering to implement that or just requesting it?

Because I'm not very familiar with the overall architecture of SELinux, I may not be able to offer to implement it, sorry. Please tell me what I can do to help. If there is any progress (i.e. once the solution or optimization method is decided), could you please let me know? Thanks!

>> Independent of that, docker should support reuse of category sets when containers are deleted, at least as an option and probably as the default.
>>
> Docker does reuse categories of containers that are removed, by default.

Thanks for reading and looking forward to your reply.
Best wishes!