On 12/15/2017 08:56 AM, Stephen Smalley wrote:
> On Fri, 2017-12-15 at 03:09 +0000, yangjihong wrote:
>> On 12/15/2017 10:31 PM, yangjihong wrote:
>>> On 12/14/2017 12:42 PM, Casey Schaufler wrote:
>>>> On 12/14/2017 9:15 AM, Stephen Smalley wrote:
>>>>> On Thu, 2017-12-14 at 09:00 -0800, Casey Schaufler wrote:
>>>>>> On 12/14/2017 8:42 AM, Stephen Smalley wrote:
>>>>>>> On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
>>>>>>>> On 12/13/2017 7:18 AM, Stephen Smalley wrote:
>>>>>>>>> On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I am stress testing a 3.10 kernel (CentOS 7.4) by constantly starting large numbers of docker containers with selinux enabled, and after about 2 days the kernel hits a softlockup panic:
>>>>>>>>>>
>>>>>>>>>> <IRQ>
>>>>>>>>>> [<ffffffff810bb778>] sched_show_task+0xb8/0x120
>>>>>>>>>> [<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
>>>>>>>>>> [<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
>>>>>>>>>> [<ffffffff811224d0>] ? watchdog_enable_all_cpus.part.4+0x40/0x40
>>>>>>>>>> [<ffffffff810abf82>] __hrtimer_run_queues+0xd2/0x260
>>>>>>>>>> [<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
>>>>>>>>>> [<ffffffff8104a477>] local_apic_timer_interrupt+0x37/0x60
>>>>>>>>>> [<ffffffff8166fd90>] smp_apic_timer_interrupt+0x50/0x140
>>>>>>>>>> [<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
>>>>>>>>>> <EOI>
>>>>>>>>>> [<ffffffff812b4193>] ? sidtab_context_to_sid+0xb3/0x480
>>>>>>>>>> [<ffffffff812b41f0>] ? sidtab_context_to_sid+0x110/0x480
>>>>>>>>>> [<ffffffff812c0d15>] ? mls_setup_user_range+0x145/0x250
>>>>>>>>>> [<ffffffff812bd477>] security_get_user_sids+0x3f7/0x550
>>>>>>>>>> [<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
>>>>>>>>>> [<ffffffff812b1960>] ? sel_write_member+0x200/0x200
>>>>>>>>>> [<ffffffff812b01d8>] selinux_transaction_write+0x48/0x80
>>>>>>>>>> [<ffffffff811f444d>] vfs_write+0xbd/0x1e0
>>>>>>>>>> [<ffffffff811f4eef>] SyS_write+0x7f/0xe0
>>>>>>>>>> [<ffffffff8166d433>] system_call_fastpath+0x16/0x1b
>>>>>>>>>>
>>>>>>>>>> My opinion: when a docker container starts, it mounts an overlay filesystem with a different selinux context. The mount points look like this:
>>>>>>>>>>
>>>>>>>>>> overlay on /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADPARHHWY7:/var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/work)
>>>>>>>>>> shm on /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c91327ca577b8f5d9d6a4adf218d4876/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",size=65536k)
>>>>>>>>>> overlay on /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHPAVRCRSS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
>>>>>>>>>> shm on /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bcedc1dcf05a65866458523ffd4a71614/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",size=65536k)
>>>>>>>>>>
>>>>>>>>>> sidtab_search_context checks whether the context is already in the sidtab list. If it is not found, a new node is generated and inserted into the list. As the number of containers increases, so does the number of context nodes. In our test the final number of nodes reached 300,000+, and sidtab_context_to_sid needs 100-200ms to run, which leads to the system softlockup.
>>>>>>>>>>
>>>>>>>>>> Is this a selinux bug? When a filesystem is unmounted, why is the context node not deleted? I cannot find a function that deletes the node in sidtab.c.
>>>>>>>>>>
>>>>>>>>>> Thanks for reading and looking forward to your reply.

>>>>>>>>> So, does docker just keep allocating a unique category set for every new container, never reusing them even if the container is destroyed? That would be a bug in docker IMHO. Or are you creating an unbounded number of containers and never destroying the older ones?

>>>>>>>> You can't reuse the security context. A process in ContainerA sends a labeled packet to MachineB. ContainerA goes away and its context is recycled in ContainerC. MachineB responds some time later, again with a labeled packet. ContainerC gets information intended for ContainerA, and uses the information to take over the Elbonian government.

>>>>>>> Docker isn't using labeled networking (nor is anything else by default; it is only enabled if explicitly configured).

>>>>>> If labeled networking weren't an issue we'd have full security module stacking by now. Yes, it's an edge case. If you want to use labeled NFS or a local filesystem that gets mounted in each container (don't tell me that nobody would do that) you've got the same problem.

>>>>> Even if someone were to configure labeled networking, Docker is not presently relying on that or SELinux network enforcement for any security properties, so it really doesn't matter.

>>>> True enough. I can imagine a use case, but as you point out, it would be a very complex configuration and coordination exercise using SELinux.

>>>>> And if they wanted to do that, they'd have to coordinate category assignments across all systems involved, for which no facility exists AFAIK. If you have two docker instances running on different hosts, I'd wager that they can hand out the same category sets today to different containers.
>>>>>
>>>>> With respect to labeled NFS, that's also not the default for nfs mounts, so again it is a custom configuration and Docker isn't relying on it for any guarantees today. For local filesystems, they would normally be context-mounted or using genfscon rather than xattrs in order to be accessible to the container, thus no persistent storage of the category sets.

Well Kubernetes and OpenShift do set the labels to be the same within a project, and they can manage across nodes. But yes we are not using labeled networking at this point.

>>>> I know that is the intended configuration, but I see people do all sorts of stoopid things for what they believe are good reasons. Unfortunately, lots of people count on containers to provide isolation, but create "solutions" for data sharing that defeat it.

>>>>>>> Certainly docker could provide an option to not reuse category sets, but making that the default is not sane and just guarantees exhaustion of the SID and context space (just create and tear down lots of containers every day or more frequently).

>>>>>> It seems that Docker might have a similar issue with UIDs, but it takes longer to run out of UIDs than sidtab entries.

>>>>>>>>> On the selinux userspace side, we'd also like to eliminate the use of /sys/fs/selinux/user (sel_write_user -> security_get_user_sids) entirely, which is what triggered this for you.
>>>>>>>>>
>>>>>>>>> We cannot currently delete a sidtab node because we have no way of knowing if there are any lingering references to the SID. Fixing that would require reference-counted SIDs, which goes beyond just SELinux since SIDs/secids are returned by LSM hooks and cached in other kernel data structures.

>>>>>>>> You could delete a sidtab node. The code already deals with unfindable SIDs. The issue is that eventually you run out of SIDs. Then you are forced to recycle SIDs, which leads to the overthrow of the Elbonian government.

>>>>>>> We don't know when we can safely delete a sidtab node since SIDs aren't reference counted and we can't know whether it is still in use somewhere in the kernel. Doing so prematurely would lead to the SID being remapped to the unlabeled context, and then likely to undesired denials.

>>>>>> I would suggest that if you delete a sidtab node and someone comes along later and tries to use it that denial is exactly what you would desire. I don't see any other rational action.

>>>>> Yes, if we know that the SID wasn't in use at the time we tore it down. But if we're just randomly deleting sidtab entries based on age or something (since we have no reference count), we'll almost certainly encounter situations where a SID hasn't been accessed in a long time but is still being legitimately cached somewhere. Just a file that hasn't been accessed in a while might have that SID still cached in its inode security blob, or anywhere else.

>>>>>>>>> sidtab_search_context() could no doubt be optimized for the negative case; there was an earlier optimization for the positive case by adding a cache to sidtab_context_to_sid() prior to calling it. It's a reverse lookup in the sidtab.

>>>>>>>> This seems like a bad idea.

>>>>>>> Not sure what you mean, but it can certainly be changed to at least use a hash table for these reverse lookups.

>> Thanks for the reply and discussion. I think the docker container is only one case. Is it possible that an attacker could, by some similar means, trigger a constantly growing SID list and eventually cause a system panic?
>>
>> I think the issue is that it takes too long to search for a SID node when the SID list is too large. If the node data structure (e.g. a tree structure) or the search algorithm can be optimized so that traversing the nodes takes very little time even when there are many of them, maybe that would solve the problem. Or sidtab.c could provide a "delete_sidtab_node" interface that deletes the SID node when the fs is unmounted. Once the fs is unmounted the SID is useless, so it could be deleted to control the size of the SID list.
>>
>> Thanks for reading and looking forward to your reply.

> We cannot safely delete entries in the sidtab without first adding reference counting of SIDs, which goes beyond just SELinux since they are cached in other kernel data structures and returned by LSM hooks. That's a non-trivial undertaking. Far more practical in the near term would be to introduce a hash table or other mechanism for efficient reverse lookups in the sidtab. Are you offering to implement that or just requesting it?
>
> Independent of that, docker should support reuse of category sets when containers are deleted, at least as an option and probably as the default.
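
To make the hash-table suggestion above concrete, here is a rough userspace sketch in Python, not kernel code. It contrasts a reverse lookup that scans every entry with one that keeps a context -> SID index. The class names, the Context fields, and the SID numbering are all made up for illustration; the real sidtab layout is different.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Context:
    """Simplified stand-in for a security context (hypothetical layout)."""
    user: str
    role: str
    type: str
    mls_range: str


class LinearSidtab:
    """Reverse lookup by scanning the whole table: O(n) on every miss."""

    def __init__(self) -> None:
        self.entries: List[Tuple[int, Context]] = []
        self.next_sid = 1

    def context_to_sid(self, ctx: Context) -> int:
        for sid, known in self.entries:  # a miss walks all entries
            if known == ctx:
                return sid
        sid = self.next_sid
        self.next_sid += 1
        self.entries.append((sid, ctx))
        return sid


class HashedSidtab:
    """Same interface, plus a context -> SID index for reverse lookups."""

    def __init__(self) -> None:
        self.by_sid: Dict[int, Context] = {}
        self.by_context: Dict[Context, int] = {}
        self.next_sid = 1

    def context_to_sid(self, ctx: Context) -> int:
        sid = self.by_context.get(ctx)  # expected O(1), hit or miss
        if sid is None:
            sid = self.next_sid
            self.next_sid += 1
            self.by_sid[sid] = ctx
            self.by_context[ctx] = sid
        return sid

    def sid_to_context(self, sid: int) -> Optional[Context]:
        return self.by_sid.get(sid)
```

Entries are never removed in either variant, which matches the constraint discussed above: without reference counting there is no safe point to drop a SID. The index only makes the lookup cheap; it does not stop the table from growing.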
Docker does reuse categories of containers that are removed, by default.
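
For illustration only, a small Python sketch of what category reuse could look like. This is not Docker's actual allocator; CategoryPool, its methods, and the c0..c1023 range are assumptions made for the example. It hands released pairs back out before drawing new ones, which is one way to keep the number of distinct contexts (and so the number of sidtab entries) close to the peak number of live containers.

```python
import random
from typing import List, Optional, Set, Tuple

CATEGORY_COUNT = 1024  # assumed MCS range c0..c1023
MAX_PAIRS = CATEGORY_COUNT * (CATEGORY_COUNT - 1) // 2

Pair = Tuple[int, ...]


class CategoryPool:
    """Hypothetical allocator of two-category MCS levels with reuse."""

    def __init__(self, seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)
        self._in_use: Set[Pair] = set()
        self._released: List[Pair] = []   # handed back, reused first
        self._ever_used: Set[Pair] = set()

    def allocate(self) -> str:
        # Prefer a pair that was released earlier over a brand-new one.
        pair = self._released.pop() if self._released else self._fresh_pair()
        self._in_use.add(pair)
        return f"s0:c{pair[0]},c{pair[1]}"

    def _fresh_pair(self) -> Pair:
        if len(self._ever_used) >= MAX_PAIRS:
            raise RuntimeError("no free category pairs")
        while True:
            pair = tuple(sorted(self._rng.sample(range(CATEGORY_COUNT), 2)))
            if pair not in self._ever_used:
                self._ever_used.add(pair)
                return pair

    def release(self, level: str) -> None:
        # Parse "s0:cA,cB" back into the (A, B) pair.
        pair = tuple(int(c[1:]) for c in level.split(":")[1].split(","))
        if pair in self._in_use:
            self._in_use.remove(pair)
            self._released.append(pair)

    def distinct_levels_seen(self) -> int:
        return len(self._ever_used)
```

With this policy, starting and removing one container at a time any number of times only ever produces one distinct level. An allocator that releases pairs but always picks a new random pair would still walk through most of the pair space over time.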