Re: [PATCH] selinux: cache the SID -> context string translation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 10/28/19 2:09 PM, Ondrej Mosnacek wrote:
On Mon, Oct 28, 2019 at 6:46 PM Stephen Smalley <sds@xxxxxxxxxxxxx> wrote:
On 10/25/19 10:55 AM, Ondrej Mosnacek wrote:
Translating a context struct to string can be quite slow, especially if
the context has a lot of category bits set. This can cause quite
noticeable performance impact in situations where the translation needs
to be done repeatedly. A common example is a UNIX datagram socket with
the SO_PASSSEC option enabled, which is used e.g. by systemd-journald
when receiving log messages via datagram socket. This scenario can be
reproduced with:

      cat /dev/urandom | base64 | logger &
      timeout 30s perf record -p $(pidof systemd-journald) -a -g
      kill %1
      perf report -g none --pretty raw | grep security_secid_to_secctx

Before the caching introduced by this patch, computing the context
string (security_secid_to_secctx() function) takes up ~65% of
systemd-journald's CPU time (assuming a context with 1024 categories
set and Fedora x86_64 release kernel configs). After this patch
(assuming near-perfect cache hit ratio) this overhead is reduced to just
~2%.

This patch addresses the issue by caching a certain number (compile-time
configurable) of recently used context strings to speed up repeated
translations of the same context, while using only a small amount of
memory.

The cache is integrated into the existing sidtab table by adding a field
to each entry, which when not NULL contains an RCU-protected pointer to
a cache entry containing the cached string. The cache entries are kept
in a linked list sorted according to how recently they were used. On a
cache miss when the cache is full, the least recently used entry is
removed to make space for the new entry.

The patch migrates security_sid_to_context_core() to use the cache (also
a few other functions where it was possible without too much fuss, but
these mostly use the translation for logging in case of error, which is
rare).

Running selinux-testsuite with lock debugging enabled,

<4>[  237.722577]
<4>[  237.722581] ================================
<4>[  237.722582] WARNING: inconsistent lock state
<4>[  237.722585] 5.4.0-rc1+ #45 Not tainted
<4>[  237.722586] --------------------------------
<4>[  237.722588] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
<4>[  237.722590] client/4460 [HC0[0]:SC1[1]:HE1:SE0] takes:
<4>[  237.722592] ffff888703331198 (&(&s->cache_lock)->rlock){+.?.}, at:
sidtab_sid2str_put.part.0+0x35/0x2f0
<4>[  237.722599] {SOFTIRQ-ON-W} state was registered at:
<4>[  237.722603]   lock_acquire+0xe5/0x210
<4>[  237.722606]   _raw_spin_lock+0x2c/0x40
<4>[  237.722608]   sidtab_sid2str_put.part.0+0x35/0x2f0
<4>[  237.722611]   sidtab_entry_to_string+0x86/0xa0
<4>[  237.722613]   security_sid_to_context_core.isra.0+0x147/0x190
<4>[  237.722615]   security_sid_to_context+0x23/0x30
<4>[  237.722618]   selinux_getprocattr+0x1f5/0x3f0
<4>[  237.722621]   proc_pid_attr_read+0xdf/0x170
<4>[  237.722623]   vfs_read+0xf3/0x1e0
<4>[  237.722625]   ksys_read+0xc3/0x160
<4>[  237.722628]   do_syscall_64+0x74/0xd0
<4>[  237.722630]   entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  237.722632] irq event stamp: 10362
<4>[  237.722635] hardirqs last  enabled at (10362):
[<ffffffff9f10badb>] _raw_write_unlock_irqrestore+0x4b/0x60
<4>[  237.722638] hardirqs last disabled at (10361):
[<ffffffff9f10bd41>] _raw_write_lock_irqsave+0x21/0x54
<4>[  237.722641] softirqs last  enabled at (10310):
[<ffffffff9eee6ea9>] ip_finish_output2+0x3f9/0x1070
<4>[  237.722644] softirqs last disabled at (10311):
[<ffffffff9f20107a>] do_softirq_own_stack+0x2a/0x40
<4>[  237.722645]
<4>[  237.722645] other info that might help us debug this:
<4>[  237.722647]  Possible unsafe locking scenario:
<4>[  237.722647]
<4>[  237.722648]        CPU0
<4>[  237.722650]        ----
<4>[  237.722651]   lock(&(&s->cache_lock)->rlock);
<4>[  237.722654]   <Interrupt>
<4>[  237.722657]     lock(&(&s->cache_lock)->rlock);
<4>[  237.722660]
<4>[  237.722660]  *** DEADLOCK ***
<4>[  237.722660]
<4>[  237.722662] 5 locks held by client/4460:
<4>[  237.722664]  #0: ffff88870a0a6650 (sk_lock-AF_INET){+.+.}, at:
inet_stream_connect+0x30/0x70
<4>[  237.722686]  #1: ffffffffa011e8e0 (rcu_read_lock){....}, at:
__ip_queue_xmit+0x5/0x9e0
<4>[  237.722690]  #2: ffffffffa011e8e0 (rcu_read_lock){....}, at:
process_backlog+0x8d/0x320
<4>[  237.722695]  #3: ffffffffa011e8e0 (rcu_read_lock){....}, at:
ip_local_deliver_finish+0x65/0x190
<4>[  237.722699]  #4: ffffffffa2023128
(&selinux_ss.policy_rwlock){.+.?}, at:
security_sid_to_context_core.isra.0+0xde/0x190
<4>[  237.722703]
<4>[  237.722703] stack backtrace:
<4>[  237.722707] CPU: 1 PID: 4460 Comm: client Not tainted 5.4.0-rc1+ #45
<4>[  237.722708] Hardware name: Dell Inc. OptiPlex 7050/0NW6H5, BIOS
1.8.3 03/23/2018
<4>[  237.722710] Call Trace:
<4>[  237.722712]  <IRQ>
<4>[  237.722716]  dump_stack+0x9a/0xf0
<4>[  237.722719]  mark_lock+0x7d9/0x9c0
<4>[  237.722722]  ? check_usage_backwards+0x230/0x230
<4>[  237.722725]  __lock_acquire+0xb5b/0x2600
<4>[  237.722730]  ? create_object+0x387/0x560
<4>[  237.722733]  ? lockdep_hardirqs_on+0x260/0x260
<4>[  237.722736]  ? mark_lock+0xac/0x9c0
<4>[  237.722738]  ? match_held_lock+0x2e/0x240
<4>[  237.722741]  ? match_held_lock+0x1b/0x240
<4>[  237.722744]  lock_acquire+0xe5/0x210
<4>[  237.722747]  ? sidtab_sid2str_put.part.0+0x35/0x2f0
<4>[  237.722751]  _raw_spin_lock+0x2c/0x40
<4>[  237.722753]  ? sidtab_sid2str_put.part.0+0x35/0x2f0
<4>[  237.722756]  sidtab_sid2str_put.part.0+0x35/0x2f0
<4>[  237.722760]  sidtab_sid2str_get+0x161/0x237
<4>[  237.722763]  sidtab_entry_to_string+0x31/0xa0
<4>[  237.722766]  security_sid_to_context_core.isra.0+0x147/0x190
<4>[  237.722770]  ? avc_audit_pre_callback+0x120/0x120
<4>[  237.722772]  security_sid_to_context+0x23/0x30
<4>[  237.722774]  avc_audit_post_callback+0xb5/0x360
<4>[  237.722777]  ? avc_audit_pre_callback+0x120/0x120
<4>[  237.722782]  ? avc_audit_pre_callback+0x120/0x120
<4>[  237.722784]  common_lsm_audit+0x22d/0xd10
<4>[  237.722787]  ? mark_lock+0xac/0x9c0
<4>[  237.722790]  ? ipv6_skb_to_auditdata+0x380/0x380
<4>[  237.722793]  ? __lock_acquire+0x5e0/0x2600
<4>[  237.722797]  ? match_held_lock+0x1b/0x240
<4>[  237.722800]  slow_avc_audit+0x14a/0x1c0
<4>[  237.722803]  ? avc_get_hash_stats+0x1b0/0x1b0
<4>[  237.722806]  ? find_held_lock+0x85/0xa0
<4>[  237.722808]  ? avc_has_perm+0x1ae/0x370
<4>[  237.722812]  ? lockdep_hardirqs_on+0x182/0x260
<4>[  237.722815]  ? avc_denied+0xd4/0xf0
<4>[  237.722818]  avc_has_perm+0x316/0x370
<4>[  237.722821]  ? avc_has_perm_noaudit+0x300/0x300
<4>[  237.722824]  ? sel_netnode_sid+0xbc/0x500
<4>[  237.722829]  selinux_socket_sock_rcv_skb+0x301/0x370
<4>[  237.722832]  ? selinux_sock_rcv_skb_compat+0x1f0/0x1f0
<4>[  237.722836]  ? mark_lock+0xac/0x9c0
<4>[  237.722839]  ? inet_ehashfn+0x80/0x170
<4>[  237.722842]  ? inet_ehash_locks_alloc+0x120/0x120
<4>[  237.722846]  security_sock_rcv_skb+0x45/0x60
<4>[  237.722849]  sk_filter_trim_cap+0x9b/0x500
<4>[  237.722852]  ? tcp_parse_md5sig_option+0x65/0x90
<4>[  237.722855]  ? bpf_get_listener_sock+0x70/0x70
<4>[  237.722858]  ? tcp_v4_reqsk_send_ack+0x1a0/0x1a0
<4>[  237.722862]  tcp_v4_rcv+0x1313/0x15f0
<4>[  237.722867]  ? tcp_v4_early_demux+0x2e0/0x2e0
<4>[  237.722886]  ? rcu_read_lock_held+0xa1/0xb0
<4>[  237.722889]  ? rcu_read_lock_sched_held+0xd0/0xd0
<4>[  237.722892]  ? lock_acquire+0xe5/0x210
<4>[  237.722895]  ip_protocol_deliver_rcu+0x3e/0x360
<4>[  237.722899]  ip_local_deliver_finish+0xec/0x190
<4>[  237.722902]  ip_local_deliver+0x252/0x300
<4>[  237.722905]  ? ip_local_deliver_finish+0x190/0x190
<4>[  237.722908]  ? ip_protocol_deliver_rcu+0x360/0x360
<4>[  237.722912]  ip_rcv+0x1fe/0x2a0
<4>[  237.722915]  ? ip_local_deliver+0x300/0x300
<4>[  237.722919]  ? ip_rcv_finish_core.isra.0+0x870/0x870
<4>[  237.722922]  ? lockdep_hardirqs_on+0x260/0x260
<4>[  237.722925]  ? ip_local_deliver+0x300/0x300
<4>[  237.722928]  __netif_receive_skb_one_core+0x109/0x120
<4>[  237.722931]  ? __netif_receive_skb_core+0x16f0/0x16f0
<4>[  237.722933]  ? lock_acquire+0xe5/0x210
<4>[  237.722936]  ? process_backlog+0x8d/0x320
<4>[  237.722940]  process_backlog+0x126/0x320
<4>[  237.722943]  net_rx_action+0x220/0x670
<4>[  237.722947]  ? napi_busy_loop+0x520/0x520
<4>[  237.722950]  ? rcu_read_lock_bh_held+0xb0/0xb0
<4>[  237.722953]  ? mark_held_locks+0x44/0x90
<4>[  237.722956]  __do_softirq+0x130/0x5c1
<4>[  237.722961]  do_softirq_own_stack+0x2a/0x40
<4>[  237.722963]  </IRQ>
<4>[  237.722965]  ? ip_finish_output2+0x3f9/0x1070
<4>[  237.722968]  do_softirq.part.0+0x60/0x70
<4>[  237.722971]  __local_bh_enable_ip+0x101/0x110
<4>[  237.722974]  ip_finish_output2+0x422/0x1070
<4>[  237.722978]  ? ip_frag_next+0x330/0x330
<4>[  237.722981]  ? ipv4_mtu+0xcf/0x150
<4>[  237.722985]  ? ip_output+0x119/0x340
<4>[  237.722987]  ip_output+0x119/0x340
<4>[  237.722991]  ? ip_mc_output+0x7f0/0x7f0
<4>[  237.722994]  ? __ip_finish_output+0x480/0x480
<4>[  237.722998]  __ip_queue_xmit+0x349/0x9e0
<4>[  237.723002]  ? tcp_options_write+0xce/0x360
<4>[  237.723005]  __tcp_transmit_skb+0xaca/0x1470
<4>[  237.723010]  ? __tcp_select_window+0x350/0x350
<4>[  237.723013]  ? rcu_read_lock_sched_held+0xd0/0xd0
<4>[  237.723016]  ? rb_insert_color+0x26b/0x3e0
<4>[  237.723020]  tcp_connect+0x1473/0x1a00
<4>[  237.723025]  ? __tcp_transmit_skb+0x1470/0x1470
<4>[  237.723027]  ? siphash_1u64+0x7/0x210
<4>[  237.723031]  ? secure_tcpv6_seq+0x1a0/0x1a0
<4>[  237.723034]  tcp_v4_connect+0x88c/0xac0
<4>[  237.723039]  ? tcp_v4_parse_md5_keys+0x180/0x180
<4>[  237.723043]  __inet_stream_connect+0x1ba/0x620
<4>[  237.723047]  ? ipv4_mib_init_net+0x2b0/0x2b0
<4>[  237.723050]  ? lock_acquire+0xe5/0x210
<4>[  237.723053]  ? __inet_stream_connect+0x620/0x620
<4>[  237.723055]  ? mark_held_locks+0x65/0x90
<4>[  237.723058]  ? __local_bh_enable_ip+0x9d/0x110
<4>[  237.723061]  ? __inet_stream_connect+0x620/0x620
<4>[  237.723064]  inet_stream_connect+0x44/0x70
<4>[  237.723067]  __sys_connect+0x16f/0x1b0
<4>[  237.723070]  ? __ia32_sys_accept+0x50/0x50
<4>[  237.723074]  ? ktime_get_coarse_real_ts64+0xb1/0xf0
<4>[  237.723077]  ? syscall_trace_enter+0x235/0x5f0
<4>[  237.723081]  ? audit_filter_syscall.part.0.constprop.0+0x2c0/0x2c0
<4>[  237.723084]  ? rcu_read_lock_sched_held+0xd0/0xd0
<4>[  237.723087]  ? exit_to_usermode_loop+0x120/0x120
<4>[  237.723089]  ? unroll_tree_refs+0x166/0x1c0
<4>[  237.723112]  ? kfree+0x2e6/0x350
<4>[  237.723114]  ? mark_held_locks+0x24/0x90
<4>[  237.723117]  ? trace_hardirqs_on_thunk+0x1a/0x20
<4>[  237.723121]  __x64_sys_connect+0x3e/0x50
<4>[  237.723124]  do_syscall_64+0x74/0xd0
<4>[  237.723127]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  237.723130] RIP: 0033:0x7f82c00a55b8
<4>[  237.723133] Code: 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00
00 90 f3 0f 1e fa 48 8d 05 65 80 0c 00 8b 00 85 c0 75 17 b8 2a 00 00 00
0f 05 <48> 3d 00 f0 ff ff 77 50 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 54
<4>[  237.723135] RSP: 002b:00007ffefad9fcb8 EFLAGS: 00000246 ORIG_RAX:
000000000000002a
<4>[  237.723138] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
00007f82c00a55b8
<4>[  237.723139] RDX: 0000000000000010 RSI: 000000000216a2b0 RDI:
0000000000000003
<4>[  237.723141] RBP: 00007ffefad9fe50 R08: 0000000000000010 R09:
0000000000000000
<4>[  237.723143] R10: 00000000004005a0 R11: 0000000000000246 R12:
0000000000401130
<4>[  237.723145] R13: 00007ffefad9ff30 R14: 0000000000000000 R15:
0000000000000000

Ah, stupid me didn't test with the debug kernel variant... Looks like
I'll need to disable interrupts when holding the cache_lock after
all... I mistakenly thought it could be avoided in this case.

Thanks for testing!

I guess a larger concern would be the performance/scalability implications of introducing a spin_lock_irqsave() / spin_unlock_irqrestore() on this code path. We tried to keep the lookup path lock-free aside from the policydb read lock (which eventually can get turned into a rcu_read_lock if/when we convert the policydb to use RCU). You are taking this new spinlock even in in the cache-hit case IIUC. Also not clear on the implications of adding another atomic allocation on this code path and the resulting failure just from not being able to allocate a cache object; optimally a failure to allocate a cache object would just fall back to the uncached code path and not return back -ENOMEM. That code path might ultimately yield ENOMEM anyway but then again it might not...



[Index of Archives]     [Selinux Refpolicy]     [Linux SGX]     [Fedora Users]     [Fedora Desktop]     [Yosemite Photos]     [Yosemite Camping]     [Yosemite Campsites]     [KDE Users]     [Gnome Users]

  Powered by Linux