Re: [RFC] mm/kmemleak: Fix sleeping function called from invalid context at print message

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello Thomas,
Thanks for your suggestion.

On Mon, Nov 18, 2024 at 3:29 PM Thomas Weißschuh <thomas.weissschuh@xxxxxxxxxxxxx> wrote:
On Fri, Nov 15, 2024 at 02:54:10PM +0000, Alessandro Carminati wrote:
> Address a bug in the kernel that triggers a "sleeping function called from
> invalid context" warning when /sys/kernel/debug/kmemleak is printed under
> specific conditions:
> - CONFIG_PREEMPT_RT=y
> - Set SELinux as the LSM for the system
> - Set kptr_restrict to 1
> - kmemleak buffer contains at least one item
> Ensure the kmemleak buffer contains at least one item
> Commit 8c96f1bc6fc49c724c4cdd22d3e99260263b7384 ("mm/kmemleak: turn
> kmemleak_lock and object->lock to raw_spinlock_t") introduced a change
> where kmemleak_seq_show is executed in atomic context within the RT kernel.
> However, the SELinux capability check in this function flow still relies on
> regular spinlocks, leading to potential race conditions that trigger an
> error when printing the kmemleak backtrace.
> Move the backtrace printing out of the critical section. Use a stack
> variable to store the backtrace pointers, avoiding spinlocks in the atomic
> context.
>
> Implement delta encoding to minimize the stack memory footprint,
> addressing the potentially large memory demands for storing these pointers
> on 64-bit systems.

The stacktrace is already stored in the stackdepot.
Shouldn't it be possible to take a reference to the stackdepot entry
inside the critical section and then use that reference outside of the
critical section for printing?

Yes, it is indeed possible to do that.

However, kmemleak operates in such a way that entries can be deleted, for
example, if they are found to be false positives.
My concern was that using a reference to a potentially deleted entry could
cause problems. But after considering your suggestion, I verified that
stack_depot_put, which is used to delete a stack depot entry, does not
appear to be called when a kmemleak object is deleted.
This makes me question whether that is the intended behavior.

As things currently stand, it seems possible to remove all the code I added
to store the trace in the stack and instead directly use the stack_depot
handle.

I would appreciate feedback from kmemleak and stackdepot experts regarding
this approach.
 

> Signed-off-by: Alessandro Carminati <acarmina@xxxxxxxxxx>
> ---
> ```
> [  159.247069] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
> [  159.247193] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 136, name: cat
> [  159.247241] preempt_count: 1, expected: 0
> [  159.247277] RCU nest depth: 2, expected: 2
> [  159.247388] 6 locks held by cat/136:
> [  159.247438]  #0: ffff32e64bcbf950 (&p->lock){+.+.}-{3:3}, at: seq_read_iter+0xb8/0xe30
> [  159.248835]  #1: ffffafe6aaa9dea0 (scan_mutex){+.+.}-{3:3}, at: kmemleak_seq_start+0x34/0x128
> [  159.249053]  #3: ffff32e6546b1cd0 (&object->lock){....}-{2:2}, at: kmemleak_seq_show+0x3c/0x1e0
> [  159.249127]  #4: ffffafe6aa8d8560 (rcu_read_lock){....}-{1:2}, at: has_ns_capability_noaudit+0x8/0x1b0
> [  159.249205]  #5: ffffafe6aabbc0f8 (notif_lock){+.+.}-{2:2}, at: avc_compute_av+0xc4/0x3d0
> [  159.249364] irq event stamp: 136660
> [  159.249407] hardirqs last  enabled at (136659): [<ffffafe6a80fd7a0>] _raw_spin_unlock_irqrestore+0xa8/0xd8
> [  159.249465] hardirqs last disabled at (136660): [<ffffafe6a80fd85c>] _raw_spin_lock_irqsave+0x8c/0xb0
> [  159.249518] softirqs last  enabled at (0): [<ffffafe6a5d50b28>] copy_process+0x11d8/0x3df8
> [  159.249571] softirqs last disabled at (0): [<0000000000000000>] 0x0
> [  159.249970] Preemption disabled at:
> [  159.249988] [<ffffafe6a6598a4c>] kmemleak_seq_show+0x3c/0x1e0
> [  159.250609] CPU: 1 UID: 0 PID: 136 Comm: cat Tainted: G            E      6.11.0-rt7+ #34
> [  159.250797] Tainted: [E]=UNSIGNED_MODULE
> [  159.250822] Hardware name: linux,dummy-virt (DT)
> [  159.251050] Call trace:
> [  159.251079]  dump_backtrace+0xa0/0x128
> [  159.251132]  show_stack+0x1c/0x30
> [  159.251156]  dump_stack_lvl+0xe8/0x198
> [  159.251180]  dump_stack+0x18/0x20
> [  159.251227]  rt_spin_lock+0x8c/0x1a8
> [  159.251273]  avc_perm_nonode+0xa0/0x150
> [  159.251316]  cred_has_capability.isra.0+0x118/0x218
> [  159.251340]  selinux_capable+0x50/0x80
> [  159.251363]  security_capable+0x7c/0xd0
> [  159.251388]  has_ns_capability_noaudit+0x94/0x1b0
> [  159.251412]  has_capability_noaudit+0x20/0x30
> [  159.251437]  restricted_pointer+0x21c/0x4b0
> [  159.251461]  pointer+0x298/0x760
> [  159.251482]  vsnprintf+0x330/0xf70
> [  159.251504]  seq_printf+0x178/0x218
> [  159.251526]  print_unreferenced+0x1a4/0x2d0
> [  159.251551]  kmemleak_seq_show+0xd0/0x1e0
> [  159.251576]  seq_read_iter+0x354/0xe30
> [  159.251599]  seq_read+0x250/0x378
> [  159.251622]  full_proxy_read+0xd8/0x148
> [  159.251649]  vfs_read+0x190/0x918
> [  159.251672]  ksys_read+0xf0/0x1e0
> [  159.251693]  __arm64_sys_read+0x70/0xa8
> [  159.251716]  invoke_syscall.constprop.0+0xd4/0x1d8
> [  159.251767]  el0_svc+0x50/0x158
> [  159.251813]  el0t_64_sync+0x17c/0x180
> ```
> I have considered three potential approaches to address this matter:
>
> 1. Remove Raw Pointer Printing
> The simplest solution is to eliminate raw pointer printing from the report.
> This approach involves minimal changes to the kernel code and is
> straightforward to implement.
>
> While I am confident that omitting the raw address would result in
> negligible information loss in most scenarios, some may perceive it as a
> feature regression. Below is an example of the modification:
> ```
> - warn_or_seq_printf(seq, "    [<%pK>] %pS\n", ptr, ptr);
> + warn_or_seq_printf(seq, "    %pS\n", ptr);
> ```
> This change may be acceptable since the %pS format outputs a hex string
> if no kallsyms are available. However, it modifies the original behavior,
> and in the kallsyms scenario, the raw pointer would no longer be present.
>
> 2. Modify SELinux to Avoid Sleeping Spinlocks
> Another option is to alter the SELinux capability check to use
> non-sleeping spinlocks.
> However, this approach is not advisable. The SELinux capability check is
> extensively used across the kernel and is far more critical than the
> kmemleak reporting feature.
> Adapting it to address this rare issue could unnecessarily introduce
> latency across the entire kernel, particularly as kmemleak is rarely used
> in production environments.
>
> 3. Move Stack Trace Printing Outside the Atomic Section
> The third and preferred approach is to move the stack trace printing
> outside the atomic section. This would preserve the current functionality
> without modifying SELinux.
>
> The primary challenge here lies in making the backtrace pointers available
> after exiting the critical section, as they are captured within it.
> To address this, the backtrace pointers can be copied to a safe location,
> enabling access once the raw_spinlock is released.
>
> Options for Creating a Safe Location for Backtrace Pointers
> Several strategies have been considered for storing the backtrace pointers
> safely:
> * Dynamic Allocation
>     * Allocating memory with kmalloc cannot be done within a raw_spinlock
>       area. Using GFP_ATOMIC is also infeasible.
>     * Since the code that prints the message is inside a loop, executed
>       potentially multiple times, it is only within the raw_spinlock
>       section that we can determine whether allocation is needed.
>     * Allocating and deallocating memory on every loop iteration would be
>       prohibitively expensive.
> * Global Data Section
>     * In this strategy, the message would be printed after exiting the
>       raw_spinlock protected section.
>     * However, this approach risks data corruption if another occurrence
>       of the issue arises before the first operation completes.
> * Per-CPU Data
>     * The same concerns as with global data apply here. While data
>       corruption is less likely, it is not impossible.
> * Stack
>     * Using the stack is the best option since each thread has its own
>       stack, ensuring data isolation. However, the size of the data poses
>       a challenge.
>     * Exporting a full stack trace pointer list requires considerable space.
>       A 32-level stack trace in a 64-bit system would require 256 bytes,
>       which is contrary to best practices for stack size management.
>
> To mitigate this, I propose using delta encoding to store the addresses.
> This method reduces the size of each address from 8 bytes to 4 bytes on
> 64-bit systems. While this introduces some complexity, it significantly
> reduces memory usage and allows us to preserve the kmemleak reports in their
> current form.
>  mm/kmemleak.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 71 insertions(+), 7 deletions(-)
>
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index 0400f5e8ac60..fc5869e09280 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -274,6 +274,44 @@ static void kmemleak_disable(void);
>               pr_warn(fmt, ##__VA_ARGS__);            \
>  } while (0)

> +#define PTR_STORAGE_OP_OK    -1
> +#define PTR_STORAGE_OP_FAIL  0
> +#define PTR_STORAGE_CAPACITY 32
> +
> +struct ptr_storage {
> +     unsigned long   base;
> +     u32             data[PTR_STORAGE_CAPACITY];
> +     int             nr_entries;
> +};
> +
> +static int ptr_storage_insert(unsigned long p, struct ptr_storage *s)
> +{
> +     unsigned long diff_data;
> +
> +     if (s->nr_entries != 0) {
> +             diff_data = s->base - p;
> +             if (s->nr_entries < PTR_STORAGE_CAPACITY) {
> +                     s->data[((s->nr_entries - 1))] = diff_data & 0xffffffff;
> +                     s->nr_entries++;
> +                     return PTR_STORAGE_OP_OK;
> +             }
> +             return PTR_STORAGE_OP_FAIL;
> +     }
> +     s->base = p;
> +     s->nr_entries++;
> +     return PTR_STORAGE_OP_OK;
> +}
> +
> +static void *ptr_storage_get(struct ptr_storage *s, int item_no)
> +{
> +     if (item_no < s->nr_entries && item_no > 0)
> +             return (void *)s->base - (s32)s->data[(item_no - 1)];
> +
> +     if (item_no == 0)
> +             return (void *)s->base;
> +     return NULL;
> +}
> +
>  static void warn_or_seq_hex_dump(struct seq_file *seq, int prefix_type,
>                                int rowsize, int groupsize, const void *buf,
>                                size_t len, bool ascii)
> @@ -357,11 +395,13 @@ static bool unreferenced_object(struct kmemleak_object *object)
>   * print_unreferenced function must be called with the object->lock held.
>   */
>  static void print_unreferenced(struct seq_file *seq,
> -                            struct kmemleak_object *object)
> +                            struct kmemleak_object *object,
> +                            struct ptr_storage *s)
>  {
>       int i;
>       unsigned long *entries;
>       unsigned int nr_entries;
> +     unsigned long tmp;

>       nr_entries = stack_depot_fetch(object->trace_handle, &entries);
>       warn_or_seq_printf(seq, "unreferenced object 0x%08lx (size %zu):\n",
> @@ -372,8 +412,8 @@ static void print_unreferenced(struct seq_file *seq,
>       warn_or_seq_printf(seq, "  backtrace (crc %x):\n", object->checksum);

>       for (i = 0; i < nr_entries; i++) {
> -             void *ptr = (void *)entries[i];
> -             warn_or_seq_printf(seq, "    [<%pK>] %pS\n", ptr, ptr);
> +             tmp = (unsigned long)entries[i];
> +             ptr_storage_insert(tmp, s);
>       }
>  }

> @@ -1625,6 +1665,10 @@ static void kmemleak_scan(void)
>       struct zone *zone;
>       int __maybe_unused i;
>       int new_leaks = 0;
> +     struct ptr_storage s = {0};
> +     bool do_print = false;
> +     void *tmp;
> +     int inx;

>       jiffies_last_scan = jiffies;

> @@ -1783,12 +1827,20 @@ static void kmemleak_scan(void)
>                   !(object->flags & OBJECT_REPORTED)) {
>                       object->flags |= OBJECT_REPORTED;

> -                     if (kmemleak_verbose)
> -                             print_unreferenced(NULL, object);
> +                     if (kmemleak_verbose) {
> +                             print_unreferenced(NULL, object, &s);
> +                             do_print = true;
> +                     }

>                       new_leaks++;
>               }
>               raw_spin_unlock_irq(&object->lock);
> +             if (kmemleak_verbose && do_print) {
> +                     for (inx = 0; inx < s.nr_entries; inx++) {
> +                             tmp = ptr_storage_get(&s, i);
> +                             warn_or_seq_printf(NULL, "    [<%pK>] %pS\n", tmp, tmp);
> +                     }
> +             }
>       }
>       rcu_read_unlock();

> @@ -1939,11 +1991,23 @@ static int kmemleak_seq_show(struct seq_file *seq, void *v)
>  {
>       struct kmemleak_object *object = v;
>       unsigned long flags;
> +     struct ptr_storage s = {0};
> +     void *tmp;
> +     int i;
> +     bool do_print = false;

>       raw_spin_lock_irqsave(&object->lock, flags);
> -     if ((object->flags & OBJECT_REPORTED) && unreferenced_object(object))
> -             print_unreferenced(seq, object);
> +     if ((object->flags & OBJECT_REPORTED) && unreferenced_object(object)) {
> +             print_unreferenced(seq, object, &s);
> +             do_print = true;
> +     }
>       raw_spin_unlock_irqrestore(&object->lock, flags);
> +     if (do_print) {
> +             for (i = 0; i < s.nr_entries; i++) {
> +                     tmp = ptr_storage_get(&s, i);
> +                     warn_or_seq_printf(seq, "    [<%pK>] %pS\n", tmp, tmp);
> +             }
> +     }
>       return 0;
>  }

> --
> 2.34.1
>



--
---
172

[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux