Re: [PATCH v3] mm: kmem: add lockdep assertion to obj_cgroup_memcg

Marek Szyprowski <m.szyprowski@xxxxxxxxxxx> · Wed, 31 Jul 2024 10:09:03 +0200

On 31.07.2024 09:02, Muchun Song wrote:
>> On Jul 31, 2024, at 02:52, Marek Szyprowski <m.szyprowski@xxxxxxxxxxx> wrote:
>>
>> On 25.07.2024 11:43, Muchun Song wrote:
>>> The obj_cgroup_memcg() is supposed to safe to prevent the returned
>>> memory cgroup from being freed only when the caller is holding the
>>> rcu read lock or objcg_lock or cgroup_mutex. It is very easy to
>>> ignore thoes conditions when users call some upper APIs which call
>>> obj_cgroup_memcg() internally like mem_cgroup_from_slab_obj() (See
>>> the link below). So it is better to add lockdep assertion to
>>> obj_cgroup_memcg() to find those issues ASAP.
>>>
>>> Because there is no user of obj_cgroup_memcg() holding objcg_lock
>>> to make the returned memory cgroup safe, do not add objcg_lock
>>> assertion (We should export objcg_lock if we really want to do).
>>> Additionally, this is some internal implementation detail of memcg
>>> and should not be accessible outside memcg code.
>>>
>>> Some users like __mem_cgroup_uncharge() do not care the lifetime
>>> of the returned memory cgroup, which just want to know if the
>>> folio is charged to a memory cgroup, therefore, they do not need
>>> to hold the needed locks. In which case, introduce a new helper
>>> folio_memcg_charged() to do this. Compare it to folio_memcg(), it
>>> could eliminate a memory access of objcg->memcg for kmem, actually,
>>> a really small gain.
>>>
>>> Link: https://lore.kernel.org/all/20240718083607.42068-1-songmuchun@xxxxxxxxxxxxx/
>>> Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
>> This patch landed in today's linux-next as commit 230b2f1f31b9 ("mm:
>> kmem: add lockdep assertion to obj_cgroup_memcg"). I my tests I found
>> that it triggers the following warning on Debian bookworm/sid system
>> image running under QEMU RISCV64:
> Thanks for your report.
>
> I'd like to say excellent since it indeed indicates this patch works
> well. Your report is actually a bug that I fixed it in [1] but not
> related to this patch.
>
> [1] https://lore.kernel.org/all/20240718083607.42068-1-songmuchun@xxxxxxxxxxxxx/

Confirmed. Applying [1] on top of next-20240730 fixes this issue without 
reverting $subject.

Tested-by: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx>

>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 1 at include/linux/memcontrol.h:373
>> mem_cgroup_from_slab_obj+0x13e/0x1ea
>> Modules linked in:
>> CPU: 0 UID: 0 PID: 1 Comm: systemd Not tainted 6.10.0+ #15154
>> Hardware name: riscv-virtio,qemu (DT)
>> epc : mem_cgroup_from_slab_obj+0x13e/0x1ea
>>   ra : mem_cgroup_from_slab_obj+0x13c/0x1ea
>> ...
>> [<ffffffff80257256>] mem_cgroup_from_slab_obj+0x13e/0x1ea
>> [<ffffffff801f0b3e>] list_lru_del_obj+0xa6/0xc2
>> [<ffffffff8027c6c6>] d_lru_del+0x8c/0xa4
>> [<ffffffff8027da60>] __dentry_kill+0x15e/0x17a
>> [<ffffffff8027ec3c>] dput.part.0+0x242/0x3e6
>> [<ffffffff8027edee>] dput+0xe/0x18
>> [<ffffffff8027324c>] lookup_fast+0x80/0xce
>> [<ffffffff80273e28>] walk_component+0x20/0x13c
>> [<ffffffff802747e2>] path_lookupat+0x64/0x16c
>> [<ffffffff80274bf4>] filename_lookup+0x76/0x122
>> [<ffffffff80274d80>] user_path_at+0x30/0x4a
>> [<ffffffff802d12bc>] __riscv_sys_name_to_handle_at+0x52/0x1d8
>> [<ffffffff80b60324>] do_trap_ecall_u+0x14e/0x1da
>> [<ffffffff80b6c546>] handle_exception+0xca/0xd6
>> irq event stamp: 198187
>> hardirqs last  enabled at (198187): [<ffffffff8028ca9e>]
>> lookup_mnt+0x186/0x308
>> hardirqs last disabled at (198186): [<ffffffff8028ca74>]
>> lookup_mnt+0x15c/0x308
>> softirqs last  enabled at (198172): [<ffffffff800e34f6>]
>> cgroup_apply_control_enable+0x1f6/0x2fc
>> softirqs last disabled at (198170): [<ffffffff800e34d8>]
>> cgroup_apply_control_enable+0x1d8/0x2fc
>> ---[ end trace 0000000000000000 ]---
>>
>> Similar warning appears on ARM64 Debian bookworm system. Reverting it on
>> top of linux-next hides the issue, but I assume this is not the best way
>> to fix it.
>>
>> I'm testing kernel built from riscv/defconfig with PROVE_LOCKING,
>> DEBUG_ATOMIC_SLEEP, DEBUG_DRIVER and DEBUG_DEVRES options enabled.
>>
>>> ---
>>> v3:
>>>   - Use lockdep_assert_once(Vlastimil).
>>>
>>> v2:
>>>   - Remove mention of objcg_lock in obj_cgroup_memcg()(Shakeel Butt).
>>>
>>>   include/linux/memcontrol.h | 20 +++++++++++++++++---
>>>   mm/memcontrol.c            |  6 +++---
>>>   2 files changed, 20 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index fc94879db4dff..95f823deafeca 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -360,11 +360,11 @@ static inline bool folio_memcg_kmem(struct folio *folio);
>>>    * After the initialization objcg->memcg is always pointing at
>>>    * a valid memcg, but can be atomically swapped to the parent memcg.
>>>    *
>>> - * The caller must ensure that the returned memcg won't be released:
>>> - * e.g. acquire the rcu_read_lock or css_set_lock.
>>> + * The caller must ensure that the returned memcg won't be released.
>>>    */
>>>   static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
>>>   {
>>> + lockdep_assert_once(rcu_read_lock_held() || lockdep_is_held(&cgroup_mutex));
>>>    return READ_ONCE(objcg->memcg);
>>>   }
>>>
>>> @@ -438,6 +438,19 @@ static inline struct mem_cgroup *folio_memcg(struct folio *folio)
>>>    return __folio_memcg(folio);
>>>   }
>>>
>>> +/*
>>> + * folio_memcg_charged - If a folio is charged to a memory cgroup.
>>> + * @folio: Pointer to the folio.
>>> + *
>>> + * Returns true if folio is charged to a memory cgroup, otherwise returns false.
>>> + */
>>> +static inline bool folio_memcg_charged(struct folio *folio)
>>> +{
>>> + if (folio_memcg_kmem(folio))
>>> + return __folio_objcg(folio) != NULL;
>>> + return __folio_memcg(folio) != NULL;
>>> +}
>>> +
>>>   /**
>>>    * folio_memcg_rcu - Locklessly get the memory cgroup associated with a folio.
>>>    * @folio: Pointer to the folio.
>>> @@ -454,7 +467,6 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
>>>    unsigned long memcg_data = READ_ONCE(folio->memcg_data);
>>>
>>>    VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
>>> - WARN_ON_ONCE(!rcu_read_lock_held());
>>>
>>>    if (memcg_data & MEMCG_DATA_KMEM) {
>>>    struct obj_cgroup *objcg;
>>> @@ -463,6 +475,8 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
>>>    return obj_cgroup_memcg(objcg);
>>>    }
>>>
>>> + WARN_ON_ONCE(!rcu_read_lock_held());
>>> +
>>>    return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
>>>   }
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 622d4544edd24..3da0284573857 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -2366,7 +2366,7 @@ void mem_cgroup_cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
>>>
>>>   static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
>>>   {
>>> - VM_BUG_ON_FOLIO(folio_memcg(folio), folio);
>>> + VM_BUG_ON_FOLIO(folio_memcg_charged(folio), folio);
>>>    /*
>>>     * Any of the following ensures page's memcg stability:
>>>     *
>>> @@ -4617,7 +4617,7 @@ void __mem_cgroup_uncharge(struct folio *folio)
>>>    struct uncharge_gather ug;
>>>
>>>    /* Don't touch folio->lru of any random page, pre-check: */
>>> - if (!folio_memcg(folio))
>>> + if (!folio_memcg_charged(folio))
>>>    return;
>>>
>>>    uncharge_gather_clear(&ug);
>>> @@ -4662,7 +4662,7 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
>>>    return;
>>>
>>>    /* Page cache replacement: new folio already charged? */
>>> - if (folio_memcg(new))
>>> + if (folio_memcg_charged(new))
>>>    return;
>>>
>>>    memcg = folio_memcg(old);
>> Best regards
>> -- 
>> Marek Szyprowski, PhD
>> Samsung R&D Institute Poland
>
>
Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland