Re: BUG: maple_tree: KCSAN: data-race in mas_topiary_replace / mtree_range_walk

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 9/28/23 21:59, Liam R. Howlett wrote:
* Mirsad Todorovac <mirsad.todorovac@xxxxxxxxxxxx> [230923 03:27]:
On 9/22/23 15:51, Liam R. Howlett wrote:

...

[ 6413.367326] ==================================================================
[ 6413.367349] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk

[ 6413.367375] write to 0xffff8883a0c5db00 of 8 bytes by task 5475 on cpu 24:
[ 6413.367386] mas_topiary_replace (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:491 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:1701 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2590)
[ 6413.367399] mas_spanning_rebalance.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2664 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2961)
[ 6413.367413] mas_wr_spanning_store.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:3894)
[ 6413.367428] mas_wr_store_entry.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4242)
[ 6413.367442] mas_store_prealloc (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:256 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:5408)
[ 6413.367455] vma_merge (/home/marvin/linux/kernel/torvalds2/mm/internal.h:1127 /home/marvin/linux/kernel/torvalds2/mm/mmap.c:1005)
[ 6413.367466] mprotect_fixup (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:632)
[ 6413.367480] do_mprotect_pkey (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:819)
[ 6413.367494] __x64_sys_mprotect (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:837)
[ 6413.367503] do_syscall_64 (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:50 /home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:80)
[ 6413.367517] entry_SYSCALL_64_after_hwframe (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/entry_64.S:120)

[ 6413.367534] read to 0xffff8883a0c5db00 of 8 bytes by task 5558 on cpu 11:
[ 6413.367545] mtree_range_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:539 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2789)
[ 6413.367556] mas_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:251 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4844)
[ 6413.367567] lock_vma_under_rcu (/home/marvin/linux/kernel/torvalds2/mm/memory.c:5436)
[ 6413.367579] do_user_addr_fault (/home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1357)
[ 6413.367593] exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/paravirt.h:695 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1513 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1561)
[ 6413.367602] asm_exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/idtentry.h:570)

[ 6413.367617] value changed: 0xffff888341d43116 -> 0xffff888340f92116

[ 6413.367632] Reported by Kernel Concurrency Sanitizer on:
[ 6413.367640] CPU: 11 PID: 5558 Comm: ThreadPoolForeg Tainted: G             L     6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
[ 6413.367653] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
[ 6413.367660] ==================================================================

For your convenience, took the trouble of finding those suspicious lines of code:

The write side:

lib/maple_tree.c:491
--------------------
    456 /*
    457  * mas_set_parent() - Set the parent node and encode the slot
    458  * @enode: The encoded maple node.
    459  * @parent: The encoded maple node that is the parent of @enode.
    460  * @slot: The slot that @enode resides in @parent.
    461  *
    462  * Slot number is encoded in the enode->parent bit 3-6 or 2-6, depending on the
    463  * parent type.
    464  */
    465 static inline
    466 void mas_set_parent(struct ma_state *mas, struct maple_enode *enode,
    467                     const struct maple_enode *parent, unsigned char slot)
    468 {
    469         unsigned long val = (unsigned long)parent;
    470         unsigned long shift;
    471         unsigned long type;
    472         enum maple_type p_type = mte_node_type(parent);
    473
    474         MAS_BUG_ON(mas, p_type == maple_dense);
    475         MAS_BUG_ON(mas, p_type == maple_leaf_64);
    476
    477         switch (p_type) {
    478         case maple_range_64:
    479         case maple_arange_64:
    480                 shift = MAPLE_PARENT_SLOT_SHIFT;
    481                 type = MAPLE_PARENT_RANGE64;
    482                 break;
    483         default:
    484         case maple_dense:
    485         case maple_leaf_64:
    486                 shift = type = 0;
    487                 break;
    488         }
    489
    490         val &= ~MAPLE_NODE_MASK; /* Clear all node metadata in parent */
→ 491         val |= (slot << shift) | type;
    492         mte_to_node(enode)->parent = ma_parent_ptr(val);
    493 }

This should probably be 492, not 491.  I know what is racing here and it
shouldn't cause issue.

That seems logical to me, too.

...
The read side:

     527 /*
     528  * ma_dead_node() - check if the @enode is dead.
     529  * @enode: The encoded maple node
     530  *
     531  * Return: true if dead, false otherwise.
     532  */
     533 static inline bool ma_dead_node(const struct maple_node *node)
     534 {
     535         struct maple_node *parent;
     536
     537         /* Do not reorder reads from the node prior to the parent check */
     538         smp_rmb();
→  539         parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
     540         return (parent == node);
     541 }

This is the correct line.

Good. Now it makes sense.

...

as above, but the smb_rmb() protection is here, so the KCSAN warning should be double-checked for
validity.

But I do not really understand maple trees to their depth, I am only reporting the symptomatic
outlook of the reported data-race.

This is the most complex part of the maple tree, and unique to the
ability to store a range over multiple existing entries.  I recently
rewrote this particular section to avoid a potential live-lock issue.

I see.

I am confident that the parent pointer will not be set to the node
pointer as checked by ma_dead_node().  But in an abundance of caution
and to resolve this potential false-positive, I am looking at this in
more detail.

This race is to see if the walk down the tree into unchanged data will
be seen before it is placed in the new subtree with its data also
unchanged.  Since we know the parent can never be the node, I feel this
is safe - but I'm not willing to live with the complaints from kasan.

I cannot analyse a couple of thousand lines at such a short notice.

Don't worry, I will :)

:-)

It looks suspicious because val in line 491 in a local variable and thus
isn't available elsewhere.

It is used in the node->parent, as described above.  It is a race, but
it doesn't matter who wins.


Maybe the compiler (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0) did something
to the code?

Probably off-by-one line.


This is all-in-all a very interested subject, and I wish there was a way to just slurp all those
interesting kernel intrinsics into the brain, but it just ain't that easy. Forgive me for being
open ...

I appreciate that and your detailed analysis with code pointers of where
this happens.  Is this easy to recreate?  If so, how?  Can you attach
your kernel config?

Got that attached first, so I do not forget. :-/

Recreate? Actually it happened quite a number of times on my configuration (480+?).

I'm having issues recreating it because I hit a lot of races before this
one in my test setup.  I will keep working on reproducing this race, but
in the mean time can you test the attached patch with your setup?

I have tried that on the recent vanilla torvalds tree kernel, and it still shows a number of
data-races:

[  157.300020] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[  181.319610] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  196.868714] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[  201.291452] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  226.185733] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  227.996664] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[  228.447678] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  229.270275] BUG: KCSAN: data-race in mas_wr_modify / mtree_range_walk
[  229.889751] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[  231.469987] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  234.872378] BUG: KCSAN: data-race in mas_wr_node_store / mtree_range_walk
[  235.159590] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  235.186811] BUG: KCSAN: data-race in mas_wr_node_store / mtree_range_walk
[  239.658108] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  241.893093] BUG: KCSAN: data-race in mas_topiary_replace / mas_walk
[  242.510624] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  244.692616] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[  245.122822] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  246.000424] BUG: KCSAN: data-race in mas_wr_node_store / mtree_range_walk
[  246.678092] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  251.742829] BUG: KCSAN: data-race in mas_topiary_replace / mas_walk
[  251.762047] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  254.148155] BUG: KCSAN: data-race in mas_wr_node_store / mtree_range_walk
[  263.442284] BUG: KCSAN: data-race in mas_topiary_replace / mas_walk
[  267.004013] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  270.086804] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  270.116319] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  353.218023] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[  962.775499] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[  962.783627] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  964.604473] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  967.502757] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[  967.539662] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  970.156674] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[  970.629596] BUG: KCSAN: data-race in mas_wr_modify / mtree_range_walk
[  970.718835] BUG: KCSAN: data-race in mas_topiary_replace / mas_walk
[  971.023748] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[  971.058710] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  971.584682] BUG: KCSAN: data-race in mas_wr_modify / mtree_range_walk
[  973.808208] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[  974.993192] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  975.076067] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[  977.460561] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  977.715483] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  978.078659] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[  978.186831] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[  995.034856] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 1006.951970] BUG: KCSAN: data-race in mas_wr_node_store / mtree_range_walk
[ 1072.950367] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 1146.949829] BUG: KCSAN: data-race in mas_wr_node_store / mtree_range_walk
[ 1174.940825] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 1245.325405] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 1287.939229] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 1523.333994] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk

However, for example, this one doesn't make sense:

==================================================================
BUG: KCSAN: data-race in mas_topiary_replace / mas_walk

write to 0xffff88811a9e0a00 of 8 bytes by task 9477 on cpu 15:
mas_topiary_replace (lib/maple_tree.c:304 lib/maple_tree.c:1731 lib/maple_tree.c:2572)
mas_spanning_rebalance.isra.0 (lib/maple_tree.c:2665 lib/maple_tree.c:2962)
mas_wr_spanning_store.isra.0 (lib/maple_tree.c:3895)
mas_wr_store_entry.isra.0 (lib/maple_tree.c:4243)
mas_store_gfp (lib/maple_tree.c:6090 lib/maple_tree.c:5386)
do_vmi_align_munmap (mm/internal.h:1090 mm/mmap.c:2535)
do_vmi_munmap (mm/mmap.c:2623)
mmap_region (mm/mmap.c:2673)
do_mmap (mm/mmap.c:1354)
vm_mmap_pgoff (mm/util.c:546)
ksys_mmap_pgoff (mm/mmap.c:1400)
__x64_sys_mmap (arch/x86/kernel/sys_x86_64.c:86)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)

read to 0xffff88811a9e0a00 of 8 bytes by task 9475 on cpu 28:
mas_walk (lib/maple_tree.c:524 lib/maple_tree.c:556 lib/maple_tree.c:1375 lib/maple_tree.c:1359 lib/maple_tree.c:3690 lib/maple_tree.c:4844)
lock_vma_under_rcu (mm/memory.c:5436)
do_user_addr_fault (arch/x86/mm/fault.c:1357)
exc_page_fault (./arch/x86/include/asm/paravirt.h:695 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1561)
asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:570)

value changed: 0xffff88811a7e6341 -> 0xffff88811a9e0a00

Reported by Kernel Concurrency Sanitizer on:
CPU: 28 PID: 9475 Comm: chrome Tainted: G             L     6.6.0-rc3-kcsan-00146-g9f3ebbef746f-dirty #2
Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
==================================================================

On the write side, you have:

  302 static inline void mte_set_node_dead(struct maple_enode *mn)
  303 {
→ 304         mte_to_node(mn)->parent = ma_parent_ptr(mte_to_node(mn));
  305         smp_wmb(); /* Needed for RCU */
  306 }

On the read side:

  521 static inline struct maple_node *mte_parent(const struct maple_enode *enode)
  522 {
  523         return (void *)((unsigned long)
→ 524                         (mte_to_node(enode)->parent) & ~MAPLE_NODE_MASK);
  525 }

and:

  549 static inline bool mte_dead_node(const struct maple_enode *enode)
  550 {
  551         struct maple_node *parent, *node;
  552
  553         node = mte_to_node(enode);
  554         /* Do not reorder reads from the node prior to the parent check */
  555         smp_rmb();
→ 556         parent = mte_parent(enode);
  557         return (parent == node);
  558 }

I see both smp_wmb() and smp_rmb().

I don't know if this is a KCSAN false positive because I haven't seen what my GCC assembly looks like.

Maybe my AMD Ryzen 9 7950X with 16 cores / 32 threads aggravated things more than it is expected,
and it is not doing what it is supposed to with the memory barriers?

Looks like GCC did some "smart optimisation" of the write in line 304 or read in line 556?

I am only trying to imagine what the maple tree is going through attacked from 16 cores, 32 threads
and couple of hundred threads ...

Please find attached the decoded but not deduplicated dmesg logs. The config is attached, but it is
unmodified from the previous build.


...

Thanks,
Liam

Not at all. I am really puzzled by this and as I said it is a great exercise for my little grey cells.

I've caught a glimpse of Mr. McKenney's article: http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf
and since 2010 things probably got much worse.

I would try this:

Write side:

  302 static inline void mte_set_node_dead(struct maple_enode *mn)
  303 {
→ 304         WRITE_ONCE(mte_to_node(mn)->parent, ma_parent_ptr(mte_to_node(mn)));
  305         smp_wmb(); /* Needed for RCU */
  306 }

On the read side:

  521 static inline struct maple_node *mte_parent(const struct maple_enode *enode)
  522 {
  523         return (void *)((unsigned long)
→ 524                         READ_ONCE(mte_to_node(enode)->parent) & ~MAPLE_NODE_MASK);
  525 }

and:

  549 static inline bool mte_dead_node(const struct maple_enode *enode)
  550 {
  551         struct maple_node *parent, *node;
  552
  553         node = mte_to_node(enode);
  554         /* Do not reorder reads from the node prior to the parent check */
  555         smp_rmb();
→ 556         parent = mte_parent(enode);
  557         return (parent == node);
  558 }

... but I could as well lobotomise my Linux with experiments like that.

Best regards,
Mirsad Todorovac

Attachment: mtree-kcsan-20230930-decoded.log.xz
Description: application/xz

Attachment: mas-kcsan-20230930-02-decoded.log.xz
Description: application/xz

Attachment: config-6.6.0-rc3-kcsan-00146-g9f3ebbef746f-dirty.xz
Description: application/xz


[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux