Re: [linus:master] [hugetlb] 7118fc2906: kernel_BUG_at_lib/list_debug.c

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 1/17/23 19:25, Linus Torvalds wrote:
> On Tue, Jan 17, 2023 at 4:22 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
>>
>> With the following patch to use 'O1' instead 'O2' gcc optoin for
>> page_alloc.c, the list corruption issue can't be reproduced for
>> commit 7118fc2906 in 1000 runs.
> 
> Ugh.
> 
> It would be lovely if you could just narrow it down with
> 
>   #pragma GCC optimize ("O1")
>  ...
>   #pragma GCC optimize ("O2")
> 
> around just that prep_compound_page(), but when I tried it myself I
> get some function attribute mismatch errors.
> 
> 
>> As is can't be reproduced with X86_64 build, it could be i386
>> compiling related.
> 
> Your particular config causes a huge amount of nasty 64-bit arithmetic
> according to the objdump code, with sequences like
> 
>   c13b3cbb:       83 05 d0 28 6c c5 01    addl   $0x1,0xc56c28d0
>   c13b3cc2:       83 15 d4 28 6c c5 00    adcl   $0x0,0xc56c28d4
> 
> which seems to be just from some coverage profiling being on
> (CONFIG_GCOV?), or something. It makes it very hard to read the code.

I think whatever this is, it's also responsible for the bug.
Now from all the dumps I've seen so far, the struct page corruptions looked
like prep_compound_page() is called more times than it should. Which didn't
make sense as the code is very simple, a loop of 1 << order - 1 iterations.

If I look at Feng's disassembly of 48b8d744ea84 (the "good" parent commit),
I can work out that edi holds the 1 << order, ebx is initialized as 1, and
there's a nice clear "inc ebx, cmp ebx, edi, jne <loop>".

Now the disassembly of the "bad" commit is much harder to follow when it
comes to the iteration control. There are xors and or's involved so I didn't
even try to work out the meaning. Clearly it can't be deterministically
wrong or it would crash always and fast. But crucially I think it uses
memory addresses 0xc56c28f8 and 0xc56c28fc in a racy way and that could
explain the bug.

Before the loop iteration itself, it initializes several registers from
these addreses, and the values end up modifying also registers used for the
loop iteration control (ebx/esi) And crucially in the loop iteration it also
writes some values to the same addresses, and there seems to be no
synchronization whatsoever. So it seems to me, even if these addresses are
supposed to be "private" to prep_compound_page(), running two
prep_compound_page() in parallel, or even due to interrupt (Hyeonggon did
reproduce this with -smp 1 as well) can result in a race ultimately
affecting the number of iterations taken.

Attaching the disasm I annotated for more details. Everything register whose
value is coming from 0xc56c28f8/0xc56c28fc in some way is annotated as
"crap" there (sorry).

> You also have UBSAN enabled, which - again - makes for some really
> grotty asm that hides any actual logic.
> 
> Finally, your objdump version also does some horrendous decoding, like
> 
>   c13b3e29:       8d b4 26 00 00 00 00    lea    0x0(%esi,%eiz,1),%esi
> 
> which is just a 7-byte 'nop' instruction, but again, it makes it
> rather hard to actually read the code.
> 
> With the i386 defconfig, gcc generates a function that is just ~30
> instructions for me, so this makes a huge difference in the legibility
> of the code.
> 
> I wonder if you can recreate the issue with a much more
> straightforward config. By all means, leave DEBUG_PAGEALLOC and SLUB
> debugging on, but without the things like UBSAN and GCOV.
> 
>> I also objdumped 'prep_compound_page' for vmlinux of 7118fc2906 and
>> its parent commit 48b8d744ea84, which have big difference than the
>> simple 'set_page_count()' change, but I can't tell which part is
>> abnormal, so attach them for further check.
> 
> Yeah, I can't make heads or tails of them either, see above on how
> illegible the objdump files are. And that's despite not even having
> all of prep_compound_page() in them (it's missing
> prep_compound_page.cold, which is probably just UBSAN fixup code, but
> who knows..)
> 
> That said, with the i386 defconfig, the only change from adding
> set_page_count() to the loop seems to be exactly that:
> 
>  .L589:
> -       movl    $1024, 12(%eax)
> +       movl    $0, 28(%eax)
>         addl    $32, %eax
> +       movl    $1024, -20(%eax)
>         movl    %esi, -28(%eax)
>         movl    $0, -12(%eax)
>         cmpl    %edx, %eax
> 
> (don't ask me why gcc does *one* access using the pre-incremented
> pointer, and then the rest to the post-incremented ones, but whatever
> - it means that it's not just "add a mov $0", it's also changing how
> the
> 
>         p->mapping = TAIL_MAPPING;
> 
> instruction is done, which is that
> 
> -       movl    $1024, 12(%eax)
> +       movl    $1024, -20(%eax)
> 
> part of the change)
> 
>              Linus
inputs:
eax: page
edx: order

stack below ebp
-0x04 saved edi
-0x08 saved esi
-0x0c saved ebp
(sub    $0x14,%esp)
-0x10  crap
-0x14  crap
-0x18  page|1 (compound_head value)
-0x1c  copy of order
-0x20  page
-0x24
esp

eax = page|1 (compound_head value) / crap
edx = page / crap
edi = page / crap
ebx = 1 << order (=2) / 1 << order -2 (=0) / crap!
ecx = page + 1 (first tail page)
esi = 0 / crap

c13b3b90 <prep_compound_page>:
c13b3b90:	55                   	push   %ebp
c13b3b91:	89 e5                	mov    %esp,%ebp
c13b3b93:	57                   	push   %edi
c13b3b94:	89 c7                	mov    %eax,%edi                            edi = page
c13b3b96:	56                   	push   %esi
c13b3b97:	53                   	push   %ebx
c13b3b98:	83 ec 14             	sub    $0x14,%esp
c13b3b9b:	83 fa 1f             	cmp    $0x1f,%edx                           
c13b3b9e:	89 55 e4             	mov    %edx,-0x1c(%ebp)                    
c13b3ba1:	0f 87 33 31 ed 01    	ja     c3286cda <prep_compound_page.cold>   only when order > 31 (?) 
c13b3ba7:	0f b6 4d e4          	movzbl -0x1c(%ebp),%ecx                     ecx = order
c13b3bab:	bb 01 00 00 00       	mov    $0x1,%ebx                            ebx = 1
c13b3bb0:	d3 e3                	shl    %cl,%ebx                             ebx = 1 << order  (=2)
c13b3bb2:	83 3f ff             	cmpl   $0xffffffff,(%edi)
c13b3bb5:	0f 84 65 02 00 00    	je     c13b3e20 <prep_compound_page+0x290>  only when page flags == 0xffffffff (?)
c13b3bbb:	83 05 d0 28 6c c5 01 	addl   $0x1,0xc56c28d0
c13b3bc2:	83 15 d4 28 6c c5 00 	adcl   $0x0,0xc56c28d4
c13b3bc9:	0f ba 2f 10          	btsl   $0x10,(%edi)                         sets bit 0x10 (PG_head) in page flags
c13b3bcd:	83 05 f0 28 6c c5 01 	addl   $0x1,0xc56c28f0
c13b3bd4:	83 15 f4 28 6c c5 00 	adcl   $0x0,0xc56c28f4
c13b3bdb:	83 fb 01             	cmp    $0x1,%ebx
c13b3bde:	0f 8e 80 00 00 00    	jle    c13b3c64 <prep_compound_page+0xd4>   nr_pages <= 1 -> skip over the whole loop
c13b3be4:	8d 47 01             	lea    0x1(%edi),%eax                       eax = page|1 (compound_head value)
c13b3be7:	8b 15 fc 28 6c c5    	mov    0xc56c28fc,%edx                      edx now some crap, we don't know how this was initialized                                          !!!
c13b3bed:	89 45 e8             	mov    %eax,-0x18(%ebp)                     saves the page|1                                                                                   
c13b3bf0:	a1 f8 28 6c c5       	mov    0xc56c28f8,%eax                      eax now some crap, again we don't know how this was initialized                                    !!!
c13b3bf5:	8d 4f 28             	lea    0x28(%edi),%ecx                      ecx = edi + 0x28 - first tail page
c13b3bf8:	89 7d e0             	mov    %edi,-0x20(%ebp)                     saves page pointer
c13b3bfb:	83 c0 01             	add    $0x1,%eax                            crap + 1 
c13b3bfe:	89 45 ec             	mov    %eax,-0x14(%ebp)                     saves the crap
c13b3c01:	83 d2 00             	adc    $0x0,%edx                            crap in crap out
c13b3c04:	a1 f8 28 6c c5       	mov    0xc56c28f8,%eax                      reset as crap
c13b3c09:	89 55 f0             	mov    %edx,-0x10(%ebp)                     save more crap
c13b3c0c:	8b 15 fc 28 6c c5    	mov    0xc56c28fc,%edx                      reset as crap
c13b3c12:	83 eb 02             	sub    $0x2,%ebx                            ebx is nr_pages - 2 (for order-1 was 2, now 0) - remaining tail pages after first iteration?
c13b3c15:	31 f6                	xor    %esi,%esi                            esi is 0
c13b3c17:	83 c0 02             	add    $0x2,%eax                            crap + 2
c13b3c1a:	83 d2 00             	adc    $0x0,%edx                            crap
c13b3c1d:	01 c3                	add    %eax,%ebx                            ebx is now modified by crap !!!                                                                    !!!
c13b3c1f:	8b 45 ec             	mov    -0x14(%ebp),%eax                     reset as crap
c13b3c22:	11 d6                	adc    %edx,%esi                            esi now also crap
c13b3c24:	8b 55 f0             	mov    -0x10(%ebp),%edx                     reset as crap
c13b3c27:	89 f7                	mov    %esi,%edi                            edi now also crap, oh my
c13b3c29:	89 de                	mov    %ebx,%esi                            supposed to be the remaining tail pages, byt modified by crap
c13b3c2b:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi                elaborate nop
c13b3c2f:	90                   	nop                                         nop

// this should be the loop for prep_compound_tail
c13b3c30:	a3 f8 28 6c c5       	mov    %eax,0xc56c28f8                      uh now we saved some value to the memory we read previously assuming some known good value?        !!!
c13b3c35:	8b 5d e8             	mov    -0x18(%ebp),%ebx                     ebx = page|1 (compound_head value)
c13b3c38:	83 c0 01             	add    $0x1,%eax                            still crap
c13b3c3b:	89 15 fc 28 6c c5    	mov    %edx,0xc56c28fc                      and now we wrote to the other address as well?                                                     !!!
c13b3c41:	83 d2 00             	adc    $0x0,%edx                            still crap
c13b3c44:	83 c1 28             	add    $0x28,%ecx                           ecx pointed to first tail pages, now points to second tail page
c13b3c47:	c7 41 e4 00 04 00 00 	movl   $0x400,-0x1c(%ecx)                   (first tail page)->mapping = TAIL_MAPPING;
c13b3c4e:	89 59 dc             	mov    %ebx,-0x24(%ecx)                     (first tail page)->compound_head is now set
c13b3c51:	89 fb                	mov    %edi,%ebx                            ebx now crap
c13b3c53:	31 d3                	xor    %edx,%ebx                            ebx xored with more crap
c13b3c55:	89 5d ec             	mov    %ebx,-0x14(%ebp)                     we might need that crap later
c13b3c58:	89 f3                	mov    %esi,%ebx                            supposed to be the remaining tail pages, byt modified by crap
c13b3c5a:	31 c3                	xor    %eax,%ebx                            yeah why not xor it with more crap
c13b3c5c:	0b 5d ec             	or     -0x14(%ebp),%ebx                     and "or" it with the previously saved crap
c13b3c5f:	75 cf                	jne    c13b3c30 <prep_compound_page+0xa0>   and that's how we decided if we should do another iteration :(

c13b3c61:	8b 7d e0             	mov    -0x20(%ebp),%edi                     restore edi = page

// here we land if we skip everything at c13b3bde
c13b3c64:	c6 47 30 01          	movb   $0x1,0x30(%edi)
// the rest shouldn't be important

[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux