On Wed, Jan 18, 2023 at 02:35:17PM +0100, Vlastimil Babka wrote: > On 1/17/23 19:25, Linus Torvalds wrote: > > On Tue, Jan 17, 2023 at 4:22 AM Feng Tang <feng.tang@xxxxxxxxx> wrote: > >> > >> With the following patch to use 'O1' instead 'O2' gcc optoin for > >> page_alloc.c, the list corruption issue can't be reproduced for > >> commit 7118fc2906 in 1000 runs. > > > > Ugh. > > > > It would be lovely if you could just narrow it down with > > > > #pragma GCC optimize ("O1") > > ... > > #pragma GCC optimize ("O2") > > > > around just that prep_compound_page(), but when I tried it myself I > > get some function attribute mismatch errors. > > > > > >> As is can't be reproduced with X86_64 build, it could be i386 > >> compiling related. > > > > Your particular config causes a huge amount of nasty 64-bit arithmetic > > according to the objdump code, with sequences like > > > > c13b3cbb: 83 05 d0 28 6c c5 01 addl $0x1,0xc56c28d0 > > c13b3cc2: 83 15 d4 28 6c c5 00 adcl $0x0,0xc56c28d4 > > > > which seems to be just from some coverage profiling being on > > (CONFIG_GCOV?), or something. It makes it very hard to read the code. > > I think whatever this is, it's also responsible for the bug. > Now from all the dumps I've seen so far, the struct page corruptions looked > like prep_compound_page() is called more times than it should. Which didn't > make sense as the code is very simple, a loop of 1 << order - 1 iterations. > > If I look at Feng's disassembly of 48b8d744ea84 (the "good" parent commit), > I can work out that edi holds the 1 << order, ebx is initialized as 1, and > there's a nice clear "inc ebx, cmp ebx, edi, jne <loop>". > > Now the disassembly of the "bad" commit is much harder to follow when it > comes to the iteration control. There are xors and or's involved so I didn't > even try to work out the meaning. Clearly it can't be deterministically > wrong or it would crash always and fast. But crucially I think it uses > memory addresses 0xc56c28f8 and 0xc56c28fc in a racy way and that could > explain the bug. > > Before the loop iteration itself, it initializes several registers from > these addreses, and the values end up modifying also registers used for the > loop iteration control (ebx/esi) And crucially in the loop iteration it also > writes some values to the same addresses, and there seems to be no > synchronization whatsoever. So it seems to me, even if these addresses are > supposed to be "private" to prep_compound_page(), running two > prep_compound_page() in parallel, or even due to interrupt (Hyeonggon did > reproduce this with -smp 1 as well) can result in a race ultimately > affecting the number of iterations taken. Thanks for the decoding! For the strange memory address 0xc56c28xx in the assembly, they are GCOV related memory for prep_compound_page(), as from the System.map: c56c28a0 b __gcov0.free_compound_page c56c28c0 b __gcov0.prep_compound_page c56c2918 b __gcov0.early_debug_pagealloc And your analysis of racing of these region makes sense, that some pointer could be changed when a race happens. From the memory dump in previous thread, it was always one off issue, that page[2] for order 1 and page[8] for order 3 were wrongly written. Thanks, Feng > Attaching the disasm I annotated for more details. Everything register whose > value is coming from 0xc56c28f8/0xc56c28fc in some way is annotated as > "crap" there (sorry). > > > You also have UBSAN enabled, which - again - makes for some really > > grotty asm that hides any actual logic. > > > > Finally, your objdump version also does some horrendous decoding, like > > > > c13b3e29: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi > > > > which is just a 7-byte 'nop' instruction, but again, it makes it > > rather hard to actually read the code. > > > > With the i386 defconfig, gcc generates a function that is just ~30 > > instructions for me, so this makes a huge difference in the legibility > > of the code. > > > > I wonder if you can recreate the issue with a much more > > straightforward config. By all means, leave DEBUG_PAGEALLOC and SLUB > > debugging on, but without the things like UBSAN and GCOV. > > > >> I also objdumped 'prep_compound_page' for vmlinux of 7118fc2906 and > >> its parent commit 48b8d744ea84, which have big difference than the > >> simple 'set_page_count()' change, but I can't tell which part is > >> abnormal, so attach them for further check. > > > > Yeah, I can't make heads or tails of them either, see above on how > > illegible the objdump files are. And that's despite not even having > > all of prep_compound_page() in them (it's missing > > prep_compound_page.cold, which is probably just UBSAN fixup code, but > > who knows..) > > > > That said, with the i386 defconfig, the only change from adding > > set_page_count() to the loop seems to be exactly that: > > > > .L589: > > - movl $1024, 12(%eax) > > + movl $0, 28(%eax) > > addl $32, %eax > > + movl $1024, -20(%eax) > > movl %esi, -28(%eax) > > movl $0, -12(%eax) > > cmpl %edx, %eax > > > > (don't ask me why gcc does *one* access using the pre-incremented > > pointer, and then the rest to the post-incremented ones, but whatever > > - it means that it's not just "add a mov $0", it's also changing how > > the > > > > p->mapping = TAIL_MAPPING; > > > > instruction is done, which is that > > > > - movl $1024, 12(%eax) > > + movl $1024, -20(%eax) > > > > part of the change) > > > > Linus > inputs: > eax: page > edx: order > > stack below ebp > -0x04 saved edi > -0x08 saved esi > -0x0c saved ebp > (sub $0x14,%esp) > -0x10 crap > -0x14 crap > -0x18 page|1 (compound_head value) > -0x1c copy of order > -0x20 page > -0x24 > esp > > eax = page|1 (compound_head value) / crap > edx = page / crap > edi = page / crap > ebx = 1 << order (=2) / 1 << order -2 (=0) / crap! > ecx = page + 1 (first tail page) > esi = 0 / crap > > c13b3b90 <prep_compound_page>: > c13b3b90: 55 push %ebp > c13b3b91: 89 e5 mov %esp,%ebp > c13b3b93: 57 push %edi > c13b3b94: 89 c7 mov %eax,%edi edi = page > c13b3b96: 56 push %esi > c13b3b97: 53 push %ebx > c13b3b98: 83 ec 14 sub $0x14,%esp > c13b3b9b: 83 fa 1f cmp $0x1f,%edx > c13b3b9e: 89 55 e4 mov %edx,-0x1c(%ebp) > c13b3ba1: 0f 87 33 31 ed 01 ja c3286cda <prep_compound_page.cold> only when order > 31 (?) > c13b3ba7: 0f b6 4d e4 movzbl -0x1c(%ebp),%ecx ecx = order > c13b3bab: bb 01 00 00 00 mov $0x1,%ebx ebx = 1 > c13b3bb0: d3 e3 shl %cl,%ebx ebx = 1 << order (=2) > c13b3bb2: 83 3f ff cmpl $0xffffffff,(%edi) > c13b3bb5: 0f 84 65 02 00 00 je c13b3e20 <prep_compound_page+0x290> only when page flags == 0xffffffff (?) > c13b3bbb: 83 05 d0 28 6c c5 01 addl $0x1,0xc56c28d0 > c13b3bc2: 83 15 d4 28 6c c5 00 adcl $0x0,0xc56c28d4 > c13b3bc9: 0f ba 2f 10 btsl $0x10,(%edi) sets bit 0x10 (PG_head) in page flags > c13b3bcd: 83 05 f0 28 6c c5 01 addl $0x1,0xc56c28f0 > c13b3bd4: 83 15 f4 28 6c c5 00 adcl $0x0,0xc56c28f4 > c13b3bdb: 83 fb 01 cmp $0x1,%ebx > c13b3bde: 0f 8e 80 00 00 00 jle c13b3c64 <prep_compound_page+0xd4> nr_pages <= 1 -> skip over the whole loop > c13b3be4: 8d 47 01 lea 0x1(%edi),%eax eax = page|1 (compound_head value) > c13b3be7: 8b 15 fc 28 6c c5 mov 0xc56c28fc,%edx edx now some crap, we don't know how this was initialized !!! > c13b3bed: 89 45 e8 mov %eax,-0x18(%ebp) saves the page|1 > c13b3bf0: a1 f8 28 6c c5 mov 0xc56c28f8,%eax eax now some crap, again we don't know how this was initialized !!! > c13b3bf5: 8d 4f 28 lea 0x28(%edi),%ecx ecx = edi + 0x28 - first tail page > c13b3bf8: 89 7d e0 mov %edi,-0x20(%ebp) saves page pointer > c13b3bfb: 83 c0 01 add $0x1,%eax crap + 1 > c13b3bfe: 89 45 ec mov %eax,-0x14(%ebp) saves the crap > c13b3c01: 83 d2 00 adc $0x0,%edx crap in crap out > c13b3c04: a1 f8 28 6c c5 mov 0xc56c28f8,%eax reset as crap > c13b3c09: 89 55 f0 mov %edx,-0x10(%ebp) save more crap > c13b3c0c: 8b 15 fc 28 6c c5 mov 0xc56c28fc,%edx reset as crap > c13b3c12: 83 eb 02 sub $0x2,%ebx ebx is nr_pages - 2 (for order-1 was 2, now 0) - remaining tail pages after first iteration? > c13b3c15: 31 f6 xor %esi,%esi esi is 0 > c13b3c17: 83 c0 02 add $0x2,%eax crap + 2 > c13b3c1a: 83 d2 00 adc $0x0,%edx crap > c13b3c1d: 01 c3 add %eax,%ebx ebx is now modified by crap !!! !!! > c13b3c1f: 8b 45 ec mov -0x14(%ebp),%eax reset as crap > c13b3c22: 11 d6 adc %edx,%esi esi now also crap > c13b3c24: 8b 55 f0 mov -0x10(%ebp),%edx reset as crap > c13b3c27: 89 f7 mov %esi,%edi edi now also crap, oh my > c13b3c29: 89 de mov %ebx,%esi supposed to be the remaining tail pages, byt modified by crap > c13b3c2b: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi elaborate nop > c13b3c2f: 90 nop nop > > // this should be the loop for prep_compound_tail > c13b3c30: a3 f8 28 6c c5 mov %eax,0xc56c28f8 uh now we saved some value to the memory we read previously assuming some known good value? !!! > c13b3c35: 8b 5d e8 mov -0x18(%ebp),%ebx ebx = page|1 (compound_head value) > c13b3c38: 83 c0 01 add $0x1,%eax still crap > c13b3c3b: 89 15 fc 28 6c c5 mov %edx,0xc56c28fc and now we wrote to the other address as well? !!! > c13b3c41: 83 d2 00 adc $0x0,%edx still crap > c13b3c44: 83 c1 28 add $0x28,%ecx ecx pointed to first tail pages, now points to second tail page > c13b3c47: c7 41 e4 00 04 00 00 movl $0x400,-0x1c(%ecx) (first tail page)->mapping = TAIL_MAPPING; > c13b3c4e: 89 59 dc mov %ebx,-0x24(%ecx) (first tail page)->compound_head is now set > c13b3c51: 89 fb mov %edi,%ebx ebx now crap > c13b3c53: 31 d3 xor %edx,%ebx ebx xored with more crap > c13b3c55: 89 5d ec mov %ebx,-0x14(%ebp) we might need that crap later > c13b3c58: 89 f3 mov %esi,%ebx supposed to be the remaining tail pages, byt modified by crap > c13b3c5a: 31 c3 xor %eax,%ebx yeah why not xor it with more crap > c13b3c5c: 0b 5d ec or -0x14(%ebp),%ebx and "or" it with the previously saved crap > c13b3c5f: 75 cf jne c13b3c30 <prep_compound_page+0xa0> and that's how we decided if we should do another iteration :( > > c13b3c61: 8b 7d e0 mov -0x20(%ebp),%edi restore edi = page > > // here we land if we skip everything at c13b3bde > c13b3c64: c6 47 30 01 movb $0x1,0x30(%edi) > // the rest shouldn't be important