On 9/15/21 01:29, Linus Torvalds wrote:
> On Tue, Sep 14, 2021 at 3:48 PM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>>
>> Well, looks like I can't. Commit 77e02cf57b6cf does boot fine for me,
>> multiple times. But so now does the parent commit 6a4746ba06191. Looks like
>> the magic is gone. I'm now surprised how deterministic it was during the
>> bisect (most bad cases manifested on first boot, only few at second).
>
> Well, your report was clearly memory corruption by the invalid
> memblock_free() just ending up causing random problems later on.
> So it could easily be 100% deterministic with a certain memory layout
> at a particular commit. And then enough other changes later, and it's
> all gone, because the memory corruption now hits something else that
> didn't even care.
>
> The code for your oops was
>
>    0:   48 8b 17       mov    (%rdi),%rdx
>    3:   48 39 d7       cmp    %rdx,%rdi
>    6:   74 43          je     0x4b
>    8:   48 8b 47 08    mov    0x8(%rdi),%rax
>    c:   48 85 c0       test   %rax,%rax
>    f:   74 23          je     0x34
>   11:   49 89 c0       mov    %rax,%r8
>   14:*  48 8b 40 10    mov    0x10(%rax),%rax   <-- trapping instruction
>
> and that's the start of rb_next(), so what's going on is that
> "rb->rb_right" (the second word of 'struct rb_node') ends up having
> that value in %rax:
>
>   RAX: 343479726f6d656d
>
> which is ASCII "44yromem" rather than a valid pointer if I looked that up right.

Yep, I was pretty sure it was related to the
"/sys/bus/memory/devices/memory44" sysfs object and bisection would lead to
kobject/sysfs or some memory hotplug related changes. So the result was a
surprise.

> And just _slightly_ different allocation patterns, and your 'struct
> rb_node' gets allocated somewhere else, and you don't see the oops at
> all, or you get it later in some different place.
>
> Most memory corruption doesn't cause oopses, because most memory isn't
> used as pointers etc.
>
> What you _could_ try if you care enough is
>
>  - go back to the thing you bisected to where you can still hopefully
>    recreate the problem
>
>  - apply that patch at that point with no other changes
>
> and then the test would hopefully be closer to the state you could
> re-create the problem.
>
> And hopefully it would still not reproduce, just because the bug is
> fixed, of course ;)

Yeah, that worked! Commit 40caa127f3c7 was still broken, and cherry-pick of
77e02cf57b6cf on top fixed it. Thanks!

> The very unlikely alternative is that your bisect was just pure random
> bad luck and hit the wrong commit entirely, and the oops was due to
> some other problem.
>
> But it does seem unlikely to be something else. Usually when bisects
> go off into the weeds due to not being reproducible, they go very
> obviously off into the weeds rather than point to something that ends
> up having a very similar bug.
>
>               Linus
>
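
For anyone wanting to double-check the RAX decoding quoted above, here is a
minimal Python sketch. It assumes the value was loaded from memory on a
little-endian x86-64 machine, so the lowest byte of the register is the
first byte in memory. The helper name register_as_ascii is made up for
illustration and is not part of any kernel tooling.

```python
def register_as_ascii(value: int, byteorder: str, width: int = 8) -> str:
    """Render a register value as ASCII, one character per byte.

    byteorder="big" follows the hex digits as printed in the oops;
    byteorder="little" follows the order the bytes sit in memory on x86-64.
    Non-printable bytes are shown as '.'.
    """
    raw = value.to_bytes(width, byteorder=byteorder)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


rax = 0x343479726F6D656D  # RAX value from the oops quoted above

as_printed = register_as_ascii(rax, "big")     # hex digits read left to right
in_memory = register_as_ascii(rax, "little")   # byte order in memory

print(as_printed)
print(in_memory)
```

Read left to right as printed, the value spells "44yromem"; in memory order
it spells "memory44", which matches the name of the
/sys/bus/memory/devices/memory44 object.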