Re: [GIT PULL] tracing: Fixes to bootconfig memory management

On 9/15/21 01:29, Linus Torvalds wrote:
> On Tue, Sep 14, 2021 at 3:48 PM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>>
>> Well, looks like I can't. Commit 77e02cf57b6cf does boot fine for me,
>> multiple times. But so now does the parent commit 6a4746ba06191. Looks like
>> the magic is gone. I'm now surprised how deterministic it was during the
>> bisect (most bad cases manifested on the first boot, only a few on the second).
> 
> Well, your report was clearly memory corruption by the invalid
> memblock_free() just ending up causing random problems later on.
> 
> So it could easily be 100% deterministic with a certain memory layout
> at a particular commit. And then enough other changes later, and it's
> all gone, because the memory corruption now hits something else that
> didn't even care.
> 
> The code for your oops was
> 
>    0:  48 8b 17       mov    (%rdi),%rdx
>    3:  48 39 d7       cmp    %rdx,%rdi
>    6:  74 43          je     0x4b
>    8:  48 8b 47 08    mov    0x8(%rdi),%rax
>    c:  48 85 c0       test   %rax,%rax
>    f:  74 23          je     0x34
>   11:  49 89 c0       mov    %rax,%r8
>   14:* 48 8b 40 10    mov    0x10(%rax),%rax <-- trapping instruction
> 
> and that's the start of rb_next(), so what's going on is that
> "rb->rb_right" (the second word of 'struct rb_node') ends up having
> that value in %rax:
> 
>   RAX: 343479726f6d656d
> 
> which is ASCII "44yromem" rather than a valid pointer if I looked that up right.
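
For anyone following along, the offsets in that disassembly can be matched
against the field order of 'struct rb_node'. The following is only a userspace
sketch: it mirrors the field order of the kernel struct, assumes an LP64
target such as x86-64, and rb_node_field_at() and rb_node_layout_check() are
hypothetical helpers made up for illustration, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace mirror of the field order of the kernel's struct rb_node. */
struct rb_node {
	unsigned long __rb_parent_color; /* parent pointer, colour in the low bits */
	struct rb_node *rb_right;
	struct rb_node *rb_left;
};

/* Hypothetical helper: name the field that starts at byte offset 'off'. */
static const char *rb_node_field_at(size_t off)
{
	if (off == offsetof(struct rb_node, __rb_parent_color))
		return "__rb_parent_color";
	if (off == offsetof(struct rb_node, rb_right))
		return "rb_right";
	if (off == offsetof(struct rb_node, rb_left))
		return "rb_left";
	return "?";
}

/* Hypothetical self-check: the offsets seen in the oops, assuming LP64. */
static void rb_node_layout_check(void)
{
	/* mov (%rdi),%rdx : first word, compared against the node itself */
	assert(strcmp(rb_node_field_at(0x0), "__rb_parent_color") == 0);
	/* mov 0x8(%rdi),%rax : second word, the value that ended up in RAX */
	assert(strcmp(rb_node_field_at(0x8), "rb_right") == 0);
	/* mov 0x10(%rax),%rax : rb_left of the node RAX claims to point at */
	assert(strcmp(rb_node_field_at(0x10), "rb_left") == 0);
}
```

Under those assumptions, the trapping load is the first step of the descent
into the right subtree, dereferencing whatever was stored in rb_right.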

Yep, I was pretty sure it was related to the
"/sys/bus/memory/devices/memory44" sysfs object and bisection would lead to
kobject/sysfs or some memory hotplug related changes. So the result was a
surprise.
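
The reversed string falls out of x86-64 being little-endian: the register dump
prints the most significant byte first, while memory holds the least
significant byte first. A minimal sketch of that decoding, with helper names
(reg_to_ascii(), rax_decode_check()) that are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Render the eight bytes of a 64-bit register value as ASCII into out[9].
 * memory_order != 0: least significant byte first (little-endian memory order).
 * memory_order == 0: most significant byte first (as the hex value is printed).
 * Non-printable bytes are shown as '.'.
 */
static void reg_to_ascii(uint64_t reg, int memory_order, char out[9])
{
	for (int i = 0; i < 8; i++) {
		int shift = memory_order ? 8 * i : 8 * (7 - i);
		unsigned char c = (unsigned char)((reg >> shift) & 0xff);

		out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
	}
	out[8] = '\0';
}

/* Self-check using the RAX value from the oops quoted above. */
static void rax_decode_check(void)
{
	const uint64_t rax = 0x343479726f6d656dULL;
	char buf[9];

	reg_to_ascii(rax, 0, buf);
	assert(strcmp(buf, "44yromem") == 0); /* as read off the register dump */

	reg_to_ascii(rax, 1, buf);
	assert(strcmp(buf, "memory44") == 0); /* as the bytes sit in memory */
}
```

So the word that landed in rb_right holds the bytes "memory44", which matches
the sysfs object name above.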

> And just _slightly_ different allocation patterns, and your 'struct
> rb_node' gets allocated somewhere else, and you don't see the oops at
> all, or you get it later in some different place.
> 
> Most memory corruption doesn't cause oopses, because most memory isn't
> used as pointers etc.
> 
> What you _could_ try if you care enough is
> 
>  - go back to the thing you bisected to where you can still hopefully
> recreate the problem
> 
>  - apply that patch at that point with no other changes
> 
> and then the test would hopefully be closer to the state in which you
> could re-create the problem.
> 
> And hopefully it would still not reproduce, just because the bug is
> fixed, of course ;)

Yeah, that worked! Commit 40caa127f3c7 was still broken, and cherry-pick of
77e02cf57b6cf on top fixed it. Thanks!

> The very unlikely alternative is that your bisect was just pure random
> bad luck and hit the wrong commit entirely, and the oops was due to
> some other problem.
> 
> But it does seem unlikely to be something else. Usually when bisects
> go off into the weeds due to not being reproducible, they go very
> obviously off into the weeds rather than point to something that ends
> up having a very similar bug.
> 
>            Linus
> 




