Re: Oops in VMA code

Alexander Graf <agraf@xxxxxxx> · Thu, 16 Jun 2011 08:12:21 +0200

On 16.06.2011, at 08:02, Benjamin Herrenschmidt wrote:

> On Thu, 2011-06-16 at 07:32 +0200, Alexander Graf wrote:
>> On 16.06.2011, at 06:32, Linus Torvalds wrote:
> 
>> Thanks a lot for looking at it either way :).
> 
> Yeah thanks ;-) Let me see what I can dig out.
> 
> First it's a load from what looks like a valid pointer to the linear
> mapping that had one byte corrupted (or more but it looks reasonably
> "clean"). It's not a one bit error, there's at least 2 bad bits (the
> 09):
> 
> DAR: c00090026236bbc0
> 
> Alex, how much RAM do you have? If that was just a one byte corruption,
> the above would imply you have something valid between 9 and 10G. From
> the look of other registers, it seems that it could be a genuine pointer
> with just that stray "09" byte that landed on it.

Heh, you beat me to it. I was just writing up a reply to Linus explaining that I only have 8GB of RAM and that this address has more invalid bits than just the "09". It's either complete garbage from the 3rd byte onwards, or at least the 0x9002 is wrong.
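
To make the arithmetic explicit, here is a small userspace sketch of the check. It assumes the ppc64 linear mapping starts at 0xc000000000000000 and that RAM is contiguous from physical address 0; the helper names are made up for illustration and are not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: base of the ppc64 kernel linear mapping. */
#define LINEAR_BASE 0xc000000000000000ULL
#define GIB (1ULL << 30)

/* Offset of a linear-mapping address from the start of RAM. */
uint64_t linear_offset(uint64_t addr)
{
	return addr - LINEAR_BASE;
}

/*
 * True if addr falls inside the linear mapping of a machine with
 * ram_bytes of memory (assuming contiguous RAM starting at 0).
 */
bool in_linear_ram(uint64_t addr, uint64_t ram_bytes)
{
	return addr >= LINEAR_BASE && linear_offset(addr) < ram_bytes;
}

/* Number of bit positions in which two addresses differ. */
unsigned int bits_differing(uint64_t a, uint64_t b)
{
	uint64_t x = a ^ b;
	unsigned int n = 0;

	while (x) {
		x &= x - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}
```

With the DAR from the oops, clearing only the 0x90 gives 0xc00000026236bbc0, which is 2 bits away from the DAR but sits between 9 and 10G into the linear mapping, so in_linear_ram() is false for 8 * GIB. Clearing the whole 0x9002 gives 0xc00000006236bbc0, which is 3 bits away and does land inside 8GB.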

> 
>> The latter is the one I'm executing, while the former still has all
>> the symbols. But you're right. It looks like this is simply an inlined
>> function - which is why it got stripped away. Here's the disassembly
>> of the whole do_unmap function. I hope it's of use despite your fading
>> PPC asm skills :). Host compiler is gcc 4.3.4 from SLES11SP1.
> 
> .../...
> 
> Ok, so let's see what we can dig from here. It -looks- like:
> 
> if (!mm) goto out;
> 
>> 0xc000000000190554 <find_vma_prev>:	cmpdi   cr7,r3,0
>> 0xc000000000190558 <find_vma_prev+4>:	beq     cr7,0xc0000000001907f0 <remove_vma_list+836>
> 
> rb_node = mm->mm_rb.rb_node; (rb_node in r9):
> 
>> 0xc00000000019055c <find_vma_prev+8>:	ld      r9,8(r3)
> 
> vma = mm->mmap (vma in r28)
> 
>> 0xc000000000190560 <find_vma_prev+12>:	ld      r28,0(r3)
>> 0xc000000000190564 <find_vma_prev+16>:	li      r11,0
>> 0xc000000000190568 <find_vma_prev+20>:	li      r26,0
> 
> while(rb_node)...
> 
>> 0xc00000000019056c <find_vma_prev+24>:	cmpdi   cr7,r9,0
>> 0xc000000000190570 <find_vma_prev+28>:	bne     cr7,0xc000000000190594 <find_vma_prev+64>
>> 0xc000000000190574 <find_vma_prev+32>:	b       0xc0000000001905d0 <do_munmap+368>
>> 0xc000000000190578 <find_vma_prev+36>:	nop
>> 0xc00000000019057c <find_vma_prev+40>:	nop
>> 0xc000000000190580 <find_vma_prev+44>:	ld      r9,16(r9)
>> 0xc000000000190584 <find_vma_prev+48>:	mr      r26,r11
>> 0xc000000000190588 <find_vma_prev+52>:	cmpdi   cr7,r9,0
>> 0xc00000000019058c <find_vma_prev+56>:	mr      r11,r26
>> 0xc000000000190590 <find_vma_prev+60>:	beq     cr7,0xc0000000001905c4 <find_vma_prev+112>
> 
> vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
> 
>> 0xc000000000190594 <find_vma_prev+64>:	addi    r26,r9,-56
> 
> if (vma_tmp->vm_end)
> 
>> 0xc000000000190598 <find_vma_prev+68>:	ld      r0,16(r26)
> 
> Here we go. So here vma_tmp is crap, which we got out of the rb_tree,
> so it's either corruption or use after free I'd say. It could also be a
> completely unrelated memory corruption of course....

I'm usually pretty sceptical about blaming hardware for memory corruption issues, so this would mean some random code overwrote things here. Sounds pretty hard to find to me.
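
For anyone following along without the source at hand, the walk annotated above can be sketched in userspace roughly as below. The structs are mock-ups whose field order is an assumption chosen to match the offsets in the disassembly on an LP64 build (rb_left at 16, vm_end at 16, vm_rb at 56); the else branch of the loop is written from memory of mm/mmap.c of that era, so treat it as approximate rather than as the exact kernel code. The name find_vma_prev_sketch and the flattened rb_root field are illustrative only.

```c
#include <stddef.h>

/* Mock-ups; layout is an assumption matching the disassembly (LP64). */
struct rb_node {
	unsigned long rb_parent_color;	/* offset 0 */
	struct rb_node *rb_right;	/* offset 8 */
	struct rb_node *rb_left;	/* offset 16: ld r9,16(r9) */
};

struct vm_area_struct {
	void *vm_mm;			/* offset 0 */
	unsigned long vm_start;		/* offset 8 */
	unsigned long vm_end;		/* offset 16: ld r0,16(r26) */
	struct vm_area_struct *vm_next;	/* offset 24 */
	struct vm_area_struct *vm_prev;	/* offset 32 */
	unsigned long vm_page_prot;	/* offset 40 */
	unsigned long vm_flags;		/* offset 48 */
	struct rb_node vm_rb;		/* offset 56: addi r26,r9,-56 */
};

struct mm_struct {
	struct vm_area_struct *mmap;	/* offset 0: ld r28,0(r3) */
	struct rb_node *rb_root;	/* offset 8: stands in for mm_rb.rb_node */
};

_Static_assert(offsetof(struct rb_node, rb_left) == 16, "LP64 assumed");
_Static_assert(offsetof(struct vm_area_struct, vm_end) == 16, "LP64 assumed");
_Static_assert(offsetof(struct vm_area_struct, vm_rb) == 56, "LP64 assumed");

/* Step back from an embedded member to its containing struct. */
#define rb_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct vm_area_struct *find_vma_prev_sketch(struct mm_struct *mm,
					    unsigned long addr,
					    struct vm_area_struct **pprev)
{
	struct vm_area_struct *vma = NULL, *prev = NULL;
	struct rb_node *rb_node;

	if (!mm)
		goto out;

	vma = mm->mmap;
	rb_node = mm->rb_root;

	while (rb_node) {
		struct vm_area_struct *vma_tmp =
			rb_entry(rb_node, struct vm_area_struct, vm_rb);

		/* This vm_end load is the access that faulted in the oops. */
		if (addr < vma_tmp->vm_end) {
			rb_node = rb_node->rb_left;
		} else {
			prev = vma_tmp;
			if (!prev->vm_next || addr < prev->vm_next->vm_end)
				break;
			rb_node = rb_node->rb_right;
		}
	}
out:
	*pprev = prev;
	return prev ? prev->vm_next : vma;
}
```

If that layout is right, the faulting load reads from rb_node - 56 + 16, so the bad rb_node pointer itself would have been DAR + 40, i.e. 0xc00090026236bbe8.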

> If you had xmon we could have dug a little bit more to see what's
> before/after etc... but like this it doesn't ring any special bell to
> me.

Yeah, I've since rebooted the machine :). Let's just leave it here and see if maybe someone else stumbles over the same thing, so we can potentially gather some data points. I'd say it's unlikely that this is really related to the memory management code.

Alex
