Re: [Bug 45351] General protection fault in raid5, load_balance

Jim Kukunas <james.t.kukunas@xxxxxxxxxxxxxxx> · Thu, 29 Nov 2012 13:54:53 -0800

On Thu, Nov 29, 2012 at 12:24:48PM +1100, Neil Brown wrote:
> On Wed, 28 Nov 2012 10:33:06 +0000 (UTC) bugzilla-daemon@xxxxxxxxxxxxxxxxxxx
> wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=45351
> > 
> > 
> > Cyril B. <cbay@xxxxxxxxxxxxx> changed:
> > 
> >            What    |Removed                     |Added
> > ----------------------------------------------------------------------------
> >      Kernel Version|3.5.0                       |3.5.0, 3.6.8
> > 
> > 
> > 
> > 
> > --- Comment #1 from Cyril B. <cbay@xxxxxxxxxxxxx>  2012-11-28 10:33:05 ---
> > I've just tested 3.6.8, I still get the same bug/trace.
> > 
> 
> Hi Jim,
>  could you look at this bug please?

Hi Neil,

Thank you for bringing this to my attention.

> 
> https://bugzilla.kernel.org/show_bug.cgi?id=45351
> 
> It seems to be crashing in xor_avx_4:
> 
> [48595.135046] general protection fault: 0000 [#1] SMP
> [48595.135093] CPU 0
> [48595.135098] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_multiport coretemp
> hwmon i2c_i801 shpchp pci_hotplug ehci_hcd usbcore usb_common netconsole e1000e
> [last unloaded: scsi_wait_scan]
> [48595.135211]
> [48595.135224] Pid: 2429, comm: md4_raid5 Not tainted 3.5.0 #2                 
> /DH67BL
> [48595.135263] RIP: 0010:[<ffffffff813512d8>]  [<ffffffff813512d8>] xor_avx_4+0x48/0x350
> [48595.135303] RSP: 0018:ffff880213a259d0  EFLAGS: 00010282
> [48595.135323] RAX: 000000008005003b RBX: 0000000000000008 RCX: ffff8802130b5000
> [48595.135346] RDX: ffff880212c9f000 RSI: ffff880212c9e000 RDI: 0000000000001000
> [48595.135368] RBP: ffff880213a25ac0 R08: ffff8802130b4000 R09: ffff880212c9e000
> [48595.135391] R10: ffff880212c9e000 R11: 0000000000000000 R12: 000000008005003b
> [48595.135413] R13: 0000000000000003 R14: ffff880213a25cd0 R15: 0000000000001000
> [48595.135436] FS:  0000000000000000(0000) GS:ffff88021fa00000(0000) knlGS:0000000000000000
> [48595.135471] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [48595.135492] CR2: 000000000235f570 CR3: 0000000001c0b000 CR4: 00000000000407f0
> [48595.135514] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [48595.135537] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> ....
> 
> [48595.136063] Code: b5 30 ff ff ff 48 89 95 28 ff ff ff 48 89 8d 20 ff ff ff 4c 89 85 18 ff ff ff e8 c4 04 ce ff 66 90 49 89 c4 0f 06 66 66 90 66 90 <c5> fc 29 85 50 ff ff ff c5 fc 29 8d 70 ff ff ff c5 fc 29 55 90

The code dump above is quite revealing. The relevant instructions are:

	clts
	vmovaps	%ymm0, -0xb0(%rbp)
	vmovaps	%ymm1, -0x90(%rbp)
	vmovaps	%ymm2, -0x70(%rbp)

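These instructions save the floating point state before we begin the
actual xor work. Looking at the register dump, RBP is ffff880213a25ac0,
so -0xb0(%rbp) is ffff880213a25a10. That address is 16-byte aligned but
not 32-byte aligned, as vmovaps with a ymm operand requires, hence the #GP.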
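
As a quick sanity check of that arithmetic, here is a minimal Python sketch.
The RBP value and the three displacements are copied from the oops and the
disassembly above; the helper name misalignment() is only illustrative and
is not anything from the kernel tree.

```python
# RBP value copied from the register dump in the oops.
RBP = 0xFFFF880213A25AC0

# vmovaps with a ymm operand needs a 32-byte aligned memory address.
YMM_ALIGN = 32

# Displacements of the three vmovaps stores listed above.
DISPLACEMENTS = (-0xB0, -0x90, -0x70)


def misalignment(addr, align=YMM_ALIGN):
    """Return addr modulo align; 0 means the address is aligned."""
    return addr % align


for disp in DISPLACEMENTS:
    addr = RBP + disp
    print(f"{disp:#x}(%rbp) = {addr:#018x}, "
          f"{misalignment(addr)} bytes past a {YMM_ALIGN}-byte boundary")
```

All three save slots come out 16 bytes past a 32-byte boundary, so the
first vmovaps is simply the first one to trip over it.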

The question is whether the #GP still occurs after
841e3604d35aa70d399146abdc526d8c89a2c2f5.

Before that commit, we manually saved and restored the floating point state
to the stack with the YMMS_{SAVE,RESTORE} macros. After that commit, we
use the kernel_fpu_{begin,end} routines. If the fault only occurs with the
old macros, it would seem GCC is ignoring our request to align the stack
variable to 32 bytes, and 841e3604d35aa70d399146abdc526d8c89a2c2f5 should
resolve the problem. If the fault still occurs after that commit, we will
need to investigate further.

Thanks.

-- 
Jim Kukunas
Intel Open Source Technology Center