2.6.24-rc6 reproducible raid5 hang

dean gaudet <dean@xxxxxxxxxx> · Thu, 27 Dec 2007 09:06:22 -0800 (PST)

hey neil -- remember that raid5 hang which only i and one or two others 
ever experienced and which was hard to reproduce?  we were debugging it 
well over a year ago (that box has 400+ days of uptime now, so at least 
that long ago :)  the workaround was to increase stripe_cache_size... i 
seem to have a way to reproduce something which looks much the same.

setup:

- 2.6.24-rc6
- system has 8GiB RAM but no swap
- 8x750GB in a raid5 with one spare, chunksize 1024KiB.
- mkfs.xfs default options
- mount -o noatime
- dd if=/dev/zero of=/mnt/foo bs=4k count=2621440
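
for reference, the setup above could be sketched roughly as the script below. it is a dry run that only prints the commands. the md device, mount point and member disk names are placeholders i made up for illustration, and the 7 active + 1 spare split is one reading of "8x750GB with one spare", so treat those as assumptions. the last line just works out how much the dd writes (more than the 8GiB of RAM).

```shell
#!/bin/sh
# dry-run sketch of the setup above: prints the commands instead of running them.
# MD, MNT and DISKS are placeholders (assumptions), not taken from the report.
MD=${MD:-/dev/md0}
MNT=${MNT:-/mnt}
DISKS=${DISKS:-"/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi"}

repro_cmds() {
    # assumption: 7 active members plus 1 hot spare; mdadm takes --chunk in KiB
    echo "mdadm --create $MD --level=5 --chunk=1024 --raid-devices=7 --spare-devices=1 $DISKS"
    echo "mkfs.xfs $MD"
    echo "mount -o noatime $MD $MNT"
    echo "dd if=/dev/zero of=$MNT/foo bs=4k count=2621440"
}

# bytes written by the dd: 4096-byte blocks * 2621440 blocks
dd_bytes() {
    echo $((4096 * 2621440))
}

repro_cmds
echo "dd writes $(dd_bytes) bytes ($(( $(dd_bytes) / 1024 / 1024 / 1024 )) GiB)"
```

nothing here touches a disk; the printed commands would have to be run by hand as root, and the mdadm line destroys whatever is on the listed devices.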

that sequence hangs for me within 10 seconds... and i can unhang / rehang 
it by toggling between stripe_cache_size 256 and 1024.  i detect the hang 
by watching "iostat -kx /dev/sd? 5".
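
the toggling can be sketched as a small helper like the one below. the sysfs path assumes the array is md0, which is a guess for illustration; the function names are made up too. the second helper estimates stripe cache memory on the understanding (per md(4)) that stripe_cache_size is counted in pages per device, with 4KiB pages assumed.

```shell
#!/bin/sh
# sketch: flip stripe_cache_size between 256 and 1024 to unhang / rehang.
# the default path assumes the array is md0 (an assumption, not from the report).
SCS=${SCS:-/sys/block/md0/md/stripe_cache_size}

toggle_stripe_cache() {
    # $1 = sysfs file; writes 1024 if it currently holds 256, otherwise 256
    cur=$(cat "$1")
    if [ "$cur" = "256" ]; then new=1024; else new=256; fi
    echo "$new" > "$1"
    echo "stripe_cache_size: $cur -> $new"
}

# rough stripe cache memory in KiB: page size * member disks * stripe_cache_size
# (4 KiB pages assumed)
stripe_cache_kib() {
    # $1 = stripe_cache_size, $2 = number of member disks
    echo $(( 4 * $2 * $1 ))
}

# usage (as root), with "iostat -kx /dev/sd? 5" running in another terminal:
#   toggle_stripe_cache "$SCS"
```

taking the file as an argument keeps the helper easy to try against a scratch file before pointing it at the real sysfs entry.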

i've attached the kernel log where i dumped task and timer state while it 
was hung... note that you'll see at some point i did an xfs mount with 
external journal but it happens with internal journal as well.

looks like it's using the raid456 module and async api.

anyhow let me know if you need more info / have any suggestions.

-dean

Attachment: config-2.6.24-rc6-neemlark1.bz2 (binary data)
Attachment: kern.log.bz2 (binary data)