Hi List -
First off, thanks for the great filesystem. Thus far it's been an
excellent performer for my needs both professionally and personally.
I have a situation/environment that is producing a kernel crash that may
be XFS-related. A colleague suggested I post to this list, as there may
be some interest in reproducing it.
Environment (current):
Fedora core 13 (kernel 2.6.34.7-56.fc13.i686)
xfsprogs-3.1.1-7.fc13.i686
RAID5 Controller: 3ware 9550-SXU-8LP 8-port SATA controller, 64-bit PCI-X.
The XFS filesystem in question is on a RAID5 array on this controller, made
up of 4 identical disks, 1.5TB each, 64k stripe (block device = /dev/sdb).
The setup:
ORIGINALLY, this was a 3-disk RAID5. I created the XFS filesystem with:
--> mkfs -t xfs /dev/sdb
All was well in use up to this point.
Next, I ADDED a 4th disk to the array and expanded the array in place,
an operation supported by this RAID device.
New usable size = 4.5T
Once completed, I grew the XFS filesystem with xfs_growfs to expand it
into the full size of the new array.
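(For reference, the grow step was a single command, roughly the
following - the mount point shown is illustrative, not necessarily the
exact path:)
--> xfs_growfs /myblkdevice
With no size argument, xfs_growfs grows the mounted filesystem to fill
the underlying block device.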
Again, all was well, for about a week of normal use - fairly heavy
copy/read/write operations on a regular basis.
Then, without any changes or warning (that I was aware of, at least), the
machine started crashing (kernel panic) any time I accessed (read/write)
MOST of the files in the filesystem. Some files could be accessed
without a problem. In general, though, any kind of high I/O (copying a
file (not moving) to the same device, copying to another block
device/disk, reading it across the network, etc.) now triggers the
condition: access proceeds normally for the first few MB (the exact
amount varies) and then the system locks up completely.
Most of the time, the system becomes unresponsive and must be rebooted
to gain access again. In some cases, though, system access returns on a
limited/choppy basis, and messages like "card reset" appear in the
message log.
Those observations led me to believe that perhaps this was simply a
flaky controller failing under heavy I/O.
However, several other tests/observations leave me wondering whether the
filesystem may be corrupt in some way that is not being detected by
xfs_repair.
Tests / Observations:
1. Mounted or unmounted, I can "dd" the array's block device (/dev/sdb)
all day long without a problem:
--> dd if=/dev/sdb of=/dev/null bs=(varied tests) result: end to end
no problem
--> dd if=/dev/sdb of=/tmp/test.file bs=(varies) result: no problem
(as long as test.file space permits..)
2. I can CREATE arbitrary NEW files on the filesystem, and copy/read
them OFF the device (disk-to-other-disk, disk-to-same-disk, copy across
the network, etc.), read them, delete them - NO CRASH.
--> dd if=/dev/zero of=/myblkdevice/test.file bs=1M count=1024 (create
an arbitrary 1GB file). All normal.
3. Copying/reading existing files (at least, files that existed at the
time I grew the array) seems to trigger the system crash. Copying/reading
the NEW files from #2 above does NOT trigger the crash.
4. Copying EXISTING files from other servers / locations on the
network, or other disks, to the device triggers the crash (i.e., would
be a NEW file being copied to the array, but not created ON the array).
5. Unmounted, xfs_repair -n /dev/sdb ---> finds no issues
6. Unmounted, xfs_repair /dev/sdb ---> finds no issues, performs no
changes.
Other Notes:
1. I did recently learn of the create-time and mount-time options
sunit/swidth for optimizing performance. Setting these had no effect
on this issue (example invocations follow these notes).
2. SOME files behave perfectly normally. I can copy them, read them,
etc. without a problem. But for the MOST part, MOST files and MOST file
operations seem to trigger the crash.
3. There is limited information in what I've been able to capture of the
kernel crash. Nothing really specific or repeatable (a different message
each time) - some instances mention "atomic" and "xfs", other times it
appears "irq"-related.
4. In general the crash seems to happen when I either:
a. Attempt to read any files larger than 100 MB or so
(small, single operations don't seem to have an effect, but strings of
small operations (unzipping a dir of files, for example) do).
b. Attempt to move or copy any data to the filesystem that didn't
ORIGINATE on the filesystem.
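(For reference, regarding Note 1: these are the option forms I was
referring to. The values assume this 4-disk RAID5 with its 64k stripe,
i.e. 3 data disks; the mount options take 512-byte sectors, so
64k = 128 sectors and 3 x 128 = 384. The mount point is illustrative:)
--> mount -o sunit=128,swidth=384 /dev/sdb /myblkdevice   (mount-time)
--> mkfs.xfs -d su=64k,sw=3 /dev/sdb                      (create-time)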
Questions:
1. Is it possible that my RAID expansion on the 3ware board brought on
some kind of corruption? If so, shouldn't xfs_repair detect it?
2. Are there any thoughts / patches / commands / debug options I might
try to resolve this?
3. Is this more likely a problem with the 3ware controller + XFS
combination?
The only recourse I've thought of is to completely wipe the array and
start from scratch with a fresh 4-disk array, and XFS filesystem
creation, then copy data back to it.
I can't leave this device in place in an unusable state very long - I
just thought this list might be interested in the conditions. Any
suggestions or thoughts would be greatly appreciated. Resolving this
would save me a good deal of time.
Shawn