Re: Problems with kernel 3.6.x (vm ?) (was : Is kernel 3.6.1 or filestreams option toxic ?)

Yann Dupont <Yann.Dupont@xxxxxxxxxxxxxx> · Thu, 25 Oct 2012 17:21:35 +0200

On 23/10/2012 10:24, Yann Dupont wrote:
On 22/10/2012 16:14, Yann Dupont wrote:

Hello. This mail is a follow-up to a message on the XFS mailing list. I 
had a hang with 3.6.1, and then damage on an XFS filesystem.

3.6.1 is not alone. I tried 3.6.2 and had another hang, with quite a 
different trace this time, so I'm not really sure the 2 problems are 
related.
Anyway, the problem may not be XFS itself; the damage may just be a 
consequence of what looks more like a kernel problem.

cc: to linux-kernel
Hello.
There is definitely something wrong in 3.6.x with XFS, in particular 
after an abrupt stop of the machine:

I now have corruption on a 3rd machine (not involved with ceph).
The machine was just rebooting from the 3.6.2 kernel to the 3.6.3 kernel.

This machine isn't under heavy load, but it's a machine we use for tests 
& compilations, and we often crash it. In 2 years we have had no problems: 
XFS was always reliable, even under hard conditions (hard reset, loss of 
power, etc.).

This time, after booting 3.6.3, one of my XFS volumes refuses to mount:

mount: /dev/mapper/LocalDisk-debug--git: can't read superblock

[276596.189363] XFS (dm-1): Mounting Filesystem
[276596.270614] XFS (dm-1): Starting recovery (logdev: internal)
[276596.711295] XFS (dm-1): xlog_recover_process_data: bad clientid 0x0
[276596.711329] XFS (dm-1): log mount/recovery failed: error 5
[276596.711516] XFS (dm-1): log mount failed

I'm not even sure whether the reboot followed a crash or was just a clean 
reboot (I'm not the only one using this machine). I see nothing suspicious 
in my remote syslog.

Anyway, it's the 3rd crashed XFS volume in a row with a 3.6 kernel. 
Different machines, different contexts. Looks suspicious.

This time the crashed volume was handled by a PERC (mptsas) card. The 2 
other volumes previously reported were handled by an Emulex LightPulse 
Fibre Channel card (lpfc), and this time the filestreams option wasn't used.

xfs_repair -n seems to show the volume is quite broken:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - scan filesystem freespace and inode maps...
block (1,6197-6197) multiply claimed by bno space tree, state - 2
bad magic # 0x7f454c46 in btbno block 3/2320
expected level 0 got 513 in btbno block 3/2320
bad btree nrecs (256, min=255, max=510) in btbno block 3/2320
invalid start block 16793088 in record 0 of bno btree block 3/2320
invalid start block 0 in record 1 of bno btree block 3/2320
invalid start block 0 in record 2 of bno btree block 3/2320
invalid start block 2282029056 in record 3 of bno btree block 3/2320
invalid start block 0 in record 4 of bno btree block 3/2320
invalid length 218106368 in record 5 of bno btree block 3/2320
invalid start block 1684369509 in record 6 of bno btree block 3/2320
invalid start block 6909556 in record 7 of bno btree block 3/2320
invalid start block 1493202533 in record 8 of bno btree block 3/2320
invalid start block 1768111411 in record 9 of bno btree block 3/2320
invalid start block 761557865 in record 10 of bno btree block 3/2320
invalid start block 842084400 in record 11 of bno btree block 3/2320
...
bad magic # 0x41425442 in btcnt block 2/14832
bad btree nrecs (436, min=255, max=510) in btcnt block 2/14832
out-of-order cnt btree record 2 (188545 1) block 2/14832
out-of-order cnt btree record 3 (188650 1) block 2/14832
out-of-order cnt btree record 4 (188658 1) block 2/14832
out-of-order cnt btree record 8 (189021 1) block 2/14832
out-of-order cnt btree record 9 (189104 1) block 2/14832
out-of-order cnt btree record 10 (189127 2) block 2/14832
out-of-order cnt btree record 11 (189193 2) block 2/14832
out-of-order cnt btree record 12 (189259 2) block 2/14832
out-of-order cnt btree record 13 (189268 1) block 2/14832
out-of-order cnt btree record 14 (189307 1) block 2/14832
out-of-order cnt btree record 15 (189330 1) block 2/14832
out-of-order cnt btree record 16 (189379 1) block 2/14832
out-of-order cnt btree record 18 (189477 1) block 2/14832
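
(Editorial aside, not part of the original report.) The two "bad magic" values above are easier to interpret as bytes than as integers. A minimal sketch, assuming XFS metadata is stored big-endian and that "ABTB" / "ABTC" are the magic strings of the by-block and by-size free-space btrees; the function names and the lookup table are hypothetical and only for illustration:

```python
import struct

# Small lookup table of 4-byte magics relevant to the output above.
# The descriptions are informal labels, not official names.
KNOWN_MAGICS = {
    b"ABTB": "XFS free-space btree indexed by block number (bno)",
    b"ABTC": "XFS free-space btree indexed by extent size (cnt)",
    b"\x7fELF": "ELF file header",
}


def decode_magic(value):
    """Return the 4 raw bytes of a 32-bit magic number, big-endian."""
    return struct.pack(">I", value)


def describe_magic(value):
    """Return an informal description of the magic, or 'unknown'."""
    return KNOWN_MAGICS.get(decode_magic(value), "unknown")


# The two values reported by xfs_repair -n above.
for magic in (0x7F454C46, 0x41425442):
    print(hex(magic), decode_magic(magic), describe_magic(magic))
```

Read this way, 0x7f454c46 is the byte sequence "\x7fELF", which suggests that btbno block 3/2320 holds file data rather than btree metadata, and 0x41425442 is "ABTB", which would mean btcnt block 2/14832 carries the magic of the other free-space tree. This is only a reading of the numbers, not a diagnosis.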

I won't try to repair this volume right now.

This time, the volume is small enough to image (it's a 100 GB LVM 
volume). I'll try to image it before doing anything else.
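
(Editorial aside, not part of the original report.) Imaging before any repair attempt can be as simple as a raw chunked copy with a checksum, so that later experiments run against a verifiable copy. A minimal sketch, assuming the volume is not mounted and the destination has enough free space; the function name and the destination path are hypothetical:

```python
import hashlib


def image_device(src_path, dst_path, chunk_size=4 * 1024 * 1024):
    """Copy src_path to dst_path in chunks.

    Returns (bytes_copied, sha256_hex) so the image can be verified later.
    """
    digest = hashlib.sha256()
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # end of device or file
                break
            dst.write(chunk)
            digest.update(chunk)
            copied += len(chunk)
    return copied, digest.hexdigest()


# Example call (needs read access to the device; destination is hypothetical):
# size, checksum = image_device("/dev/mapper/LocalDisk-debug--git",
#                               "/backup/debug-git.img")
```

This sketch stops at the first read error instead of skipping bad sectors, so it only suits a device that still reads cleanly.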

1st question: I saw that ext4 corruption has been reported with the 3.6 
kernel too, but as far as I can see, that problem seems to be jbd related, 
so it shouldn't affect XFS?

2nd question: Am I the only one seeing this? I saw problems reported 
with 2.6.37, but here the kernel is 3.6.x.

3rd question: If you suspect the problem may lie in XFS, what 
should I supply to help debug it?

Not CC:ing the linux-kernel list right now, as I'm really not sure where 
the problem is.

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@xxxxxxxxxxxxxx
