On 25/10/2012 23:10, Dave Chinner wrote:
>> This time, after a 3.6.3 boot, one of my xfs volumes refuses to mount:
>> mount: /dev/mapper/LocalDisk-debug--git: can't read superblock
>> [276596.189363] XFS (dm-1): Mounting Filesystem
>> [276596.270614] XFS (dm-1): Starting recovery (logdev: internal)
>> [276596.711295] XFS (dm-1): xlog_recover_process_data: bad clientid 0x0
>> [276596.711329] XFS (dm-1): log mount/recovery failed: error 5
>> [276596.711516] XFS (dm-1): log mount failed
> That's an indication that zeros are being read from the journal
> rather than valid transaction data. It may well be caused by an XFS
> bug, but from experience it is equally likely to be a lower layer
> storage problem. More information is needed.
Hello Dave, did you see my next mail? The fact is that with 3.4.15, the
journal is OK and the data is, in fact, intact.
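If it happens again, one way to check the zeros-in-the-journal hypothesis directly might be something like the following sketch (assuming xfs_db and dd are available; the device path is from this report, and the placeholder values in the comments are mine and must be filled in from the xfs_db output):

```shell
# Sketch only: check whether the on-disk log region reads back as zeros.
# Run read-only, against the unmounted volume.
DEV=/dev/mapper/LocalDisk-debug--git

# Print the log start (as an fsblock) and log length from the superblock.
xfs_db -r -c 'sb 0' -c 'print blocksize logstart logblocks' "$DEV"

# Convert the printed fsblock number to a 512-byte disk address, e.g.:
#   xfs_db -r -c 'convert fsblock <logstart> daddr' "$DEV"
# then dump the beginning of the log region and look for all-zero bytes:
#   dd if="$DEV" bs=512 skip=<daddr> count=64 2>/dev/null | hexdump -C
# A solid run of 00 bytes there would point at the storage layers.
```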
> Firstly:
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
OK, sorry I missed it: here is the information. Not sure all of it is
relevant, but here we go.
Each time I will distinguish between the first reported crashes (the
ceph nodes) and the last one, as the setups are quite different.
--------
kernel version (uname -a): 3.6.1 then 3.6.2, vanilla, hand compiled, no
proprietary modules. Not running it at the moment, so I can't give you
the exact uname -a output.
------------
xfs_repair version 3.1.7 on the third machine,
xfs_repair version 3.1.4 on the first two machines (the ceph nodes)
-----------
cpu: the same for all 3 machines: Dell PowerEdge M610,
2x Intel(R) Xeon(R) CPU E5649 @ 2.53GHz, hyper-threading
enabled (12 physical cores, 24 logical cores)
-------------
meminfo :
for example, on the 3rd machine :
MemTotal: 41198292 kB
MemFree: 28623116 kB
Buffers: 1056 kB
Cached: 10392452 kB
SwapCached: 0 kB
Active: 180528 kB
Inactive: 10227416 kB
Active(anon): 17476 kB
Inactive(anon): 180 kB
Active(file): 163052 kB
Inactive(file): 10227236 kB
Unevictable: 3744 kB
Mlocked: 3744 kB
SwapTotal: 506040 kB
SwapFree: 506040 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 18228 kB
Mapped: 12688 kB
Shmem: 300 kB
Slab: 1408204 kB
SReclaimable: 1281008 kB
SUnreclaim: 127196 kB
KernelStack: 1976 kB
PageTables: 2736 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 21105184 kB
Committed_AS: 136080 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 398608 kB
VmallocChunk: 34337979376 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7652 kB
DirectMap2M: 2076672 kB
DirectMap1G: 39845888 kB
----
/proc/mounts:
root@label5:~# cat /proc/mounts
rootfs / rootfs rw 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,relatime,size=20592788k,nr_inodes=5148197,mode=755 0 0
devpts /dev/pts devpts
rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=4119832k,mode=755 0 0
/dev/mapper/LocalDisk-root / xfs rw,relatime,attr2,noquota 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,relatime,size=8239660k 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /run/shm tmpfs rw,nosuid,nodev,relatime,size=8239660k 0 0
/dev/sda1 /boot ext2 rw,relatime,errors=continue 0 0
** /dev/mapper/LocalDisk-debug--git /mnt/debug-git xfs
rw,relatime,attr2,noquota 0 0 ** this is the one that was failing on 3.6.x
configfs /sys/kernel/config configfs rw,relatime 0 0
ocfs2_dlmfs /dlm ocfs2_dlmfs rw,relatime 0 0
rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
nfsd /proc/fs/nfsd nfsd rw,relatime 0 0
This volume is on a RAID1 local disk.
On one of the first 2 nodes:
root@hanyu:~# cat /proc/mounts
rootfs / rootfs rw 0 0
none /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
none /proc proc rw,nosuid,nodev,noexec,relatime 0 0
none /dev devtmpfs rw,relatime,size=20592652k,nr_inodes=5148163,mode=755 0 0
none /dev/pts devpts
rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
/dev/disk/by-uuid/37dd603c-168c-49de-830d-ef1b5c6982f8 / xfs
rw,relatime,attr2,noquota 0 0
tmpfs /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
/dev/sdk1 /boot ext2 rw,relatime,errors=continue 0 0
none /var/local/cgroup cgroup
rw,relatime,net_cls,freezer,devices,memory,cpuacct,cpu,debug,cpuset 0 0
** /dev/mapper/xceph--hanyu-data /XCEPH-PROD/data xfs
rw,noatime,attr2,filestreams,nobarrier,inode64,logbsize=256k,noquota 0 0
** This one was the failed volume
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
Please note that on this server, nobarrier is used because the volume is
on a battery-backed fibre channel raid array.
--------------
/proc/partitions :
quite complicated on the ceph node :
root@hanyu:~# cat /proc/partitions
major minor #blocks name
11 0 1048575 sr0
8 32 6656000000 sdc
8 48 5063483392 sdd
8 64 6656000000 sde
8 80 5063483392 sdf
8 96 6656000000 sdg
8 112 5063483392 sdh
8 128 6656000000 sdi
8 144 5063483392 sdj
8 160 292421632 sdk
8 161 273073 sdk1
8 162 530145 sdk2
8 163 2369587 sdk3
8 164 289242292 sdk4
254 0 6656000000 dm-0
254 1 5063483392 dm-1
254 2 5242880 dm-2
254 3 11676106752 dm-3
Please note that we use multipath here, with 4 paths per LUN:
root@hanyu:~# multipath -ll
mpath2 (3600d02310006674500000001414d677d) dm-1 IFT,S16F-R1840-4
size=4.7T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=100 status=active
| |- 0:0:1:96 sdf 8:80 active ready running
| `- 6:0:1:96 sdj 8:144 active ready running
`-+- policy='round-robin 0' prio=20 status=enabled
|- 0:0:0:96 sdd 8:48 active ready running
`- 6:0:0:96 sdh 8:112 active ready running
mpath1 (3600d02310006674500000000414d677d) dm-0 IFT,S16F-R1840-4
size=6.2T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=100 status=active
| |- 0:0:1:32 sde 8:64 active ready running
| `- 6:0:1:32 sdi 8:128 active ready running
`-+- policy='round-robin 0' prio=20 status=enabled
|- 0:0:0:32 sdc 8:32 active ready running
`- 6:0:0:32 sdg 8:96 active ready running
On the 3rd machine, the setup is much simpler:
root@label5:~# cat /proc/partitions
major minor #blocks name
8 0 292421632 sda
8 1 257008 sda1
8 2 506047 sda2
8 3 1261102 sda3
8 4 140705302 sda4
254 0 2609152 dm-0
254 1 104857600 dm-1
254 2 31457280 dm-2
--------------
raid layout :
On the first 2 machines (part of ceph cluster), the data is on Raid5 on
a fibre channel raid array, accessed by emulex fibre channel
(lightpulse, lpfc)
On the 3rd, data is on Raid1 accessed by Dell Perc (LSI Logic / Symbios
Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08) driver mptsas)
--------------
LVM config :
root@hanyu:~# vgs
VG #PV #LV #SN Attr VSize VFree
LocalDisk 1 1 0 wz--n- 275,84g 270,84g
xceph-hanyu 2 1 0 wz--n- 10,91t 41,36g
root@hanyu:~# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
log LocalDisk -wi-a- 5,00g
data xceph-hanyu -wi-ao 10,87t
and
root@label5:~# vgs
VG #PV #LV #SN Attr VSize VFree
LocalDisk 1 3 0 wz--n- 134,18g 1,70g
root@label5:~# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
1 LocalDisk -wi-a- 30,00g
debug-git LocalDisk -wi-ao 100,00g
root LocalDisk -wi-ao 2,49g
root@label5:~#
-------------------
type of disks:
on the raid array, I'd say it is not very important (SEAGATE
ST32000444SS nearline SAS 2TB)
on the 3rd machine: TOSHIBA MBF2300RC DA06
---------------------
write cache status:
on the raid array, the write cache is enabled globally for the array
BUT is explicitly disabled on the drives.
On the 3rd machine it is disabled, as far as I know.
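For the record, the per-drive setting can be double-checked; a sketch, assuming sdparm is installed (the device names here are examples, not the actual ones):

```shell
# Query the WCE (write cache enable) bit on a SAS drive; 0 means the
# drive-level write cache is disabled, as intended here.
sdparm --get=WCE /dev/sda
# For SATA/ATA drives, hdparm can report the same thing:
# hdparm -W /dev/sda
```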
-------------------
Size of BBWC : 2 or 4 GB on raid arrays. None on the 3rd.
------------------
xfs_info :
root@hanyu:~# xfs_info /dev/xceph-hanyu/data
meta-data=/dev/mapper/xceph--hanyu-data isize=256 agcount=11,
agsize=268435455 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=2919026688, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
(no sunit or swidth on this one)
root@label5:~# xfs_info /dev/LocalDisk/debug-git
meta-data=/dev/mapper/LocalDisk-debug--git isize=256 agcount=4,
agsize=6553600 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=26214400, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=12800, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
-----
dmesg: you already have the information.
For iostat, etc., I need to try to reproduce the load.
> Secondly, is the system still in this state? If so, dump the log to
> a file using xfs_logprint, zip it up and send it to me so I can have
> a look at whether the log is intact (i.e. likely xfs bug) or contains
> zeros (likely storage bug).
No. The first 2 nodes have been xfs_repaired. One repair completed, and
the result was a terrible mess.
On the second, xfs_repair segfaulted; I will try with a newer xfs_repair
on a 3.4 kernel.
The 3rd one is now OK, after booting a 3.4 kernel.
> If the system is not still in this state, then I'm afraid there's
> nothing that can be done to understand the problem.
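For next time, the requested dump would presumably look like this (a sketch; the device path is taken from this report, the output filename is mine, and `-C` copies the on-disk log to a file instead of printing it):

```shell
# Copy the raw on-disk log to a file and compress it for mailing.
# Run against the unmounted (unrecovered) filesystem.
xfs_logprint -C /tmp/dm1-log.bin /dev/mapper/LocalDisk-debug--git
gzip /tmp/dm1-log.bin
# Plain xfs_logprint prints a human-readable decode of the log instead:
# xfs_logprint /dev/mapper/LocalDisk-debug--git | head -50
```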
I'll try to reproduce a similar problem.
> You've had two machines crash with problems in the mm subsystem, and
> one filesystem problem that might be hardware related. Bit early to
> be blaming XFS for all your problems, I think....
I'm not trying to blame XFS. I have been very confident in it for a long
time. BUT I see very different behaviour in those 3 cases. Nothing
conclusive yet. I think the problem is related to kernel 3.6, maybe in
the dm layer.
I don't think it's hardware related: different disks, different
controllers, different machines.
The common points are:
- XFS
- kernel 3.6.x
- device mapper + LVM
>> xfs_repair -n seems to show the volume is quite broken:
> Sure, if the log hasn't been replayed then it will be - the
> filesystem will only be consistent after log recovery has been run.
Yes, but I have had to use xfs_repair -L in the past (power outages,
hardware failures) and never had such disastrous repairs.
At least for the first 2 failures I can understand it: there is lots of
data, the journal is BIG, and the number of I/O transactions in flight
is quite high.
About the 3rd failure I'm very skeptical: low I/O load, small volume.
> You should report the mm problems to linux-mm@xxxxxxxxx to make sure
> the right people see them and they don't get lost in the noise of
> lkml....
Yes, point taken.
I'll now try to reproduce this kind of behaviour on a very small volume
(10 GB for example) so I can confirm or rule out the scenario above.
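The plan, roughly, using standard LVM/XFS commands (the LV name and mountpoint are hypothetical; the VG name is from this report):

```shell
# Create a small 10 GB test volume on the existing VG, make an XFS
# filesystem on it, and mount it.
lvcreate -L 10G -n xfstest LocalDisk
mkfs.xfs /dev/LocalDisk/xfstest
mkdir -p /mnt/xfstest
mount /dev/LocalDisk/xfstest /mnt/xfstest
# ...generate load, then crash / power-cycle and retry the mount under
# a 3.6.x kernel to see whether log recovery fails the same way.
```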
Thanks for your time,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel: 02.53.48.49.20 - Mail/Jabber: Yann.Dupont@xxxxxxxxxxxxxx
_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs