On 25/10/2012 23:10, Dave Chinner wrote:
>> This time, after a 3.6.3 boot, one of my xfs volumes refuses to mount:
>> mount: /dev/mapper/LocalDisk-debug--git: can't read superblock
>> [276596.189363] XFS (dm-1): Mounting Filesystem
>> [276596.270614] XFS (dm-1): Starting recovery (logdev: internal)
>> [276596.711295] XFS (dm-1): xlog_recover_process_data: bad clientid 0x0
>> [276596.711329] XFS (dm-1): log mount/recovery failed: error 5
>> [276596.711516] XFS (dm-1): log mount failed
> That's an indication that zeros are being read from the journal
> rather than valid transaction data. It may well be caused by an XFS
> bug, but from experience it is equally likely to be a lower layer
> storage problem. More information is needed.
Hello Dave, did you see my next mail? The fact is that with 3.4.15, the
journal is OK and the data is, in fact, intact.
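If it happens again, one way to check the zeros-in-the-journal hypothesis directly might be something like the following sketch (assuming xfs_db and dd are available; the device path is from this report, and the placeholder values in the comments are mine and must be filled in from the xfs_db output):

```shell
# Sketch only: check whether the on-disk log region reads back as zeros.
# Run read-only, against the unmounted volume.
DEV=/dev/mapper/LocalDisk-debug--git

# Print the log start (as an fsblock) and log length from the superblock.
xfs_db -r -c 'sb 0' -c 'print blocksize logstart logblocks' "$DEV"

# Convert the printed fsblock number to a 512-byte disk address, e.g.:
#   xfs_db -r -c 'convert fsblock <logstart> daddr' "$DEV"
# then dump the beginning of the log region and look for all-zero bytes:
#   dd if="$DEV" bs=512 skip=<daddr> count=64 2>/dev/null | hexdump -C
# A solid run of 00 bytes there would point at the storage layers.
```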
> Firstly:
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
OK, sorry I missed it: here is the information. Not sure all of it is
relevant, but here we go.
Each time I will distinguish between the first reported crashes (the
ceph nodes) and the last one, as the setups are quite different.
--------
kernel version (uname -a): 3.6.1 then 3.6.2, vanilla, hand compiled, no
proprietary modules. Not running it at the moment, so I can't give you
the exact uname -a output.
------------
xfs_repair version 3.1.7 on the third machine,
xfs_repair version 3.1.4 on the first two machines (the ceph nodes)
-----------
cpu: the same for all 3 machines: Dell PowerEdge M610,
2x Intel(R) Xeon(R) CPU E5649 @ 2.53GHz, hyper-threading
enabled (12 physical cores, 24 logical cores)
-------------
meminfo :
for example, on the 3rd machine :
MemTotal: 41198292 kB
MemFree: 28623116 kB
Buffers: 1056 kB
Cached: 10392452 kB
SwapCached: 0 kB
Active: 180528 kB
Inactive: 10227416 kB
Active(anon): 17476 kB
Inactive(anon): 180 kB
Active(file): 163052 kB
Inactive(file): 10227236 kB
Unevictable: 3744 kB
Mlocked: 3744 kB
SwapTotal: 506040 kB
SwapFree: 506040 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 18228 kB
Mapped: 12688 kB
Shmem: 300 kB
Slab: 1408204 kB
SReclaimable: 1281008 kB
SUnreclaim: 127196 kB
KernelStack: 1976 kB
PageTables: 2736 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 21105184 kB
Committed_AS: 136080 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 398608 kB
VmallocChunk: 34337979376 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7652 kB
DirectMap2M: 2076672 kB
DirectMap1G: 39845888 kB
----
/proc/mounts:
root@label5:~# cat /proc/mounts
rootfs / rootfs rw 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,relatime,size=20592788k,nr_inodes=5148197,mode=755 0 0
devpts /dev/pts devpts
rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=4119832k,mode=755 0 0
/dev/mapper/LocalDisk-root / xfs rw,relatime,attr2,noquota 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,relatime,size=8239660k 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /run/shm tmpfs rw,nosuid,nodev,relatime,size=8239660k 0 0
/dev/sda1 /boot ext2 rw,relatime,errors=continue 0 0
** /dev/mapper/LocalDisk-debug--git /mnt/debug-git xfs
rw,relatime,attr2,noquota 0 0 ** this is the one that was failing on 3.6.x
configfs /sys/kernel/config configfs rw,relatime 0 0
ocfs2_dlmfs /dlm ocfs2_dlmfs rw,relatime 0 0
rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
nfsd /proc/fs/nfsd nfsd rw,relatime 0 0
This volume is on a RAID1 local disk.
On one of the first 2 nodes:
root@hanyu:~# cat /proc/mounts
rootfs / rootfs rw 0 0
none /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
none /proc proc rw,nosuid,nodev,noexec,relatime 0 0
none /dev devtmpfs rw,relatime,size=20592652k,nr_inodes=5148163,mode=755 0 0
none /dev/pts devpts
rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
/dev/disk/by-uuid/37dd603c-168c-49de-830d-ef1b5c6982f8 / xfs
rw,relatime,attr2,noquota 0 0
tmpfs /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
/dev/sdk1 /boot ext2 rw,relatime,errors=continue 0 0
none /var/local/cgroup cgroup
rw,relatime,net_cls,freezer,devices,memory,cpuacct,cpu,debug,cpuset 0 0
** /dev/mapper/xceph--hanyu-data /XCEPH-PROD/data xfs
rw,noatime,attr2,filestreams,nobarrier,inode64,logbsize=256k,noquota 0 0
** This one was the failed volume
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
Please note that on this server, nobarrier is used because the volume is
on a battery-backed fibre channel raid array.
--------------
/proc/partitions :
quite complicated on the ceph node :
root@hanyu:~# cat /proc/partitions
major minor #blocks name
11 0 1048575 sr0
8 32 6656000000 sdc
8 48 5063483392 sdd
8 64 6656000000 sde
8 80 5063483392 sdf
8 96 6656000000 sdg
8 112 5063483392 sdh
8 128 6656000000 sdi
8 144 5063483392 sdj
8 160 292421632 sdk
8 161 273073 sdk1
8 162 530145 sdk2
8 163 2369587 sdk3
8 164 289242292 sdk4
254 0 6656000000 dm-0
254 1 5063483392 dm-1
254 2 5242880 dm-2
254 3 11676106752 dm-3
Please note that we use multipath here, with 4 paths per LUN:
root@hanyu:~# multipath -ll
mpath2 (3600d02310006674500000001414d677d) dm-1 IFT,S16F-R1840-4
size=4.7T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=100 status=active
| |- 0:0:1:96 sdf 8:80 active ready running
| `- 6:0:1:96 sdj 8:144 active ready running
`-+- policy='round-robin 0' prio=20 status=enabled
|- 0:0:0:96 sdd 8:48 active ready running
`- 6:0:0:96 sdh 8:112 active ready running
mpath1 (3600d02310006674500000000414d677d) dm-0 IFT,S16F-R1840-4
size=6.2T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=100 status=active
| |- 0:0:1:32 sde 8:64 active ready running
| `- 6:0:1:32 sdi 8:128 active ready running
`-+- policy='round-robin 0' prio=20 status=enabled
|- 0:0:0:32 sdc 8:32 active ready running
`- 6:0:0:32 sdg 8:96 active ready running
On the 3rd machine, the setup is much simpler:
root@label5:~# cat /proc/partitions
major minor #blocks name
8 0 292421632 sda
8 1 257008 sda1
8 2 506047 sda2
8 3 1261102 sda3
8 4 140705302 sda4
254 0 2609152 dm-0
254 1 104857600 dm-1
254 2 31457280 dm-2
--------------
raid layout :
On the first 2 machines (part of ceph cluster), the data is on Raid5 on
a fibre channel raid array, accessed by emulex fibre channel
(lightpulse, lpfc)
On the 3rd, data is on Raid1 accessed by Dell Perc (LSI Logic / Symbios
Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08) driver mptsas)
--------------
LVM config :
root@hanyu:~# vgs
VG #PV #LV #SN Attr VSize VFree
LocalDisk 1 1 0 wz--n- 275,84g 270,84g
xceph-hanyu 2 1 0 wz--n- 10,91t 41,36g
root@hanyu:~# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
log LocalDisk -wi-a- 5,00g
data xceph-hanyu -wi-ao 10,87t
and
root@label5:~# vgs
VG #PV #LV #SN Attr VSize VFree
LocalDisk 1 3 0 wz--n- 134,18g 1,70g
root@label5:~# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
1 LocalDisk -wi-a- 30,00g
debug-git LocalDisk -wi-ao 100,00g
root LocalDisk -wi-ao 2,49g
root@label5:~#
-------------------
type of disks:
on the raid array, I'd say it is not very important (SEAGATE
ST32000444SS nearline SAS 2TB)
on the 3rd machine: TOSHIBA MBF2300RC DA06
---------------------
write cache status:
on the raid array, the write cache is enabled globally for the array
BUT is explicitly disabled on the drives.
On the 3rd machine it is disabled, as far as I know.
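For the record, the per-drive setting can be double-checked; a sketch, assuming sdparm is installed (the device names here are examples, not the actual ones):

```shell
# Query the WCE (write cache enable) bit on a SAS drive; 0 means the
# drive-level write cache is disabled, as intended here.
sdparm --get=WCE /dev/sda
# For SATA/ATA drives, hdparm can report the same thing:
# hdparm -W /dev/sda
```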
-------------------
Size of BBWC : 2 or 4 GB on raid arrays. None on the 3rd.
------------------
xfs_info :
root@hanyu:~# xfs_info /dev/xceph-hanyu/data
meta-data=/dev/mapper/xceph--hanyu-data isize=256 agcount=11,
agsize=268435455 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=2919026688, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
(no sunit or swidth on this one)
root@label5:~# xfs_info /dev/LocalDisk/debug-git
meta-data=/dev/mapper/LocalDisk-debug--git isize=256 agcount=4,
agsize=6553600 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=26214400, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=12800, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
-----
dmesg: you already have the information.
For iostat, etc., I need to try to reproduce the load.
> Secondly, is the system still in this state? If so, dump the log to
> a file using xfs_logprint, zip it up and send it to me so I can have
> a look at whether the log is intact (i.e. likely xfs bug) or contains
> zeros (likely storage bug).
No. The first 2 nodes have been xfs_repaired. One repair completed, and
the result was a terrible mess.
On the second, xfs_repair segfaulted; I will try with a newer xfs_repair
on a 3.4 kernel.
The 3rd one is now OK, after booting a 3.4 kernel.
> If the system is not still in this state, then I'm afraid there's
> nothing that can be done to understand the problem.
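For next time, the requested dump would presumably look like this (a sketch; the device path is taken from this report, the output filename is mine, and `-C` copies the on-disk log to a file instead of printing it):

```shell
# Copy the raw on-disk log to a file and compress it for mailing.
# Run against the unmounted (unrecovered) filesystem.
xfs_logprint -C /tmp/dm1-log.bin /dev/mapper/LocalDisk-debug--git
gzip /tmp/dm1-log.bin
# Plain xfs_logprint prints a human-readable decode of the log instead:
# xfs_logprint /dev/mapper/LocalDisk-debug--git | head -50
```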
I'll try to reproduce a similar problem.
> You've had two machines crash with problems in the mm subsystem, and
> one filesystem problem that might be hardware related. Bit early to
> be blaming XFS for all your problems, I think....
I'm not trying to blame XFS. I have been very confident in it for a long
time. BUT I see very different behaviour in those 3 cases. Nothing
conclusive yet. I think the problem is related to kernel 3.6, maybe in
the dm layer.
I don't think it's hardware related: different disks, different
controllers, different machines.
The common points are:
- XFS
- kernel 3.6.x
- device mapper + LVM
>> xfs_repair -n seems to show the volume is quite broken:
> Sure, if the log hasn't been replayed then it will be - the
> filesystem will only be consistent after log recovery has been run.
Yes, but I have had to use xfs_repair -L in the past (power outages,
hardware failures) and never had such disastrous repairs.
At least for the first 2 failures I can understand it: there is lots of
data, the journal is BIG, and the number of I/O transactions in flight
is quite high.
About the 3rd failure I'm very skeptical: low I/O load, small volume.
> You should report the mm problems to linux-mm@xxxxxxxxx to make sure
> the right people see them and they don't get lost in the noise of
> lkml....
Yes, point taken.
I'll now try to reproduce this kind of behaviour on a very small volume
(10 GB for example) so I can confirm or rule out the scenario above.
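The plan, roughly, using standard LVM/XFS commands (the LV name and mountpoint are hypothetical; the VG name is from this report):

```shell
# Create a small 10 GB test volume on the existing VG, make an XFS
# filesystem on it, and mount it.
lvcreate -L 10G -n xfstest LocalDisk
mkfs.xfs /dev/LocalDisk/xfstest
mkdir -p /mnt/xfstest
mount /dev/LocalDisk/xfstest /mnt/xfstest
# ...generate load, then crash / power-cycle and retry the mount under
# a 3.6.x kernel to see whether log recovery fails the same way.
```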
Thanks for your time,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel: 02.53.48.49.20 - Mail/Jabber: Yann.Dupont@xxxxxxxxxxxxxx
_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs