Re: XFS Metadata corruption while activating OSD

On 12 March 2018, at 9:49 AM, Christian Wuerdig <christian.wuerdig@xxxxxxxxx> wrote:

Hm, so you're running OSD nodes with 2GB of RAM and 2x10TB = 20TB of storage? Literally everything posted on this list in relation to HW requirements and related problems will tell you that this simply isn't going to work. The slightest hint of a problem will simply kill the OSD nodes with OOM. Have you tried smaller disks, such as 1TB models (or even smaller, like 256GB SSDs), to see if the same problem persists?

Thank you for your reply, and sorry for my late response.
You are right: when the backend was BlueStore, the OSD nodes hit OOM from time to time.
We will now upgrade our hardware to see whether that avoids the OOM.
Also, after we upgraded the kernel from 4.4.39 to 4.4.120, the XFS error during OSD activation seems to be fixed.



On Tue, 6 Mar 2018 at 10:51, 赵赵贺东 <zhaohedong@xxxxxxxxx> wrote:
Hello ceph-users,

This is a really tough problem for our team.
We have investigated it for a long time and tried many things, but we cannot solve it; even the concrete cause is still unclear to us.
Any solution, suggestion, or opinion will be highly appreciated!

Problem Summary:
When we activate an OSD, metadata corruption occurs on the disk being activated. It happens 100% of the time.

Admin nodes & MON node:
Platform: X86
OS: Ubuntu 16.04
Kernel: 4.12.0
Ceph: Luminous 12.2.2

OSD nodes:
Platform: armv7
OS:       Ubuntu 14.04
Kernel:   4.4.39
Ceph: Luminous 12.2.2
Disk: 10 TB + 10 TB
Memory: 2GB

Deploy log:


dmesg log: (Sorry, the dmesg log from arms001-01 has been lost, but the metadata corruption messages on arms003-10 are the same as those on arms001-01.)
Mar  5 11:08:49 arms003-10 kernel: [  252.534232] XFS (sda1): Unmount and run xfs_repair
Mar  5 11:08:49 arms003-10 kernel: [  252.539100] XFS (sda1): First 64 bytes of corrupted metadata buffer:
Mar  5 11:08:49 arms003-10 kernel: [  252.545504] eb82f000: 58 46 53 42 00 00 10 00 00 00 00 00 91 73 fe fb  XFSB.........s..
Mar  5 11:08:49 arms003-10 kernel: [  252.553569] eb82f010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar  5 11:08:49 arms003-10 kernel: [  252.561624] eb82f020: fc 4e e3 89 50 8f 42 aa be bc 07 0c 6e fa 83 2f  .N..P.B.....n../
Mar  5 11:08:49 arms003-10 kernel: [  252.569706] eb82f030: 00 00 00 00 80 00 00 07 ff ff ff ff ff ff ff ff  ................
Mar  5 11:08:49 arms003-10 kernel: [  252.577778] XFS (sda1): metadata I/O error: block 0x48b9ff80 ("xfs_trans_read_buf_map") error 117 numblks 8
Mar  5 11:08:49 arms003-10 kernel: [  252.602944] XFS (sda1): Metadata corruption detected at xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data block 0x48b9ff80
Mar  5 11:08:49 arms003-10 kernel: [  252.614170] XFS (sda1): Unmount and run xfs_repair
Mar  5 11:08:49 arms003-10 kernel: [  252.619030] XFS (sda1): First 64 bytes of corrupted metadata buffer:
Mar  5 11:08:49 arms003-10 kernel: [  252.625403] eb901000: 58 46 53 42 00 00 10 00 00 00 00 00 91 73 fe fb  XFSB.........s..
Mar  5 11:08:49 arms003-10 kernel: [  252.633441] eb901010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar  5 11:08:49 arms003-10 kernel: [  252.641474] eb901020: fc 4e e3 89 50 8f 42 aa be bc 07 0c 6e fa 83 2f  .N..P.B.....n../
Mar  5 11:08:49 arms003-10 kernel: [  252.649519] eb901030: 00 00 00 00 80 00 00 07 ff ff ff ff ff ff ff ff  ................
Mar  5 11:08:49 arms003-10 kernel: [  252.657554] XFS (sda1): metadata I/O error: block 0x48b9ff80 ("xfs_trans_read_buf_map") error 117 numblks 8
Mar  5 11:08:49 arms003-10 kernel: [  252.675056] XFS (sda1): Metadata corruption detected at xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data block 0x48b9ff80
Mar  5 11:08:49 arms003-10 kernel: [  252.686228] XFS (sda1): Unmount and run xfs_repair
Mar  5 11:08:49 arms003-10 kernel: [  252.691054] XFS (sda1): First 64 bytes of corrupted metadata buffer:
Mar  5 11:08:49 arms003-10 kernel: [  252.697425] eb901000: 58 46 53 42 00 00 10 00 00 00 00 00 91 73 fe fb  XFSB.........s..
Mar  5 11:08:49 arms003-10 kernel: [  252.705459] eb901010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar  5 11:08:49 arms003-10 kernel: [  252.713489] eb901020: fc 4e e3 89 50 8f 42 aa be bc 07 0c 6e fa 83 2f  .N..P.B.....n../
Mar  5 11:08:49 arms003-10 kernel: [  252.721520] eb901030: 00 00 00 00 80 00 00 07 ff ff ff ff ff ff ff ff  ................
Mar  5 11:08:49 arms003-10 kernel: [  252.729558] XFS (sda1): metadata I/O error: block 0x48b9ff80 ("xfs_trans_read_buf_map") error 117 numblks 8
Mar  5 11:08:49 arms003-10 kernel: [  252.741953] XFS (sda1): Metadata corruption detected at xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data block 0x48b9ff80
Mar  5 11:08:49 arms003-10 kernel: [  252.753139] XFS (sda1): Unmount and run xfs_repair
Mar  5 11:08:49 arms003-10 kernel: [  252.757955] XFS (sda1): First 64 bytes of corrupted metadata buffer:
Mar  5 11:08:49 arms003-10 kernel: [  252.764336] eb901000: 58 46 53 42 00 00 10 00 00 00 00 00 91 73 fe fb  XFSB.........s..
Mar  5 11:08:49 arms003-10 kernel: [  252.772365] eb901010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Mar  5 11:08:49 arms003-10 kernel: [  252.780395] eb901020: fc 4e e3 89 50 8f 42 aa be bc 07 0c 6e fa 83 2f  .N..P.B.....n../
Mar  5 11:08:49 arms003-10 kernel: [  252.788417] eb901030: 00 00 00 00 80 00 00 07 ff ff ff ff ff ff ff ff  ................
Mar  5 11:08:49 arms003-10 kernel: [  252.796514] XFS (sda1): metadata I/O error: block 0x48b9ff80 ("xfs_trans_read_buf_map") error 117 numblks 8
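
For readers who want to interpret the 64-byte hex dump above, here is a minimal Python sketch that unpacks it using the field order at the start of the XFS on-disk superblock (magic, block size, data block count, realtime fields, UUID, log start, root inode). The field names and the helper decode_sb_prefix are illustrative choices, not something from the original report. The dump begins with the superblock magic "XFSB", whereas a v5 directory data block would normally begin with a directory magic such as "XDD3", so the sketch only shows what the returned buffer looks like; it does not establish why the read returned it.

```python
import struct
import uuid

# First 64 bytes of the "corrupted metadata buffer", copied from the
# arms003-10 dmesg lines above.
DUMP = """
58 46 53 42 00 00 10 00 00 00 00 00 91 73 fe fb
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
fc 4e e3 89 50 8f 42 aa be bc 07 0c 6e fa 83 2f
00 00 00 00 80 00 00 07 ff ff ff ff ff ff ff ff
"""


def decode_sb_prefix(hex_dump):
    """Unpack the first 64 bytes of an XFS superblock (big-endian on disk).

    Hypothetical helper for illustration; field names follow the usual
    sb_* naming but are assumptions of this sketch.
    """
    raw = bytes.fromhex("".join(hex_dump.split()))
    if len(raw) != 64:
        raise ValueError("expected exactly 64 bytes, got %d" % len(raw))
    (magic, blocksize, dblocks, rblocks, rextents,
     raw_uuid, logstart, rootino) = struct.unpack(">4sIQQQ16sQQ", raw)
    return {
        "magic": magic,
        "blocksize": blocksize,
        "dblocks": dblocks,
        "rblocks": rblocks,
        "rextents": rextents,
        "uuid": str(uuid.UUID(bytes=raw_uuid)),
        "logstart": logstart,
        "rootino": rootino,
    }


sb = decode_sb_prefix(DUMP)
for name, value in sb.items():
    print("%-10s %s" % (name, value))
print("fs size (bytes):", sb["dblocks"] * sb["blocksize"])
```

The decoded size (data blocks times block size) comes out at roughly 10 TB, which matches the disk size listed for the OSD nodes.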

What we have tried so far:
1. Deployed the OSD manually; confirmed that the same error still occurs.
2. Browsed the kernel bug-fix log, but found no related fix since kernel 4.4.39.
3. Upgraded xfsprogs from 3.1.9 to 4.15.0; the error number changed, but the disk is still corrupted while activating the OSD:

[2912641.987937] XFS (sda1): Metadata CRC error detected at xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data block 0xfffffff0
[2912641.999203] XFS (sda1): Unmount and run xfs_repair
[2912642.004202] XFS (sda1): First 64 bytes of corrupted metadata buffer:
[2912642.010759] e689a000: 58 46 53 42 00 00 10 00 00 00 00 00 91 73 fe fb  XFSB.........s..
[2912642.018958] e689a010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[2912642.027177] e689a020: 61 7b 64 0d fa fe 41 14 bf ea 90 32 6c 73 e5 ad  a{d...A....2ls..
[2912642.035388] e689a030: 00 00 00 00 50 00 00 08 ff ff ff ff ff ff ff ff  ....P...........
[2912642.043630] XFS (sda1): metadata I/O error: block 0xfffffff0 ("xfs_trans_read_buf_map") error 74 numblks 8
[2912642.060390] XFS (sda1): Metadata CRC error detected at xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data block 0xfffffff0
[2912642.071673] XFS (sda1): Unmount and run xfs_repair
   
4. Used the same disk in an OSD node on x86; confirmed that this does not trigger the problem.
5. Used sgdisk and mkfs.xfs to format the disk, mounted it, ran some dd read/write tests, then unmounted it; confirmed that this does not trigger the problem.
6. Changed the Ceph version from 12.2.2 to 10.2.10; confirmed that the problem still exists.
   10.2.0 Deploy log
  
   The corruption error is the same as with 12.2.2.
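
The "metadata I/O error" lines in the two kernel logs above differ in error number (117 before the xfsprogs upgrade, 74 after). As a rough aid for reading them, the sketch below parses such a line and translates the fields. It assumes the Linux errno numbering, where 117 is EUCLEAN (which XFS uses as EFSCORRUPTED) and 74 is EBADMSG (which XFS uses as EFSBADCRC), and it assumes the logged block number and numblks are in 512-byte units. The function name parse_io_error and the result keys are hypothetical.

```python
import re

# Linux errno values reused by XFS for its own error codes
# (hardcoded here so the sketch does not depend on the host platform).
XFS_ERRNO = {
    117: "EUCLEAN (EFSCORRUPTED)",
    74: "EBADMSG (EFSBADCRC)",
}

SECTOR_BYTES = 512  # assumption: XFS logs disk addresses in 512-byte units

LINE_RE = re.compile(
    r'metadata I/O error: block (0x[0-9a-fA-F]+) '
    r'\("([^"]+)"\) error (\d+) numblks (\d+)'
)


def parse_io_error(line):
    """Return a dict describing one XFS 'metadata I/O error' log line,
    or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    block = int(m.group(1), 16)
    err = int(m.group(3))
    numblks = int(m.group(4))
    return {
        "function": m.group(2),
        "block": block,
        "byte_offset": block * SECTOR_BYTES,
        "length_bytes": numblks * SECTOR_BYTES,
        "errno": err,
        "errno_name": XFS_ERRNO.get(err, "unknown"),
    }


LINES = [
    'XFS (sda1): metadata I/O error: block 0x48b9ff80 '
    '("xfs_trans_read_buf_map") error 117 numblks 8',
    'XFS (sda1): metadata I/O error: block 0xfffffff0 '
    '("xfs_trans_read_buf_map") error 74 numblks 8',
]

for line in LINES:
    print(parse_io_error(line))
```

Under these assumptions, both reads are 4096 bytes long (one filesystem block), and the second one sits 8 KiB below the 2 TiB mark on the partition.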










_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
