OSD recovery failed because of "leveldb: Corruption : checksum mismatch"

Dear guys,

 

I have a Ceph cluster (version 0.61.4) that is used as backend storage for KVM guests. The cluster has four nodes, and each node has three disks.

 

Because of a power failure, the cluster was shut down uncleanly several days ago. When I restarted all the nodes and started the Ceph service on each node, two OSDs stayed down and out, and the error message said the file systems of those disks needed repair, so I ran xfs_check and xfs_repair -L. After that I could mount the disks at their usual directories and the raw object data looked intact, but when I started the corresponding OSD services they went down and out again, and the error log shows "leveldb: Corruption: checksum mismatch". Because of this error, several PGs are "stale+active+clean" and some PGs are lost from the cluster.

 

The details of the error log are as follows:

 

2013-07-09 16:45:31.940767 7f9a5a7ee780  0 ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404), process ceph-osd, pid 4640
2013-07-09 16:45:31.986070 7f9a5a7ee780  0 filestore(/osd0) mount FIEMAP ioctl is supported and appears to work
2013-07-09 16:45:31.986084 7f9a5a7ee780  0 filestore(/osd0) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-07-09 16:45:31.986649 7f9a5a7ee780  0 filestore(/osd0) mount did NOT detect btrfs
2013-07-09 16:45:32.001812 7f9a5a7ee780  0 filestore(/osd0) mount syncfs(2) syscall fully supported (by glibc and kernel)
2013-07-09 16:45:32.001895 7f9a5a7ee780  0 filestore(/osd0) mount found snaps <>
2013-07-09 16:45:32.003550 7f9a5a7ee780 -1 filestore(/osd0) Error initializing leveldb: Corruption: checksum mismatch
2013-07-09 16:45:32.003619 7f9a5a7ee780 -1 ** ERROR: error converting store /osd0: (1) Operation not permitted

 

 

      Over the last few days I have tried several ways to resolve this problem and recover the OSD service, but all of them failed. I have also ruled out xfs_check and xfs_repair as the cause; they are not responsible for this issue. So I need your help, or some advice, to resolve this problem.
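For what it is worth, one idea I have not dared to run against the real data yet is LevelDB's own repair routine on the OSD's omap store. The following is only a rough sketch in Python, assuming the plyvel bindings are available and assuming the corrupted store is the leveldb directory at /osd0/current/omap (both are assumptions on my part); please tell me if this is a bad idea:

import shutil
import plyvel

# Assumed location of the corrupted store; the OSD must be stopped first.
OMAP_PATH = "/osd0/current/omap"

# Keep an untouched copy of the omap directory before doing anything.
shutil.copytree(OMAP_PATH, OMAP_PATH + ".bak")

# Let LevelDB salvage whatever it can from the .log/.sst files.
plyvel.repair_db(OMAP_PATH)

# Check that the repaired store at least opens and its keys are readable.
db = plyvel.DB(OMAP_PATH)
print("%d keys readable after repair" % sum(1 for _ in db.iterator()))
db.close()

My worry is that the repair may silently drop records and leave the OSD internally inconsistent, which is why I have not tried it on the live disks. Is there a supported way to achieve the same thing?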

 

      At the same time, I have some questions about the Ceph cluster; maybe someone can help me or give me a detailed explanation.

 

1)  Are there any tools or command lines to manually move or recover a PG from one OSD to another OSD? Or are there any ways to fix the leveldb issue?

 

2)  I use RBD for the guest block storage, and when I run the CLI "ceph osd pg map image-name" I can see only one PG for the RBD image. Does that mean an RBD image is stored in only one PG, and therefore that the maximum size of an RBD image is limited by the capacity of a single disk? (My rough understanding of the striping arithmetic is sketched after these questions.)

 

3)  Are there any ways or best practices to keep the cluster from losing PG data when two OSDs are down and out (pool size is 2)? Is customizing the cluster map and rule set to split the OSDs into different failure zones, like Swift's zone concept, a good way to do this? (A toy sketch of what I mean also follows below.)
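To make question 2) concrete, here is my rough understanding of the striping arithmetic, assuming the default 4 MiB RBD object size; the object-name format in the sketch is only illustrative, not necessarily the real on-disk naming. Please correct me if this picture is wrong:

# How I think an RBD image is striped over objects, and hence over PGs.
OBJECT_SIZE = 4 * 1024 * 1024            # default 4 MiB RBD object size

def object_for_offset(image_offset, prefix="rb.0.1234.5678"):
    """Return (illustrative object name, offset inside that object)."""
    index = image_offset // OBJECT_SIZE
    return "%s.%012x" % (prefix, index), image_offset % OBJECT_SIZE

image_size = 40 * 1024 * 1024 * 1024     # a 40 GiB image
num_objects = (image_size + OBJECT_SIZE - 1) // OBJECT_SIZE
print("a 40 GiB image is split into %d objects" % num_objects)   # 10240

# Each object should be hashed to its own PG, so the image as a whole is
# spread over many PGs and OSDs instead of living inside a single PG.
print(object_for_offset(0))
print(object_for_offset(5 * 1024 * 1024))

If that is right, then an image's size is not limited by a single disk, and the single PG I saw was presumably just the mapping of the one object name I passed to the command.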

 

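To make question 3) concrete as well, here is a toy sketch of the placement idea I have in mind. It is only a simulation of the concept, not how CRUSH actually computes placement, and the node names are made up to match my four-node, three-disks-per-node layout:

import random

# Hypothetical cluster map: 12 OSDs, 3 per node.
ZONE_OF = {"osd.%d" % i: "node%d" % (i // 3) for i in range(12)}

def place_replicas(pg_id, size=2):
    """Pick `size` OSDs for a PG without reusing a failure zone (node)."""
    rng = random.Random(pg_id)                   # deterministic per PG
    chosen, used = [], set()
    for osd in rng.sample(sorted(ZONE_OF), len(ZONE_OF)):
        if ZONE_OF[osd] not in used:
            chosen.append(osd)
            used.add(ZONE_OF[osd])
        if len(chosen) == size:
            break
    return chosen

# If the two replicas always land in different zones, losing both OSDs of
# any single node can never take out both copies of a PG.
for pg in range(4):
    print("pg %d -> %s" % (pg, place_replicas(pg)))

Is a customized CRUSH rule that chooses leaves at the host or rack level the right way to get this behaviour in practice?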
 

      I need your help, and any idea or suggestion is very much appreciated. Thanks.

 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
