Re: how to debug pg inconsistent state - no ioerrors seen

Hi Kenneth...

The previous default behavior of 'ceph pg repair' was to copy the pg objects from the primary osd to the others. I am not sure whether that is still the case in Jewel. For this reason, once we get this kind of error in a data pool, the best practice is to compare the md5 checksums of the damaged object on all osds involved in the inconsistent pg. Since we have a 3-replica cluster, we should find a quorum of 2 good objects. If by chance the primary osd holds the wrong object, that copy should be deleted before running the repair.
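A minimal sketch of that checksum comparison, assuming you have already located the object's file on each osd and can read it (or a copy of it) from one place. The function names, the osd ids and the digests used below are purely illustrative and are not part of any Ceph tool:

```python
import hashlib
from collections import Counter


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_quorum(checksums):
    """Given {osd_id: md5 hex digest}, return (majority_digest, bad_osds).

    If no digest is held by a strict majority of the replicas, return
    (None, all osd ids): in that case nothing can be decided automatically.
    """
    if not checksums:
        return None, []
    counts = Counter(checksums.values())
    digest, votes = counts.most_common(1)[0]
    if votes * 2 <= len(checksums):
        return None, sorted(checksums)
    bad = sorted(osd for osd, d in checksums.items() if d != digest)
    return digest, bad


# Hypothetical digests for a pg with acting set [3, 5, 1]
good, bad = find_quorum({3: "aa11", 5: "bb22", 1: "aa11"})
print("majority digest:", good, "| osds that disagree:", bad)
```

If the primary shows up in the list of osds that disagree, that is the case where its copy should be dealt with before the repair is run.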

On a metadata pool, I am not sure exactly how to cross-check, since all objects have size 0 and md5sum is therefore meaningless. One way forward might be to check the contents of the pg directories (e.g. /var/lib/ceph/osd/ceph-0/current/5.161_head/) on all osds involved in the pg and see whether anything looks wrong.
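As a rough sketch of what such a directory comparison could look like, assuming the pg directory of each osd is reachable from one machine (for example as a copy or a mount). The helper names and the osd labels are hypothetical. It only compares file names and sizes, so it would catch a missing or extra object file but not a difference in omap data:

```python
import os


def list_pg_dir(root):
    """Map each file's path (relative to root) to its size in bytes."""
    entries = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            entries[os.path.relpath(full, root)] = os.path.getsize(full)
    return entries


def diff_listings(listings):
    """Given {osd_label: {relative_path: size}}, return the paths whose
    presence or size differs between osds, as {path: {osd_label: size}}.
    A size of None means the file is missing on that osd.
    """
    all_paths = set()
    for entries in listings.values():
        all_paths.update(entries)
    mismatches = {}
    for path in sorted(all_paths):
        sizes = {osd: entries.get(path) for osd, entries in listings.items()}
        if len(set(sizes.values())) > 1:
            mismatches[path] = sizes
    return mismatches
```

In practice one would build the input with list_pg_dir() once per osd and pass the three results to diff_listings(); an empty result means the listings agree on names and sizes.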

Cheers

G.


On 08/08/2016 09:40 PM, Kenneth Waegeman wrote:
Hi all,

Since last week, some pgs have been going into the inconsistent state after a scrub error. Last week we had 4 pgs in that state. They were on different OSDs, but all in the metadata pool. I did a pg repair on them, and all were healthy again. But now one pg is inconsistent again.

With health detail I see:

pg 6.2f4 is active+clean+inconsistent, acting [3,5,1]
1 scrub errors

And in the log of the primary:

2016-08-06 06:24:44.723224 7fc5493f3700 -1 log_channel(cluster) log [ERR] : 6.2f4 shard 5: soid 6:2f55791f:::606.00000000:head omap_digest 0x3a105358 != best guess omap_digest 0xc85c4361 from auth shard 1
2016-08-06 06:24:53.931029 7fc54bbf8700 -1 log_channel(cluster) log [ERR] : 6.2f4 deep-scrub 0 missing, 1 inconsistent objects
2016-08-06 06:24:53.931055 7fc54bbf8700 -1 log_channel(cluster) log [ERR] : 6.2f4 deep-scrub 1 errors

I looked in dmesg but I couldn't see any IO errors on any of the OSDs in the acting set. Last week it was another set. It is of course possible more than 1 OSD is failing, but how can we check this, since there is nothing more in the logs?

Thanks !!

K
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937



