On Tue, Aug 9, 2016 at 2:00 AM, Kenneth Waegeman <kenneth.waegeman@xxxxxxxx> wrote:
> Hi,
>
> I did a diff on the directories of all three OSDs, no difference, so I
> don't know what's wrong.

omap (as implied by the omap_digest complaint) is stored in the OSD leveldb,
not in the data directories, so you wouldn't expect to see any differences
from a raw diff. I think you can extract the omaps as well by using
ceph-objectstore-tool or whatever it's called (I haven't done it myself) and
compare those. You should see if you get more useful information out of the
pg query first, though! (Rough command sketches for both this and the md5
comparison quoted below are appended at the end of this message.)
-Greg

> Only thing I see different is a scrub file in the TEMP folder (it is
> already a different pg from the one in my last mail):
>
> -rw-r--r-- 1 ceph ceph 0 Aug 9 09:51
> scrub\u6.107__head_00000107__fffffffffffffff8
>
> But it is empty..
>
> Thanks!
>
>
> On 09/08/16 04:33, Goncalo Borges wrote:
>>
>> Hi Kenneth...
>>
>> The previous default behavior of 'ceph pg repair' was to copy the pg
>> objects from the primary osd to the others. Not sure if it is still the
>> case in Jewel. For this reason, once we get these kinds of errors in a
>> data pool, the best practice is to compare the md5 checksums of the
>> damaged object on all osds involved in the inconsistent pg. Since we
>> have a 3-replica cluster, we should find a quorum of 2 good objects. If
>> by chance the primary osd has the wrong object, we should delete it
>> before running the repair.
>>
>> On a metadata pool, I am not sure exactly how to cross-check, since all
>> objects are size 0 and therefore md5sum is meaningless. Maybe one way
>> forward could be to check the contents of the pg directories (ex:
>> /var/lib/ceph/osd/ceph-0/current/5.161_head/) on all osds involved in
>> the pg and see if we spot something wrong?
>>
>> Cheers
>>
>> G.
>>
>>
>> On 08/08/2016 09:40 PM, Kenneth Waegeman wrote:
>>>
>>> Hi all,
>>>
>>> Since last week, some pgs have been going into the inconsistent state
>>> after a scrub error. Last week we had 4 pgs in that state. They were on
>>> different OSDs, but all in the metadata pool.
>>> I did a pg repair on them, and all were healthy again. But now one pg
>>> is inconsistent again.
>>>
>>> With health detail I see:
>>>
>>> pg 6.2f4 is active+clean+inconsistent, acting [3,5,1]
>>> 1 scrub errors
>>>
>>> And in the log of the primary:
>>>
>>> 2016-08-06 06:24:44.723224 7fc5493f3700 -1 log_channel(cluster) log [ERR]
>>> : 6.2f4 shard 5: soid 6:2f55791f:::606.00000000:head omap_digest 0x3a105358
>>> != best guess omap_digest 0xc85c4361 from auth shard 1
>>> 2016-08-06 06:24:53.931029 7fc54bbf8700 -1 log_channel(cluster) log [ERR]
>>> : 6.2f4 deep-scrub 0 missing, 1 inconsistent objects
>>> 2016-08-06 06:24:53.931055 7fc54bbf8700 -1 log_channel(cluster) log [ERR]
>>> : 6.2f4 deep-scrub 1 errors
>>>
>>> I looked in dmesg but I couldn't see any I/O errors on any of the OSDs
>>> in the acting set. Last week it was another set. It is of course
>>> possible that more than 1 OSD is failing, but how can we check this,
>>> since there is nothing more in the logs?
>>>
>>> Thanks !!
>>>
>>> K

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
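
For reference, here is a rough, untested sketch of the suggestion above (the
pg query, plus dumping the omap of the flagged object with
ceph-objectstore-tool). The OSD ids and paths are taken from the acting set
in the thread and the default FileStore layout, and the exact
ceph-objectstore-tool syntax can differ between versions, so check its
--help output before relying on any of this:

  # The pg query may already point at the bad shard:
  ceph pg 6.2f4 query

  # On Jewel, rados can also report exactly what the scrub flagged:
  rados list-inconsistent-obj 6.2f4 --format=json-pretty

  # To dump the omap on one replica, stop that OSD first
  # (ceph-objectstore-tool needs exclusive access to the store):
  systemctl stop ceph-osd@3        # or however you normally stop an OSD

  # Find the JSON id of the flagged object in the pg:
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
      --pgid 6.2f4 --op list | grep 606.00000000

  # Feed that JSON id back in to list the omap keys
  # (get-omap <key> fetches individual values):
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
      '<json-object-id-from-the-list-output>' list-omap > /tmp/omap.osd3

  systemctl start ceph-osd@3

  # Repeat on OSDs 5 and 1, then diff the three dumps.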
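
Similarly, a sketch of the md5 comparison Goncalo describes. Note this only
helps on a data pool: for the metadata pool in this thread the files are
empty and the mismatch is in the omap, which is why the omap dump above is
the more useful check here. The path and file-name pattern are guesses based
on the default FileStore layout and the object named in the scrub error:

  # Run on each host holding a replica of pg 6.2f4 (acting set [3,5,1]),
  # substituting the right OSD id. FileStore escapes object names on disk,
  # so match loosely:
  find /var/lib/ceph/osd/ceph-3/current/6.2f4_head/ \
      -name '*606.00000000*head*' -exec md5sum {} \;

  # With 3 replicas, two matching checksums identify the good copies. Only
  # if the primary holds the odd copy out, move it aside (with that OSD
  # stopped) before running:
  ceph pg repair 6.2f4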