Re: how to debug pg inconsistent state - no ioerrors seen

Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> · Wed, 10 Aug 2016 16:15:14 +1000

Hi Greg...

Thanks for replying. You seem omnipresent in all ceph/cephfs issues!

Can you please confirm that, in Jewel, 'ceph pg repair' simply copies 
the pg contents of the primary osd to the others? And that can lead to 
data corruption if the problematic osd is indeed the primary?

If in Jewel there is some clever way for the system to know which osd
has the problematic pg/object, then there is no real need to inspect the
pgs in the different osds. If that is not the case, we need to find out
which osd has the incorrect data.

I am not sure if 'ceph pg <id> query' can give you a clear indication of
what is wrong. I will have to look closely at the output of 'ceph pg <id>
query' and see what it reports once it happens.

The other alternative, i.e., using ceph-objectstore-tool to inspect an
object's omap key/values per osd, can't be used on an active / live
osd. For example, just trying to get some info from a pg, I get:

  # ceph-objectstore-tool --data /var/lib/ceph/osd/ceph-0 \
      --journal /var/lib/ceph/osd/ceph-0/journal --op info --pgid 5.291
  OSD has the store locked

So, it seems that we have to stop the osd (and set the cluster to noout
to avoid rebalancing) to release the lock, which will bring the cluster
to a not-ok state (which is kind of ugly).
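
To make the sequence described above concrete, here is a minimal dry-run
sketch. It only builds and prints the commands, it does not run them. The
ceph-objectstore-tool arguments are copied from the invocation above; the
'ceph osd set/unset noout' calls are the standard way to set that flag. The
systemd unit name ceph-osd@<id> and the helper name offline_inspect_plan are
assumptions for illustration and may not match your installation.

```python
import shlex


def offline_inspect_plan(osd_id, pgid, osd_root="/var/lib/ceph/osd"):
    """Return the ordered commands for inspecting a pg on a stopped osd.

    Assumption: the osd is managed by systemd as ceph-osd@<id>.
    """
    data = f"{osd_root}/ceph-{osd_id}"
    return [
        # Prevent the cluster from rebalancing while the osd is down.
        ["ceph", "osd", "set", "noout"],
        # Stop the osd so that it releases the lock on its store.
        ["systemctl", "stop", f"ceph-osd@{osd_id}"],
        # Same arguments as in the mail above.
        ["ceph-objectstore-tool", "--data", data,
         "--journal", f"{data}/journal", "--op", "info", "--pgid", pgid],
        # Bring the osd back and clear the flag.
        ["systemctl", "start", f"ceph-osd@{osd_id}"],
        ["ceph", "osd", "unset", "noout"],
    ]


if __name__ == "__main__":
    for argv in offline_inspect_plan(0, "5.291"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing the plan first makes it easy to review the order (noout before the
stop, unset only after the osd is back) before typing anything on a
production node.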

Cheers

G.

On 08/10/2016 12:13 AM, Gregory Farnum wrote:
On Tue, Aug 9, 2016 at 2:00 AM, Kenneth Waegeman
<kenneth.waegeman@xxxxxxxx> wrote:
Hi,

I did a diff on the directories of all three osds and found no difference, so I
don't know what's wrong.
omap (as implied by the omap_digest complaint) is stored in the OSD
leveldb, not in the data directories, so you wouldn't expect to see
any differences from a raw diff. I think you can extract the omaps as
well by using the ceph-objectstore-tool or whatever it's called (I
haven't done it myself) and compare those. Should see if you get more
useful information out of the pg query first, though!
-Greg
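
Once the omaps have been extracted, the comparison itself could look like
the sketch below. It assumes the omap key/values of the object have already
been exported from each osd into a plain dict per shard; how that export is
done is outside the sketch. The function name omap_outliers and the shard
labels are hypothetical.

```python
from collections import Counter


def omap_outliers(dumps):
    """Report omap keys on which the shards disagree.

    dumps maps a shard label to {omap_key: value}. The result maps each
    disputed key to the sorted list of shards that differ from the majority
    value (a missing key counts as a distinct value). If no value has a
    strict majority, every shard is listed for that key.
    """
    all_keys = set()
    for key_values in dumps.values():
        all_keys.update(key_values)

    report = {}
    for key in sorted(all_keys):
        values = {shard: kv.get(key) for shard, kv in dumps.items()}
        majority, votes = Counter(values.values()).most_common(1)[0]
        if votes == len(dumps):
            continue  # all shards agree on this key
        if votes * 2 <= len(dumps):
            report[key] = sorted(values)  # no strict majority
        else:
            report[key] = sorted(s for s, v in values.items() if v != majority)
    return report


if __name__ == "__main__":
    example = {
        "osd.3": {"k1": b"a", "k2": b"b"},
        "osd.5": {"k1": b"a", "k2": b"X"},
        "osd.1": {"k1": b"a", "k2": b"b"},
    }
    print(omap_outliers(example))
```

With three replicas, a key reported with a single shard points at the copy
that disagrees with the other two.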

The only thing I see that is different is a scrub file in the TEMP folder (this
is already a different pg than in my last mail):

-rw-r--r--    1 ceph ceph     0 Aug  9 09:51 scrub\u6.107__head_00000107__fffffffffffffff8

But it is empty.

Thanks!

On 09/08/16 04:33, Goncalo Borges wrote:
Hi Kenneth...

The previous default behavior of 'ceph pg repair' was to copy the pg
objects from the primary osd to the others. Not sure if it is still the case in
Jewel. For this reason, once we get this kind of error in a data pool, the
best practice is to compare the md5 checksums of the damaged object on all
osds involved in the inconsistent pg. Since we have a 3 replica cluster, we
should find a quorum of 2 good objects. If by chance the primary osd has the
wrong object, we should delete that copy before running the repair.
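
The quorum check described above can be sketched as follows. It assumes the
md5 of the object file has already been computed on each osd (md5_of shows
one way to do that for a local path) and that the results were collected
into a dict keyed by osd. The names md5_of and split_by_checksum are
hypothetical helpers, not part of any ceph tool.

```python
import hashlib
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the hex md5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def split_by_checksum(checksums):
    """Split osds into (good, bad) according to the majority checksum.

    checksums maps an osd label to the md5 of its copy of the object.
    Raises ValueError when no single checksum is held by more copies than
    any other, since then there is no quorum to trust.
    """
    groups = defaultdict(list)
    for osd, checksum in checksums.items():
        groups[checksum].append(osd)
    ranked = sorted(groups.values(), key=len, reverse=True)
    if len(ranked) > 1 and len(ranked[0]) == len(ranked[1]):
        raise ValueError("no majority checksum among the replicas")
    good = sorted(ranked[0])
    bad = sorted(osd for group in ranked[1:] for osd in group)
    return good, bad
```

If the osd that ends up in the bad list is the primary of the pg, that is
the case where running the repair as-is would be risky.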

On a metadata pool, I am not sure exactly how to cross check, since all
objects are size 0 and therefore md5sum is meaningless. Maybe one way
forward could be to check the contents of the pg directories (e.g.
/var/lib/ceph/osd/ceph-0/current/5.161_head/) on all osds involved in the
pg and see if we spot something wrong?

Cheers

G.

On 08/08/2016 09:40 PM, Kenneth Waegeman wrote:
Hi all,

Since last week, some pgs are going into the inconsistent state after a
scrub error. Last week we had 4 pgs in that state. They were on different
OSDs, but all in the metadata pool.
I did a pg repair on them, and all were healthy again. But now one pg is
inconsistent again.

With health detail I see:

pg 6.2f4 is active+clean+inconsistent, acting [3,5,1]
1 scrub errors

And in the log of the primary:

2016-08-06 06:24:44.723224 7fc5493f3700 -1 log_channel(cluster) log [ERR] : 6.2f4 shard 5: soid 6:2f55791f:::606.00000000:head omap_digest 0x3a105358 != best guess omap_digest 0xc85c4361 from auth shard 1
2016-08-06 06:24:53.931029 7fc54bbf8700 -1 log_channel(cluster) log [ERR] : 6.2f4 deep-scrub 0 missing, 1 inconsistent objects
2016-08-06 06:24:53.931055 7fc54bbf8700 -1 log_channel(cluster) log [ERR] : 6.2f4 deep-scrub 1 errors
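
When this happens on several pgs, it may help to pull the relevant fields
out of the omap_digest line automatically. The sketch below is written
against the exact wording of the log line above; other ceph versions may
word the message differently, so the regular expression is an assumption.
The helper name parse_omap_mismatch is hypothetical.

```python
import re

# Pattern modelled on the omap_digest mismatch message quoted above.
OMAP_MISMATCH = re.compile(
    r"(?P<pgid>\d+\.[0-9a-f]+) shard (?P<shard>\d+): soid (?P<soid>\S+) "
    r"omap_digest (?P<found>0x[0-9a-f]+) != best guess omap_digest "
    r"(?P<expected>0x[0-9a-f]+) from auth shard (?P<auth_shard>\d+)"
)


def parse_omap_mismatch(line):
    """Return the fields of an omap_digest mismatch line, or None.

    Whitespace is normalised first, so a log entry that was wrapped over
    several lines by a mail client still matches.
    """
    match = OMAP_MISMATCH.search(" ".join(line.split()))
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "2016-08-06 06:24:44.723224 7fc5493f3700 -1 log_channel(cluster) "
        "log [ERR] : 6.2f4 shard 5: soid 6:2f55791f:::606.00000000:head "
        "omap_digest 0x3a105358 != best guess omap_digest 0xc85c4361 "
        "from auth shard 1"
    )
    print(parse_omap_mismatch(sample))
```

For the line above this yields pg 6.2f4, shard 5 and auth shard 1. Assuming
shard numbers correspond to osd ids in a replicated pool, the copy being
flagged sits on osd 5, which is not the first osd in the acting set [3,5,1].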

I looked in dmesg but I couldn't see any IO errors on any of the OSDs in
the acting set.  Last week it was another set. It is of course possible more
than 1 OSD is failing, but how can we check this, since there is nothing
more in the logs?

Thanks !!

K
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937