Re: how to debug pg inconsistent state - no ioerrors seen

On Tue, Aug 9, 2016 at 11:15 PM, Goncalo Borges
<goncalo.borges@xxxxxxxxxxxxx> wrote:
> Hi Greg...
>
> Thanks for replying. You seem omnipresent in all ceph/cephfs issues!
>
> Can you please confirm that, in Jewel, 'ceph pg repair' simply copies the pg
> contents of the primary osd to the others? And that this can lead to data
> corruption if the problematic osd is in fact the primary?
>
> If in Jewel there is some clever way for the system to know which osd has
> the problematic pg/object, then there is no real need to inspect the pgs on
> the different osds. If that is not the case, we need to find out which osd
> has the incorrect data.

I think some of the new scrub infrastructure went in, but I'm not sure
how much. David or Sam?
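
If you're on 10.2.x you could also try the new rados scrub reporting; I
haven't checked how much of the omap checking is wired into it, but
something along these lines (pool name is a placeholder, the pg id is the
one from Kenneth's report below) should at least tell you which shard the
scrub flagged:

  # rados list-inconsistent-pg <pool>
  # rados list-inconsistent-obj 6.2f4 --format=json-pretty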

> I am not sure if 'ceph pg <id> query' can give you a clear indication of
> what is wrong. I will have to look closely at the output of a 'ceph pg <id>
> query' and see what it reports once it happens again.
>
> The other alternative, i.e. using ceph-objectstore-tool to inspect an
> object's omap keys/values per osd, can't be used on an active / live osd.
> For example, just trying to get some info from a pg, I get:
>
>   # ceph-objectstore-tool --data /var/lib/ceph/osd/ceph-0 --journal
> /var/lib/ceph/osd/ceph-0/journal --op info --pgid 5.291
>   OSD has the store locked
>
> So, it seems that we have to stop the osd (and set noout on the cluster to
> avoid rebalancing) to release the lock, which will bring the cluster to a
> not-ok state (which is kind of ugly).

Ah yeah. I'm not aware of anything more elegant right now though. :(
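
If you do go the offline route, the usual sequence is roughly the sketch
below (untested by me on Jewel, and the systemctl unit name depends on how
your distro runs the OSDs):

  # ceph osd set noout
  # systemctl stop ceph-osd@0
  # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
        --journal-path /var/lib/ceph/osd/ceph-0/journal --op info --pgid 5.291
  # systemctl start ceph-osd@0
  # ceph osd unset noout

With noout set the cluster will show a warning but won't start rebalancing
while the OSD is down.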
-Greg

>
> Cheers
>
> G.
>
>
>
>
>
>
>
>
> On 08/10/2016 12:13 AM, Gregory Farnum wrote:
>>
>> On Tue, Aug 9, 2016 at 2:00 AM, Kenneth Waegeman
>> <kenneth.waegeman@xxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> I did a diff on the directories of all three osds: no difference. So I
>>> don't know what's wrong.
>>
>> omap (as implied by the omap_digest complaint) is stored in the OSD
>> leveldb, not in the data directories, so you wouldn't expect to see
>> any differences from a raw diff. I think you can extract the omaps as
>> well by using the ceph-objectstore-tool or whatever it's called (I
>> haven't done it myself) and compare those. Should see if you get more
>> useful information out of the pg query first, though!
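>>
>> A sketch of what I mean by extracting the omaps (flags from memory, so
>> double-check against the man page): with the OSD stopped, something like
>>
>>   # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
>>         --journal-path /var/lib/ceph/osd/ceph-0/journal \
>>         '<object json from --op list>' list-omap
>>
>> run against the same object on each of the three replicas, with the output
>> dumped to a file and diffed, should show which copy disagrees.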
>> -Greg
>>
>>> The only difference I see is a scrub file in the TEMP folder (this is
>>> already a different pg than in my last mail):
>>>
>>> -rw-r--r--    1 ceph ceph     0 Aug  9 09:51 scrub\u6.107__head_00000107__fffffffffffffff8
>>>
>>> But it is empty...
>>>
>>> Thanks!
>>>
>>>
>>>
>>> On 09/08/16 04:33, Goncalo Borges wrote:
>>>>
>>>> Hi Kenneth...
>>>>
>>>> The previous default behavior of 'ceph pg repair' was to copy the pg
>>>> objects from the primary osd to the others. I am not sure if that is
>>>> still the case in Jewel. For this reason, once we get this kind of error
>>>> in a data pool, the best practice is to compare the md5 checksums of the
>>>> damaged object on all osds involved in the inconsistent pg. Since we have
>>>> a 3-replica cluster, we should find a quorum of 2 good objects. If by
>>>> chance the primary osd has the wrong object, we should delete it before
>>>> running the repair.
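>>>>
>>>> On a data pool the check itself is just something like the following on
>>>> each osd host in the acting set (osd id, pg and object name are
>>>> placeholders):
>>>>
>>>>   # find /var/lib/ceph/osd/ceph-N/current/<pgid>_head/ -name '*<object>*' \
>>>>         -exec md5sum {} \;
>>>>
>>>> The two copies that agree are the ones to trust.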
>>>>
>>>> On a metadata pool, I am not sure exactly how to cross-check, since all
>>>> objects are size 0 and therefore md5sum is meaningless. Maybe one way
>>>> forward could be to check the contents of the pg directories (ex:
>>>> /var/lib/ceph/osd/ceph-0/current/5.161_head/) on all osds involved in the
>>>> pg and see if we spot something wrong?
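>>>>
>>>> As a first pass, something as simple as the following on each of the
>>>> involved osds, with the outputs diffed afterwards, might already show a
>>>> discrepancy (osd id and pg id are placeholders):
>>>>
>>>>   # ls -laR /var/lib/ceph/osd/ceph-N/current/<pgid>_head/ > /tmp/<pgid>-osd-N.txt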
>>>>
>>>> Cheers
>>>>
>>>> G.
>>>>
>>>>
>>>> On 08/08/2016 09:40 PM, Kenneth Waegeman wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Since last week, some pgs have been going into the inconsistent state
>>>>> after a scrub error. Last week we had 4 pgs in that state; they were on
>>>>> different OSDs, but all in the metadata pool.
>>>>> I did a pg repair on them, and all were healthy again. But now one pg is
>>>>> inconsistent again.
>>>>>
>>>>> with health detail I see:
>>>>>
>>>>> pg 6.2f4 is active+clean+inconsistent, acting [3,5,1]
>>>>> 1 scrub errors
>>>>>
>>>>> And in the log of the primary:
>>>>>
>>>>> 2016-08-06 06:24:44.723224 7fc5493f3700 -1 log_channel(cluster) log [ERR] : 6.2f4 shard 5: soid 6:2f55791f:::606.00000000:head omap_digest 0x3a105358 != best guess omap_digest 0xc85c4361 from auth shard 1
>>>>> 2016-08-06 06:24:53.931029 7fc54bbf8700 -1 log_channel(cluster) log [ERR] : 6.2f4 deep-scrub 0 missing, 1 inconsistent objects
>>>>> 2016-08-06 06:24:53.931055 7fc54bbf8700 -1 log_channel(cluster) log [ERR] : 6.2f4 deep-scrub 1 errors
>>>>>
>>>>> I looked in dmesg but I couldn't see any IO errors on any of the OSDs in
>>>>> the acting set. Last week it was a different set. It is of course
>>>>> possible that more than one OSD is failing, but how can we check this,
>>>>> since there is nothing more in the logs?
>>>>>
>>>>> Thanks !!
>>>>>
>>>>> K
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> Goncalo Borges
> Research Computing
> ARC Centre of Excellence for Particle Physics at the Terascale
> School of Physics A28 | University of Sydney, NSW  2006
> T: +61 2 93511937
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


