Oh, if you were running dev releases, it's not super surprising that the stat tracking was at some point buggy.
-Sam

----- Original Message -----
From: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
To: "Samuel Just" <sjust@xxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Sent: Thursday, July 23, 2015 8:21:07 AM
Subject: Re: PGs going inconsistent after stopping the primary

Those pools were a few things: rgw.buckets plus a couple of pools we use for developing new librados clients. But the source of this issue is likely related to the few pre-hammer development releases (and crashes) we upgraded through whilst running a large-scale test.

Anyway, now I'll know how to better debug this in future, so we'll let you know if it reoccurs.

Cheers, Dan

On Wed, Jul 22, 2015 at 9:42 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> Annoying that we don't know what caused the replica's stat structure to get out of sync. Let us know if you see it recur. What were those pools used for?
> -Sam
>
> ----- Original Message -----
> From: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
> To: "Samuel Just" <sjust@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Sent: Wednesday, July 22, 2015 12:36:53 PM
> Subject: Re: PGs going inconsistent after stopping the primary
>
> Cool, writing some objects to the affected PGs has stopped the
> consistent/inconsistent cycle. I'll keep an eye on them, but this seems
> to have fixed the problem.
> Thanks!!
> Dan
>
> On Wed, Jul 22, 2015 at 6:07 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> Looks like it's just a stat error. The primary appears to have the correct stats, but the replica for some reason doesn't (it thinks there's an object). I bet it clears itself if you perform a write on the PG, since the primary will send over its stats. We'd need information from when the stat error originally occurred to debug further.
>> -Sam
>>
>> ----- Original Message -----
>> From: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>> To: ceph-users@xxxxxxxxxxxxxx
>> Sent: Wednesday, July 22, 2015 7:49:00 AM
>> Subject: PGs going inconsistent after stopping the primary
>>
>> Hi Ceph community,
>>
>> Env: hammer 0.94.2, Scientific Linux 6.6, kernel 2.6.32-431.5.1.el6.x86_64
>>
>> We wanted to post here before the tracker to see if someone else has
>> had this problem.
>>
>> We have a few PGs (in different pools) which get marked inconsistent when
>> we stop the primary OSD. The problem is strange because once we
>> restart the primary, then scrub the PG, the PG is marked active+clean.
>> But inevitably, the next time we stop the primary OSD, the same PG is
>> marked inconsistent again.
>>
>> There is no user activity on this PG, and nothing interesting is
>> logged on any of the 2nd/3rd OSDs (with debug_osd=20, the first line
>> mentioning the PG already says inactive+inconsistent).
>>
>> We suspect this is related to garbage files left in the PG folder. One
>> of our PGs is acting basically like the above, except it goes through this
>> cycle: active+clean -> (deep-scrub) -> active+clean+inconsistent ->
>> (repair) -> active+clean -> (restart primary OSD) -> (deep-scrub) ->
>> active+clean+inconsistent. This one at least logs:
>>
>> 2015-07-22 16:42:41.821326 osd.303 [INF] 55.10d deep-scrub starts
>> 2015-07-22 16:42:41.823834 osd.303 [ERR] 55.10d deep-scrub stat
>> mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0
>> hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes.
>> 2015-07-22 16:42:41.823842 osd.303 [ERR] 55.10d deep-scrub 1 errors
>>
>> and this should be debuggable because there is only one object in the pool:
>>
>> tapetest    55    0    0    73575G    1
>>
>> even though rados ls returns no objects:
>>
>> # rados ls -p tapetest
>> #
>>
>> Any ideas?
>>
>> Cheers, Dan
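
For reference, a rough sketch of the write-based workaround discussed above, assuming the affected PG is 55.10d in pool tapetest. The object name pg-55.10d-kick is made up for illustration; pick any name and check with ceph osd map that it actually hashes to the affected PG, then do a small write so the primary pushes its authoritative stats to the replicas, and finally deep-scrub again and confirm the PG stays active+clean:

# ceph osd map tapetest pg-55.10d-kick
# rados -p tapetest put pg-55.10d-kick /etc/hosts
# ceph pg deep-scrub 55.10d
# ceph pg 55.10d query | grep '"state"'

The dummy object can be removed again afterwards with rados -p tapetest rm pg-55.10d-kick; the delete is itself a write, so the corrected stats still propagate.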