Those pools were a few things: rgw.buckets plus a couple pools we use
for developing new librados clients. But the source of this issue is
likely related to the few pre-hammer development releases (and
crashes) we upgraded through whilst running a large scale test.
Anyway, now I'll know how to better debug this in future so we'll let
you know if it reoccurs.

Cheers, Dan

On Wed, Jul 22, 2015 at 9:42 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> Annoying that we don't know what caused the replica's stat structure to get out of sync. Let us know if you see it recur. What were those pools used for?
> -Sam
>
> ----- Original Message -----
> From: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
> To: "Samuel Just" <sjust@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Sent: Wednesday, July 22, 2015 12:36:53 PM
> Subject: Re: PGs going inconsistent after stopping the primary
>
> Cool, writing some objects to the affected PGs has stopped the
> consistent/inconsistent cycle. I'll keep an eye on them but this seems
> to have fixed the problem.
> Thanks!!
> Dan
>
> On Wed, Jul 22, 2015 at 6:07 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> Looks like it's just a stat error. The primary appears to have the correct stats, but the replica for some reason doesn't (thinks there's an object for some reason). I bet it clears itself if you perform a write on the pg since the primary will send over its stats. We'd need information from when the stat error originally occurred to debug further.
>> -Sam
>>
>> ----- Original Message -----
>> From: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
>> To: ceph-users@xxxxxxxxxxxxxx
>> Sent: Wednesday, July 22, 2015 7:49:00 AM
>> Subject: PGs going inconsistent after stopping the primary
>>
>> Hi Ceph community,
>>
>> Env: hammer 0.94.2, Scientific Linux 6.6, kernel 2.6.32-431.5.1.el6.x86_64
>>
>> We wanted to post here before the tracker to see if someone else has
>> had this problem.
>>
>> We have a few PGs (different pools) which get marked inconsistent when
>> we stop the primary OSD. The problem is strange because once we
>> restart the primary, then scrub the PG, the PG is marked active+clean.
>> But inevitably next time we stop the primary OSD, the same PG is
>> marked inconsistent again.
>>
>> There is no user activity on this PG, and nothing interesting is
>> logged in any of the 2nd/3rd OSDs (with debug_osd=20, the first line
>> mentioning the PG already says inactive+inconsistent).
>>
>>
>> We suspect this is related to garbage files left in the PG folder. One
>> of our PGs is acting basically like above, except it goes through this
>> cycle: active+clean -> (deep-scrub) -> active+clean+inconsistent ->
>> (repair) -> active+clean -> (restart primary OSD) -> (deep-scrub) ->
>> active+clean+inconsistent. This one at least logs:
>>
>> 2015-07-22 16:42:41.821326 osd.303 [INF] 55.10d deep-scrub starts
>> 2015-07-22 16:42:41.823834 osd.303 [ERR] 55.10d deep-scrub stat
>> mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0
>> hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes.
>> 2015-07-22 16:42:41.823842 osd.303 [ERR] 55.10d deep-scrub 1 errors
>>
>> and this should be debuggable because there is only one object in the pool:
>>
>> tapetest 55 0 0 73575G 1
>>
>> even though rados ls returns no objects:
>>
>> # rados ls -p tapetest
>> #
>>
>> Any ideas?
>>
>> Cheers, Dan
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
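
For anyone chasing a similar "stat mismatch" line, here is a minimal sketch of pulling the counter pairs out of the deep-scrub message quoted in the thread and listing which ones disagree. The function names are hypothetical helpers, not part of any Ceph tooling, and the field names are simply those that appear in the log line above; the sketch makes no assumption about which side of each N/M pair is the scrubbed value and which is the recorded one.

```python
import re

# Matches pairs such as "0/1 objects" or "0/0 hit_set_archive bytes".
# Field names are taken from the log line quoted in the thread.
PAIR_RE = re.compile(r"(\d+)/(\d+)\s+([a-z_]+(?: bytes)?)")


def parse_stat_mismatch(line):
    """Return {field: (left, right)} for a scrub 'stat mismatch' line.

    Returns an empty dict if the line is not a stat mismatch message.
    """
    _, sep, tail = line.partition("stat mismatch, got")
    if not sep:
        return {}
    return {name: (int(a), int(b)) for a, b, name in PAIR_RE.findall(tail)}


def mismatched_fields(line):
    """Return the sorted names of counters whose two values differ."""
    stats = parse_stat_mismatch(line)
    return sorted(name for name, (a, b) in stats.items() if a != b)


# The [ERR] line from the thread, joined back onto a single line.
LINE = (
    "2015-07-22 16:42:41.823834 osd.303 [ERR] 55.10d deep-scrub stat "
    "mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0 "
    "hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes."
)

if __name__ == "__main__":
    print(mismatched_fields(LINE))  # ['dirty', 'objects']
```

On the line from this thread only the objects and dirty counters differ, which fits Sam's reading that one side believes the PG holds a single object while the other sees none.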