Re: PGs going inconsistent after stopping the primary

Looks like it's just a stat error. The primary appears to have the correct stats, but for some reason the replica doesn't (it thinks there's an object). I bet it clears itself if you perform a write on the PG, since the primary will then send over its stats. We'd need information from when the stat error originally occurred to debug further.
-Sam
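A minimal sketch of the "perform a write on the PG" idea: probe candidate object names until one maps to the inconsistent PG, then write and remove that object so the primary has a reason to push its stats. It assumes the `ceph osd map` output format `... -> pg <raw> (<pgid>) -> up (...) acting (...)`. The helper names `pg_from_osd_map` and `find_object_in_pg` and the `pgtouch-N` object prefix are hypothetical. Whether the write actually clears the replica's stats is Sam's guess above, not something this sketch verifies.

```shell
# Hypothetical helper: pull the mapped PG id (e.g. "55.10d") out of one line of
# `ceph osd map <pool> <object>` output. Assumes the format
#   osdmap eN pool '<pool>' (<id>) object '<obj>' -> pg <raw> (<pgid>) -> up (...) acting (...)
pg_from_osd_map() {
    sed -n 's/.* -> pg [^ ]* (\([^)]*\)) -> .*/\1/p'
}

# Hypothetical helper: try object names "pgtouch-0", "pgtouch-1", ... until one
# maps to the target PG, and print that name. Requires a working `ceph` CLI.
find_object_in_pg() {
    pool=$1
    target=$2
    i=0
    while [ "$i" -lt 100000 ]; do
        name="pgtouch-$i"
        pg=$(ceph osd map "$pool" "$name" | pg_from_osd_map)
        if [ "$pg" = "$target" ]; then
            echo "$name"
            return 0
        fi
        i=$((i + 1))
    done
    return 1    # no candidate name landed in the target PG
}

# Usage (not run here):
#   obj=$(find_object_in_pg tapetest 55.10d)
#   echo touch > /tmp/pgtouch
#   rados -p tapetest put "$obj" /tmp/pgtouch   # the write suggested above
#   rados -p tapetest rm "$obj"                 # remove the probe object again
#   ceph pg deep-scrub 55.10d                   # then re-check the PG
```

Searching by name rather than writing blindly keeps the probe to a single object in the one PG of interest.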

----- Original Message -----
From: "Dan van der Ster" <dan@xxxxxxxxxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx
Sent: Wednesday, July 22, 2015 7:49:00 AM
Subject: PGs going inconsistent after stopping the primary

Hi Ceph community,

Env: hammer 0.94.2, Scientific Linux 6.6, kernel 2.6.32-431.5.1.el6.x86_64

We wanted to post here before the tracker to see if someone else has
had this problem.

We have a few PGs (in different pools) which get marked inconsistent
when we stop the primary OSD. The problem is strange because once we
restart the primary and then scrub the PG, it is marked active+clean.
But inevitably, the next time we stop the primary OSD, the same PG is
marked inconsistent again.

There is no user activity on this PG, and nothing interesting is
logged on the 2nd/3rd OSDs (with debug_osd=20, the first line
mentioning the PG already says inactive+inconsistent).


We suspect this is related to garbage files left in the PG folder. One
of our PGs behaves much like the above, except that it goes through this
cycle: active+clean -> (deep-scrub) -> active+clean+inconsistent ->
(repair) -> active+clean -> (restart primary OSD) -> (deep-scrub) ->
active+clean+inconsistent. This one at least logs something:

2015-07-22 16:42:41.821326 osd.303 [INF] 55.10d deep-scrub starts
2015-07-22 16:42:41.823834 osd.303 [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes.
2015-07-22 16:42:41.823842 osd.303 [ERR] 55.10d deep-scrub 1 errors
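To see at a glance which counters disagree, here is a small sketch that splits a "stat mismatch" line into its A/B pairs and prints only the unequal ones. It assumes that in each "got A/B" pair, A is the count the scrub found and B is the value in the PG's recorded stats. `stat_mismatches` is a hypothetical name, and the log line must be fed in as one unwrapped line.

```shell
# Hypothetical helper: read a scrub "stat mismatch" log line on stdin and print
# each counter whose two values differ. Assumes the "got A/B" pairs mean
# A = counted by the scrub, B = recorded in the PG stats.
# Pipeline: sed keeps only the counter list after "got ", tr puts one
# "A/B name" pair per line, awk compares the two numbers of each pair.
stat_mismatches() {
    sed -n 's/.*stat mismatch, got //p' |
    tr ',' '\n' |
    awk '{
        sub(/\.$/, "")                       # drop the trailing period
        if (NF < 2) next                     # skip empty pieces
        split($1, v, "/")                    # v[1] = scrub count, v[2] = recorded stat
        name = $2
        for (i = 3; i <= NF; i++) name = name " " $i
        if (v[1] != v[2]) printf "%s: scrub=%s stats=%s\n", name, v[1], v[2]
    }'
}

# Usage (cluster log path is an assumption):
#   grep 'stat mismatch' /var/log/ceph/ceph.log | stat_mismatches
```

On the 55.10d line it flags only `objects` and `dirty` (scrub found 0, stats say 1). That fits the stats holding one object that `rados ls` can't see.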

and this should be debuggable, because the pool stats show only one object in the pool:

    tapetest               55           0         0        73575G           1

even though rados ls returns no objects:

# rados ls -p tapetest
#
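The disagreement above (pool stats say 1 object, `rados ls` lists none) can be checked mechanically. Here is a minimal sketch. It assumes the pool row comes from `ceph df` with OBJECTS as its last column, as in the `tapetest` row quoted above. `check_pool_count` is a hypothetical name.

```shell
# Hypothetical helper: compare the object count in a pool's `ceph df` row
# (last column assumed to be OBJECTS) with the number of names that
# `rados ls` prints on stdin. Returns 1 on a mismatch.
check_pool_count() {
    recorded=$(printf '%s\n' "$1" | awk '{ print $NF }')
    listed=$(grep -c . || true)    # grep exits 1 when it counts zero lines
    if [ "$recorded" != "$listed" ]; then
        echo "mismatch: stats say $recorded object(s), rados ls lists $listed"
        return 1
    fi
    echo "ok: $listed object(s)"
}

# Usage (not run here):
#   rados ls -p tapetest | check_pool_count "$(ceph df | grep -w tapetest)"
```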

Any ideas?

Cheers, Dan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com