spontaneous pg inconsistencies in the rgw.gc pool

Dan van der Ster <dan@xxxxxxxxxxxxxx> · Thu, 18 Apr 2013 15:52:42 +0200

Hi,

tl;dr: something deleted the objects from the .rgw.gc pool, and then the
pgs went inconsistent. Is this normal?

Just now we had scrub errors and resulting inconsistencies on many of
the pgs belonging to our .rgw.gc pool.

HEALTH_ERR 119 pgs inconsistent; 119 scrub errors
pg 11.1f0 is active+clean+inconsistent, acting [35,28,4]
pg 11.1f8 is active+clean+inconsistent, acting [35,28,4]
pg 11.1fb is active+clean+inconsistent, acting [11,34,38]
pg 11.1e0 is active+clean+inconsistent, acting [35,28,4]
pg 11.1e3 is active+clean+inconsistent, acting [11,34,38]
…
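
To see at a glance which OSDs are involved, health output like the above can be grouped by acting set. This is only a rough sketch: it assumes the text has the `pg <id> is <state>, acting [...]` layout shown above (as printed by `ceph health detail`), and the function name is made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   pg 11.1f0 is active+clean+inconsistent, acting [35,28,4]
PG_LINE = re.compile(r"^pg (\d+)\.([0-9a-f]+) is (\S+), acting \[([\d,]+)\]")


def inconsistent_by_acting_set(health_detail):
    """Group inconsistent pg ids by their acting set of OSDs (hypothetical helper)."""
    groups = defaultdict(list)
    for line in health_detail.splitlines():
        m = PG_LINE.match(line.strip())
        if not m:
            continue  # summary line or anything else we do not understand
        pool, seed, state, acting = m.groups()
        if "inconsistent" not in state.split("+"):
            continue
        osds = tuple(int(o) for o in acting.split(","))
        groups[osds].append(f"{pool}.{seed}")
    return dict(groups)


# The lines quoted in this mail, used as sample input.
sample = """\
HEALTH_ERR 119 pgs inconsistent; 119 scrub errors
pg 11.1f0 is active+clean+inconsistent, acting [35,28,4]
pg 11.1f8 is active+clean+inconsistent, acting [35,28,4]
pg 11.1fb is active+clean+inconsistent, acting [11,34,38]
pg 11.1e0 is active+clean+inconsistent, acting [35,28,4]
pg 11.1e3 is active+clean+inconsistent, acting [11,34,38]
"""

for osds, pgs in inconsistent_by_acting_set(sample).items():
    print(osds, pgs)
```

On the five lines quoted here, this yields two groups, one per acting set.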

[root@ceph-mon1 ~]# ceph osd lspools
0 data,1 metadata,2 rbd,6 volumes,7 images,9 afs,10 .rgw,11 .rgw.gc,12 .rgw.control,13 .users.uid,14 .users.email,15 .users,16 .rgw.buckets,17 .usage,
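
The number before the dot in a pg id is the pool id, which is how the 11.x pgs map to .rgw.gc. A small sketch of that lookup, assuming the comma-separated `<id> <name>` layout shown above (possibly wrapped across lines); both helper names are hypothetical.

```python
def parse_lspools(output):
    """Map pool id -> pool name from `ceph osd lspools` text in the layout above."""
    pools = {}
    # Collapse line wrapping first, then split the comma-separated entries.
    for entry in " ".join(output.split()).split(","):
        entry = entry.strip()
        if not entry:
            continue  # trailing comma leaves an empty entry
        pool_id, name = entry.split(" ", 1)
        pools[int(pool_id)] = name
    return pools


def pool_of(pgid, pools):
    """Return the pool name for a pg id such as '11.1f0'."""
    return pools[int(pgid.split(".")[0])]


# The lspools output quoted in this mail, wrapped as it was sent.
lspools_output = """\
0 data,1 metadata,2 rbd,6 volumes,7 images,9 afs,10 .rgw,11 .rgw.gc,12
.rgw.control,13 .users.uid,14 .users.email,15 .users,16
.rgw.buckets,17 .usage,
"""

pools = parse_lspools(lspools_output)
print(pool_of("11.1f0", pools))
```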

On the relevant hosts, I checked what was in the directories for those pgs:

[root@lxfsrc4906 ~]# ls -l //var/lib/ceph/osd/ceph-35/current/11.1f0_head/ -a
total 20
drwxr-xr-x.   2 root root     6 Apr 16 10:48 .
drwxr-xr-x. 419 root root 12288 Apr 16 11:15 ..

They were all empty like that. I checked the log files:

2013-04-18 14:53:56.532054 7fe5457fb700  0 log [ERR] : 11.0 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
2013-04-18 14:53:56.532065 7fe5457fb700  0 log [ERR] : 11.0 deep-scrub 1 errors
2013-04-18 14:53:59.532401 7fe5457fb700  0 log [ERR] : 11.8 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
2013-04-18 14:53:59.532411 7fe5457fb700  0 log [ERR] : 11.8 deep-scrub 1 errors
2013-04-18 14:54:01.532602 7fe5457fb700  0 log [ERR] : 11.10 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
2013-04-18 14:54:01.532614 7fe5457fb700  0 log [ERR] : 11.10 deep-scrub 1 errors
2013-04-18 14:54:02.532839 7fe5457fb700  0 log [ERR] : 11.18 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
2013-04-18 14:54:02.532848 7fe5457fb700  0 log [ERR] : 11.18 deep-scrub 1 errors
…
2013-04-18 14:57:14.554431 7fe5457fb700  0 log [ERR] : 11.1f0 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
2013-04-18 14:57:14.554438 7fe5457fb700  0 log [ERR] : 11.1f0 deep-scrub 1 errors
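
To check whether every affected pg reports the same mismatch, the log lines can be pulled apart with a regex. This is a sketch only. My reading, which is an assumption, is that in "got 0/3 objects" the first number is what the scrub counted and the second is what the pg stats record. The helper name is hypothetical, and the pattern tolerates the line wrapping added by mail clients.

```python
import re

# \s+ between tokens so that wrapped log lines still match.
MISMATCH = re.compile(
    r"\[ERR\] : (\d+\.[0-9a-f]+)\s+deep-scrub\s+stat\s+mismatch,\s+"
    r"got\s+(\d+)/(\d+)\s+objects,\s+(\d+)/(\d+)\s+clones,\s+(\d+)/(\d+)\s+bytes"
)


def stat_mismatches(log_text):
    """Return {pgid: {"objects": (a, b), "clones": (a, b), "bytes": (a, b)}}.

    Each tuple is the pair printed as "a/b" in the log line; treating it as
    (counted by scrub, recorded in pg stats) is an assumption.
    """
    found = {}
    for m in MISMATCH.finditer(log_text):
        nums = [int(g) for g in m.groups()[1:]]
        found[m.group(1)] = {
            "objects": (nums[0], nums[1]),
            "clones": (nums[2], nums[3]),
            "bytes": (nums[4], nums[5]),
        }
    return found


# Two entries quoted in this mail, one of them wrapped as it was sent.
log_sample = """\
2013-04-18 14:53:56.532054 7fe5457fb700  0 log [ERR] : 11.0 deep-scrub
stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
2013-04-18 14:53:56.532065 7fe5457fb700  0 log [ERR] : 11.0 deep-scrub 1 errors
2013-04-18 14:54:01.532602 7fe5457fb700  0 log [ERR] : 11.10
deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
2013-04-18 14:54:01.532614 7fe5457fb700  0 log [ERR] : 11.10 deep-scrub 1 errors
"""

for pgid, stats in stat_mismatches(log_sample).items():
    print(pgid, stats)
```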

So it looks like something deleted all the objects from those pg directories.
Next I tried a repair:

[root@ceph-mon1 ~]# ceph pg repair 11.1f0
instructing pg 11.1f0 on osd.35 to repair
[root@ceph-mon1 ~]# ceph -w
…
2013-04-18 15:19:23.676728 osd.35 [ERR] 11.1f0 repair stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
2013-04-18 15:19:23.676783 osd.35 [ERR] 11.1f0 repair 1 errors, 1 fixed
[root@ceph-mon1 ~]# ceph pg deep-scrub 11.1f0
instructing pg 11.1f0 on osd.35 to deep-scrub
[root@ceph-mon1 ~]# ceph -w
…
2013-04-18 15:20:21.769446 mon.0 [INF] pgmap v31714: 3808 pgs: 3690 active+clean, 118 active+clean+inconsistent; 73284 MB data, 276 GB used, 44389 GB / 44665 GB avail
2013-04-18 15:20:17.677058 osd.35 [INF] 11.1f0 deep-scrub ok

So indeed the repair "fixed" the problem (now there are only 118
inconsistent pgs, down from 119). And note that there is still nothing
in the directory for that pg, as expected:

[root@lxfsrc4906 ~]# ls -l //var/lib/ceph/osd/ceph-35/current/11.1f0_head/ -a
total 20
drwxr-xr-x.   2 root root     6 Apr 16 10:48 .
drwxr-xr-x. 419 root root 12288 Apr 16 11:15 ..
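
If repair turns out to be the right thing for the remaining pgs (which is exactly what I am unsure about), the same `ceph pg repair` command could be generated for each of them. A minimal sketch that only prints the commands and does not run anything; it assumes the `pg <id> is <state>, acting [...]` health output layout shown earlier, and the function name is hypothetical.

```python
import re

PG_STATE = re.compile(r"^pg (\d+\.[0-9a-f]+) is (\S+), acting")


def repair_commands(health_detail, pool_id):
    """Build 'ceph pg repair <pgid>' lines for inconsistent pgs of one pool."""
    cmds = []
    for line in health_detail.splitlines():
        m = PG_STATE.match(line.strip())
        if not m:
            continue
        pgid, state = m.groups()
        if "inconsistent" in state.split("+") and pgid.startswith(f"{pool_id}."):
            cmds.append(f"ceph pg repair {pgid}")
    return cmds


# Health lines quoted at the top of this mail.
health_sample = """\
pg 11.1f8 is active+clean+inconsistent, acting [35,28,4]
pg 11.1fb is active+clean+inconsistent, acting [11,34,38]
pg 11.1e0 is active+clean+inconsistent, acting [35,28,4]
"""

# Print for review; nothing is executed against the cluster.
for cmd in repair_commands(health_sample, 11):
    print(cmd)
```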

So my question is: can anyone explain what happened here? It seems
that something deleted the objects from the .rgw.gc pool (as one would
expect) but the pgs were left inconsistent afterwards.

Best Regards,
Dan van der Ster
CERN IT