Replying to myself... I just noticed this:

[root@ceph-radosgw01 ceph]# ls -lh /var/log/ceph/
total 27G
-rw-r--r--. 1 root root 27G Apr 18 16:08 radosgw.log
-rw-r--r--. 1 root root  20 Apr  5 03:13 radosgw.log-20130405.gz
-rw-r--r--. 1 root root  20 Apr  6 03:14 radosgw.log-20130406.gz
-rw-r--r--. 1 root root  20 Apr  7 03:50 radosgw.log-20130407.gz
-rw-r--r--. 1 root root  20 Apr  8 03:29 radosgw.log-20130408.gz
-rw-r--r--. 1 root root  20 Apr  9 03:19 radosgw.log-20130409.gz
-rw-r--r--. 1 root root  20 Apr 10 03:15 radosgw.log-20130410.gz
-rw-r--r--. 1 root root   0 Apr 11 03:48 radosgw.log-20130411
[root@ceph-radosgw01 ceph]# df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg1-root   37G   37G     0 100% /

The radosgw log filled up the disk. Perhaps this caused the problem.

Cheers,
Dan
CERN IT

On Thu, Apr 18, 2013 at 3:52 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> Hi,
>
> tl;dr: something deleted the objects from the .rgw.gc pool and then the
> pgs went inconsistent. Is this normal??!!
>
> Just now we had scrub errors and resulting inconsistencies on many of
> the pgs belonging to our .rgw.gc pool.
>
> HEALTH_ERR 119 pgs inconsistent; 119 scrub errors
> pg 11.1f0 is active+clean+inconsistent, acting [35,28,4]
> pg 11.1f8 is active+clean+inconsistent, acting [35,28,4]
> pg 11.1fb is active+clean+inconsistent, acting [11,34,38]
> pg 11.1e0 is active+clean+inconsistent, acting [35,28,4]
> pg 11.1e3 is active+clean+inconsistent, acting [11,34,38]
> …
>
> [root@ceph-mon1 ~]# ceph osd lspools
> 0 data,1 metadata,2 rbd,6 volumes,7 images,9 afs,10 .rgw,11 .rgw.gc,
> 12 .rgw.control,13 .users.uid,14 .users.email,15 .users,
> 16 .rgw.buckets,17 .usage,
>
> On the relevant hosts, I checked what was in those directories:
>
> [root@lxfsrc4906 ~]# ls -l //var/lib/ceph/osd/ceph-35/current/11.1f0_head/ -a
> total 20
> drwxr-xr-x.   2 root root     6 Apr 16 10:48 .
> drwxr-xr-x. 419 root root 12288 Apr 16 11:15 ..
>
> They were all empty like that.
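Coming back to the full disk noted at the top of this message: simply rm'ing radosgw.log would not free any space while radosgw still holds the file open, so truncating in place is the safer way to recover. A minimal sketch (the helper name is mine; the path is the one from the listing above):

```shell
#!/bin/sh
# Truncate a log file in place. Unlike `rm`, this frees the disk space
# even while a daemon (here radosgw) keeps the file descriptor open,
# because the inode is shrunk rather than unlinked.
truncate_log() {
    : > "$1"
}

# Usage (path from the listing above):
# truncate_log /var/log/ceph/radosgw.log
```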
> I checked the log files:
>
> 2013-04-18 14:53:56.532054 7fe5457fb700 0 log [ERR] : 11.0 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
> 2013-04-18 14:53:56.532065 7fe5457fb700 0 log [ERR] : 11.0 deep-scrub 1 errors
> 2013-04-18 14:53:59.532401 7fe5457fb700 0 log [ERR] : 11.8 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
> 2013-04-18 14:53:59.532411 7fe5457fb700 0 log [ERR] : 11.8 deep-scrub 1 errors
> 2013-04-18 14:54:01.532602 7fe5457fb700 0 log [ERR] : 11.10 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
> 2013-04-18 14:54:01.532614 7fe5457fb700 0 log [ERR] : 11.10 deep-scrub 1 errors
> 2013-04-18 14:54:02.532839 7fe5457fb700 0 log [ERR] : 11.18 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
> 2013-04-18 14:54:02.532848 7fe5457fb700 0 log [ERR] : 11.18 deep-scrub 1 errors
> …
> 2013-04-18 14:57:14.554431 7fe5457fb700 0 log [ERR] : 11.1f0 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
> 2013-04-18 14:57:14.554438 7fe5457fb700 0 log [ERR] : 11.1f0 deep-scrub 1 errors
>
> So it looks like something deleted all the objects from those pg directories.
> Next I tried a repair:
>
> [root@ceph-mon1 ~]# ceph pg repair 11.1f0
> instructing pg 11.1f0 on osd.35 to repair
> [root@ceph-mon1 ~]# ceph -w
> …
> 2013-04-18 15:19:23.676728 osd.35 [ERR] 11.1f0 repair stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
> 2013-04-18 15:19:23.676783 osd.35 [ERR] 11.1f0 repair 1 errors, 1 fixed
> [root@ceph-mon1 ~]# ceph pg deep-scrub 11.1f0
> instructing pg 11.1f0 on osd.35 to deep-scrub
> [root@ceph-mon1 ~]# ceph -w
> …
> 2013-04-18 15:20:21.769446 mon.0 [INF] pgmap v31714: 3808 pgs: 3690 active+clean, 118 active+clean+inconsistent; 73284 MB data, 276 GB used, 44389 GB / 44665 GB avail
> 2013-04-18 15:20:17.677058 osd.35 [INF] 11.1f0 deep-scrub ok
>
> So indeed the repair "fixed" the problem (now there are only 118
> inconsistent pgs, down from 119).
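Repairing 118 inconsistent pgs one at a time gets tedious; the pg ids can be pulled out of `ceph health detail` output (the `pg X is active+clean+inconsistent` lines shown above) and fed to `ceph pg repair`. A sketch, assuming that output format; the helper name is mine:

```shell
#!/bin/sh
# Print the id of every inconsistent pg, one per line, given
# `ceph health detail` output on stdin (lines of the form
# "pg 11.1f0 is active+clean+inconsistent, acting [35,28,4]").
list_inconsistent_pgs() {
    awk '/^pg / && /inconsistent/ { print $2 }'
}

# Each pg could then be repaired in turn, e.g.:
# ceph health detail | list_inconsistent_pgs | while read -r pgid; do
#     ceph pg repair "$pgid"
# done
```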
> And note that there is still nothing in the directory for that pg, as
> expected:
>
> [root@lxfsrc4906 ~]# ls -l //var/lib/ceph/osd/ceph-35/current/11.1f0_head/ -a
> total 20
> drwxr-xr-x.   2 root root     6 Apr 16 10:48 .
> drwxr-xr-x. 419 root root 12288 Apr 16 11:15 ..
>
> So my question is: can anyone explain what happened here? It seems
> that something deleted the objects from the .rgw.gc pool (as one would
> expect) but the pgs were left inconsistent afterwards.
>
> Best Regards,
> Dan van der Ster
> CERN IT

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
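For completeness, the empty-directory check done by hand with `ls -l ... -a` above can be scripted when many pg directories on many OSDs need inspecting. A sketch, with the helper name mine and the path pattern taken from the listings above:

```shell
#!/bin/sh
# Succeed (exit 0) only if the given pg head directory contains no
# entries at all, mirroring the manual `ls -l ... -a` check above.
pg_dir_empty() {
    [ -z "$(ls -A "$1")" ]
}

# Usage:
# pg_dir_empty /var/lib/ceph/osd/ceph-35/current/11.1f0_head/ && echo empty
```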