Sorry for the noise... we now have a better idea of what happened here. For those who might care: basically, we had one client looping while trying to list the / bucket with an incorrect key, and rgw was handling this at 1 kHz, so congratulations on that. I will now go and read up on how to either decrease the log level or increase the log rotation frequency.

Thanks again,
Dan
CERN IT
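
In case it is useful to anyone else, this is roughly what I have in mind for ceph.conf to turn the gateway's log level down. It is only a sketch: the [client.radosgw.gateway] section name is an assumption, so use whatever your radosgw instance is actually called in your setup.

[client.radosgw.gateway]
    log file = /var/log/ceph/radosgw.log
    # stop the per-request rgw debug output
    debug rgw = 0
    # and the messenger chatter
    debug ms = 0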
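
On the rotation side, a size trigger in the logrotate config should stop a runaway log from filling the disk between daily runs. Again just a sketch: the 500M threshold is arbitrary, and the postrotate block from the packaged /etc/logrotate.d file should be kept so radosgw reopens its log file after rotation.

/var/log/ceph/radosgw.log {
    # rotate whenever the file grows past 500 MB, not only once per day
    size 500M
    rotate 7
    compress
    missingok
    notifempty
    # keep the existing postrotate/endscript block from the packaged file here
}

Since logrotate itself normally only runs from the daily cron job, this would also need an hourly cron entry (or similar) to actually catch a fast-growing log.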
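
And for the ~118 pgs that are still flagged inconsistent in the thread below, rather than repairing them one by one I am thinking of a loop along these lines. This is a sketch only, assuming `ceph health detail` prints the same "pg ... is active+clean+inconsistent ..." lines shown below; each pg should still get a deep-scrub afterwards to confirm the repair.

# repair every pg currently reported as inconsistent
ceph health detail | awk '$1 == "pg" && /inconsistent/ { print $2 }' |
while read pg; do
    ceph pg repair "$pg"
done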
On Thu, Apr 18, 2013 at 4:09 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> Replying to myself...
> I just noticed this:
>
> [root@ceph-radosgw01 ceph]# ls -lh /var/log/ceph/
> total 27G
> -rw-r--r--. 1 root root 27G Apr 18 16:08 radosgw.log
> -rw-r--r--. 1 root root  20 Apr  5 03:13 radosgw.log-20130405.gz
> -rw-r--r--. 1 root root  20 Apr  6 03:14 radosgw.log-20130406.gz
> -rw-r--r--. 1 root root  20 Apr  7 03:50 radosgw.log-20130407.gz
> -rw-r--r--. 1 root root  20 Apr  8 03:29 radosgw.log-20130408.gz
> -rw-r--r--. 1 root root  20 Apr  9 03:19 radosgw.log-20130409.gz
> -rw-r--r--. 1 root root  20 Apr 10 03:15 radosgw.log-20130410.gz
> -rw-r--r--. 1 root root   0 Apr 11 03:48 radosgw.log-20130411
>
> [root@ceph-radosgw01 ceph]# df -h .
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/mapper/vg1-root   37G   37G     0 100% /
>
> The radosgw log filled up the disk. Perhaps this caused the problem..
>
> Cheers, Dan
> CERN IT
>
> On Thu, Apr 18, 2013 at 3:52 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> tl;dr: something deleted the objects from the .rgw.gc and then the pgs went inconsistent. Is this normal??!!
>>
>> Just now we had scrub errors and resulting inconsistencies on many of the pgs belonging to our .rgw.gc pool.
>>
>> HEALTH_ERR 119 pgs inconsistent; 119 scrub errors
>> pg 11.1f0 is active+clean+inconsistent, acting [35,28,4]
>> pg 11.1f8 is active+clean+inconsistent, acting [35,28,4]
>> pg 11.1fb is active+clean+inconsistent, acting [11,34,38]
>> pg 11.1e0 is active+clean+inconsistent, acting [35,28,4]
>> pg 11.1e3 is active+clean+inconsistent, acting [11,34,38]
>> …
>>
>> [root@ceph-mon1 ~]# ceph osd lspools
>> 0 data,1 metadata,2 rbd,6 volumes,7 images,9 afs,10 .rgw,11 .rgw.gc,12 .rgw.control,13 .users.uid,14 .users.email,15 .users,16 .rgw.buckets,17 .usage,
>>
>> On the relevant hosts, I checked what was in those directories:
>>
>> [root@lxfsrc4906 ~]# ls -l //var/lib/ceph/osd/ceph-35/current/11.1f0_head/ -a
>> total 20
>> drwxr-xr-x.   2 root root     6 Apr 16 10:48 .
>> drwxr-xr-x. 419 root root 12288 Apr 16 11:15 ..
>>
>> They were all empty like that. I checked the log files:
>>
>> 2013-04-18 14:53:56.532054 7fe5457fb700  0 log [ERR] : 11.0 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
>> 2013-04-18 14:53:56.532065 7fe5457fb700  0 log [ERR] : 11.0 deep-scrub 1 errors
>> 2013-04-18 14:53:59.532401 7fe5457fb700  0 log [ERR] : 11.8 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
>> 2013-04-18 14:53:59.532411 7fe5457fb700  0 log [ERR] : 11.8 deep-scrub 1 errors
>> 2013-04-18 14:54:01.532602 7fe5457fb700  0 log [ERR] : 11.10 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
>> 2013-04-18 14:54:01.532614 7fe5457fb700  0 log [ERR] : 11.10 deep-scrub 1 errors
>> 2013-04-18 14:54:02.532839 7fe5457fb700  0 log [ERR] : 11.18 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
>> 2013-04-18 14:54:02.532848 7fe5457fb700  0 log [ERR] : 11.18 deep-scrub 1 errors
>> …
>> 2013-04-18 14:57:14.554431 7fe5457fb700  0 log [ERR] : 11.1f0 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
>> 2013-04-18 14:57:14.554438 7fe5457fb700  0 log [ERR] : 11.1f0 deep-scrub 1 errors
>>
>> So it looks like something deleted all the objects from those pg directories.
>> Next I tried a repair:
>>
>> [root@ceph-mon1 ~]# ceph pg repair 11.1f0
>> instructing pg 11.1f0 on osd.35 to repair
>> [root@ceph-mon1 ~]# ceph -w
>> …
>> 2013-04-18 15:19:23.676728 osd.35 [ERR] 11.1f0 repair stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
>> 2013-04-18 15:19:23.676783 osd.35 [ERR] 11.1f0 repair 1 errors, 1 fixed
>>
>> [root@ceph-mon1 ~]# ceph pg deep-scrub 11.1f0
>> instructing pg 11.1f0 on osd.35 to deep-scrub
>> [root@ceph-mon1 ~]# ceph -w
>> …
>> 2013-04-18 15:20:21.769446 mon.0 [INF] pgmap v31714: 3808 pgs: 3690 active+clean, 118 active+clean+inconsistent; 73284 MB data, 276 GB used, 44389 GB / 44665 GB avail
>> 2013-04-18 15:20:17.677058 osd.35 [INF] 11.1f0 deep-scrub ok
>>
>> So indeed the repair "fixed" the problem (now there are only 118 inconsistent pgs, down from 119). And note that there is still nothing in the directory for that pg, as expected:
>>
>> [root@lxfsrc4906 ~]# ls -l //var/lib/ceph/osd/ceph-35/current/11.1f0_head/ -a
>> total 20
>> drwxr-xr-x.   2 root root     6 Apr 16 10:48 .
>> drwxr-xr-x. 419 root root 12288 Apr 16 11:15 ..
>>
>> So my question is: can anyone explain what happened here? It seems that something deleted the objects from the .rgw.gc pool (as one would expect) but the pgs were left inconsistent afterwards.
>>
>> Best Regards,
>> Dan van der Ster
>> CERN IT

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com