Sorry for the noise... we now have a better idea of what happened here. For those who might care: basically, we had one client looping while trying to list the / bucket with an incorrect key, and rgw was handling this at 1 kHz, so congratulations on that. I will now go and read up on how to either decrease the log level or increase the log rotation frequency.

Thanks again,
Dan
CERN IT
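
In case it is useful to anyone else, this is roughly what I have in mind for ceph.conf to turn the gateway's log level down. It is only a sketch: the [client.radosgw.gateway] section name is an assumption, so use whatever your radosgw instance is actually called in your setup.

[client.radosgw.gateway]
    log file = /var/log/ceph/radosgw.log
    # stop the per-request rgw debug output
    debug rgw = 0
    # and the messenger chatter
    debug ms = 0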
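
On the rotation side, a size trigger in the logrotate config should stop a runaway log from filling the disk between daily runs. Again just a sketch: the 500M threshold is arbitrary, and the postrotate block from the packaged /etc/logrotate.d file should be kept so radosgw reopens its log file after rotation.

/var/log/ceph/radosgw.log {
    # rotate whenever the file grows past 500 MB, not only once per day
    size 500M
    rotate 7
    compress
    missingok
    notifempty
    # keep the existing postrotate/endscript block from the packaged file here
}

Since logrotate itself normally only runs from the daily cron job, this would also need an hourly cron entry (or similar) to actually catch a fast-growing log.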
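
And for the ~118 pgs that are still flagged inconsistent in the thread below, rather than repairing them one by one I am thinking of a loop along these lines. This is a sketch only, assuming `ceph health detail` prints the same "pg ... is active+clean+inconsistent ..." lines shown below; each pg should still get a deep-scrub afterwards to confirm the repair.

# repair every pg currently reported as inconsistent
ceph health detail | awk '$1 == "pg" && /inconsistent/ { print $2 }' |
while read pg; do
    ceph pg repair "$pg"
done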
On Thu, Apr 18, 2013 at 4:09 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> Replying to myself...
> I just noticed this:
>
> [root@ceph-radosgw01 ceph]# ls -lh /var/log/ceph/
> total 27G
> -rw-r--r--. 1 root root 27G Apr 18 16:08 radosgw.log
> -rw-r--r--. 1 root root  20 Apr  5 03:13 radosgw.log-20130405.gz
> -rw-r--r--. 1 root root  20 Apr  6 03:14 radosgw.log-20130406.gz
> -rw-r--r--. 1 root root  20 Apr  7 03:50 radosgw.log-20130407.gz
> -rw-r--r--. 1 root root  20 Apr  8 03:29 radosgw.log-20130408.gz
> -rw-r--r--. 1 root root  20 Apr  9 03:19 radosgw.log-20130409.gz
> -rw-r--r--. 1 root root  20 Apr 10 03:15 radosgw.log-20130410.gz
> -rw-r--r--. 1 root root   0 Apr 11 03:48 radosgw.log-20130411
>
> [root@ceph-radosgw01 ceph]# df -h .
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/mapper/vg1-root   37G   37G     0 100% /
>
> The radosgw log filled up the disk. Perhaps this caused the problem..
>
> Cheers, Dan
> CERN IT
>
> On Thu, Apr 18, 2013 at 3:52 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> tl;dr: something deleted the objects from the .rgw.gc and then the pgs went inconsistent. Is this normal??!!
>>
>> Just now we had scrub errors and resulting inconsistencies on many of the pgs belonging to our .rgw.gc pool.
>>
>> HEALTH_ERR 119 pgs inconsistent; 119 scrub errors
>> pg 11.1f0 is active+clean+inconsistent, acting [35,28,4]
>> pg 11.1f8 is active+clean+inconsistent, acting [35,28,4]
>> pg 11.1fb is active+clean+inconsistent, acting [11,34,38]
>> pg 11.1e0 is active+clean+inconsistent, acting [35,28,4]
>> pg 11.1e3 is active+clean+inconsistent, acting [11,34,38]
>> …
>>
>> [root@ceph-mon1 ~]# ceph osd lspools
>> 0 data,1 metadata,2 rbd,6 volumes,7 images,9 afs,10 .rgw,11 .rgw.gc,12 .rgw.control,13 .users.uid,14 .users.email,15 .users,16 .rgw.buckets,17 .usage,
>>
>> On the relevant hosts, I checked what was in those directories:
>>
>> [root@lxfsrc4906 ~]# ls -l //var/lib/ceph/osd/ceph-35/current/11.1f0_head/ -a
>> total 20
>> drwxr-xr-x.   2 root root     6 Apr 16 10:48 .
>> drwxr-xr-x. 419 root root 12288 Apr 16 11:15 ..
>>
>> They were all empty like that. I checked the log files:
>>
>> 2013-04-18 14:53:56.532054 7fe5457fb700  0 log [ERR] : 11.0 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
>> 2013-04-18 14:53:56.532065 7fe5457fb700  0 log [ERR] : 11.0 deep-scrub 1 errors
>> 2013-04-18 14:53:59.532401 7fe5457fb700  0 log [ERR] : 11.8 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
>> 2013-04-18 14:53:59.532411 7fe5457fb700  0 log [ERR] : 11.8 deep-scrub 1 errors
>> 2013-04-18 14:54:01.532602 7fe5457fb700  0 log [ERR] : 11.10 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
>> 2013-04-18 14:54:01.532614 7fe5457fb700  0 log [ERR] : 11.10 deep-scrub 1 errors
>> 2013-04-18 14:54:02.532839 7fe5457fb700  0 log [ERR] : 11.18 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
>> 2013-04-18 14:54:02.532848 7fe5457fb700  0 log [ERR] : 11.18 deep-scrub 1 errors
>> …
>> 2013-04-18 14:57:14.554431 7fe5457fb700  0 log [ERR] : 11.1f0 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
>> 2013-04-18 14:57:14.554438 7fe5457fb700  0 log [ERR] : 11.1f0 deep-scrub 1 errors
>>
>> So it looks like something deleted all the objects from those pg directories.
>> Next I tried a repair:
>>
>> [root@ceph-mon1 ~]# ceph pg repair 11.1f0
>> instructing pg 11.1f0 on osd.35 to repair
>> [root@ceph-mon1 ~]# ceph -w
>> …
>> 2013-04-18 15:19:23.676728 osd.35 [ERR] 11.1f0 repair stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
>> 2013-04-18 15:19:23.676783 osd.35 [ERR] 11.1f0 repair 1 errors, 1 fixed
>>
>> [root@ceph-mon1 ~]# ceph pg deep-scrub 11.1f0
>> instructing pg 11.1f0 on osd.35 to deep-scrub
>> [root@ceph-mon1 ~]# ceph -w
>> …
>> 2013-04-18 15:20:21.769446 mon.0 [INF] pgmap v31714: 3808 pgs: 3690 active+clean, 118 active+clean+inconsistent; 73284 MB data, 276 GB used, 44389 GB / 44665 GB avail
>> 2013-04-18 15:20:17.677058 osd.35 [INF] 11.1f0 deep-scrub ok
>>
>> So indeed the repair "fixed" the problem (now there are only 118 inconsistent pgs, down from 119). And note that there is still nothing in the directory for that pg, as expected:
>>
>> [root@lxfsrc4906 ~]# ls -l //var/lib/ceph/osd/ceph-35/current/11.1f0_head/ -a
>> total 20
>> drwxr-xr-x.   2 root root     6 Apr 16 10:48 .
>> drwxr-xr-x. 419 root root 12288 Apr 16 11:15 ..
>>
>> So my question is: can anyone explain what happened here? It seems that something deleted the objects from the .rgw.gc pool (as one would expect) but the pgs were left inconsistent afterwards.
>>
>> Best Regards,
>> Dan van der Ster
>> CERN IT

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com