Replying to myself... I just noticed this:

[root@ceph-radosgw01 ceph]# ls -lh /var/log/ceph/
total 27G
-rw-r--r--. 1 root root 27G Apr 18 16:08 radosgw.log
-rw-r--r--. 1 root root  20 Apr  5 03:13 radosgw.log-20130405.gz
-rw-r--r--. 1 root root  20 Apr  6 03:14 radosgw.log-20130406.gz
-rw-r--r--. 1 root root  20 Apr  7 03:50 radosgw.log-20130407.gz
-rw-r--r--. 1 root root  20 Apr  8 03:29 radosgw.log-20130408.gz
-rw-r--r--. 1 root root  20 Apr  9 03:19 radosgw.log-20130409.gz
-rw-r--r--. 1 root root  20 Apr 10 03:15 radosgw.log-20130410.gz
-rw-r--r--. 1 root root   0 Apr 11 03:48 radosgw.log-20130411
[root@ceph-radosgw01 ceph]# df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg1-root   37G   37G     0 100% /

The radosgw log filled up the disk. Perhaps this caused the problem.

Cheers,
Dan
CERN IT

On Thu, Apr 18, 2013 at 3:52 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> Hi,
>
> tl;dr: something deleted the objects from the .rgw.gc pool and then the
> pgs went inconsistent. Is this normal??!!
>
> Just now we had scrub errors and resulting inconsistencies on many of
> the pgs belonging to our .rgw.gc pool.
>
> HEALTH_ERR 119 pgs inconsistent; 119 scrub errors
> pg 11.1f0 is active+clean+inconsistent, acting [35,28,4]
> pg 11.1f8 is active+clean+inconsistent, acting [35,28,4]
> pg 11.1fb is active+clean+inconsistent, acting [11,34,38]
> pg 11.1e0 is active+clean+inconsistent, acting [35,28,4]
> pg 11.1e3 is active+clean+inconsistent, acting [11,34,38]
> …
>
> [root@ceph-mon1 ~]# ceph osd lspools
> 0 data,1 metadata,2 rbd,6 volumes,7 images,9 afs,10 .rgw,11 .rgw.gc,
> 12 .rgw.control,13 .users.uid,14 .users.email,15 .users,
> 16 .rgw.buckets,17 .usage,
>
> On the relevant hosts, I checked what was in those directories:
>
> [root@lxfsrc4906 ~]# ls -l //var/lib/ceph/osd/ceph-35/current/11.1f0_head/ -a
> total 20
> drwxr-xr-x.   2 root root     6 Apr 16 10:48 .
> drwxr-xr-x. 419 root root 12288 Apr 16 11:15 ..
>
> They were all empty like that.
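Coming back to the full disk noted at the top of this message: simply rm'ing radosgw.log would not free any space while radosgw still holds the file open, so truncating in place is the safer way to recover. A minimal sketch (the helper name is mine; the path is the one from the listing above):

```shell
#!/bin/sh
# Truncate a log file in place. Unlike `rm`, this frees the disk space
# even while a daemon (here radosgw) keeps the file descriptor open,
# because the inode is shrunk rather than unlinked.
truncate_log() {
    : > "$1"
}

# Usage (path from the listing above):
# truncate_log /var/log/ceph/radosgw.log
```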
> I checked the log files:
>
> 2013-04-18 14:53:56.532054 7fe5457fb700 0 log [ERR] : 11.0 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
> 2013-04-18 14:53:56.532065 7fe5457fb700 0 log [ERR] : 11.0 deep-scrub 1 errors
> 2013-04-18 14:53:59.532401 7fe5457fb700 0 log [ERR] : 11.8 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
> 2013-04-18 14:53:59.532411 7fe5457fb700 0 log [ERR] : 11.8 deep-scrub 1 errors
> 2013-04-18 14:54:01.532602 7fe5457fb700 0 log [ERR] : 11.10 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
> 2013-04-18 14:54:01.532614 7fe5457fb700 0 log [ERR] : 11.10 deep-scrub 1 errors
> 2013-04-18 14:54:02.532839 7fe5457fb700 0 log [ERR] : 11.18 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
> 2013-04-18 14:54:02.532848 7fe5457fb700 0 log [ERR] : 11.18 deep-scrub 1 errors
> …
> 2013-04-18 14:57:14.554431 7fe5457fb700 0 log [ERR] : 11.1f0 deep-scrub stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
> 2013-04-18 14:57:14.554438 7fe5457fb700 0 log [ERR] : 11.1f0 deep-scrub 1 errors
>
> So it looks like something deleted all the objects from those pg directories.
> Next I tried a repair:
>
> [root@ceph-mon1 ~]# ceph pg repair 11.1f0
> instructing pg 11.1f0 on osd.35 to repair
> [root@ceph-mon1 ~]# ceph -w
> …
> 2013-04-18 15:19:23.676728 osd.35 [ERR] 11.1f0 repair stat mismatch, got 0/3 objects, 0/0 clones, 0/0 bytes.
> 2013-04-18 15:19:23.676783 osd.35 [ERR] 11.1f0 repair 1 errors, 1 fixed
> [root@ceph-mon1 ~]# ceph pg deep-scrub 11.1f0
> instructing pg 11.1f0 on osd.35 to deep-scrub
> [root@ceph-mon1 ~]# ceph -w
> …
> 2013-04-18 15:20:21.769446 mon.0 [INF] pgmap v31714: 3808 pgs: 3690 active+clean, 118 active+clean+inconsistent; 73284 MB data, 276 GB used, 44389 GB / 44665 GB avail
> 2013-04-18 15:20:17.677058 osd.35 [INF] 11.1f0 deep-scrub ok
>
> So indeed the repair "fixed" the problem (now there are only 118
> inconsistent pgs, down from 119).
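Repairing 118 inconsistent pgs one at a time gets tedious; the pg ids can be pulled out of `ceph health detail` output (the `pg X is active+clean+inconsistent` lines shown above) and fed to `ceph pg repair`. A sketch, assuming that output format; the helper name is mine:

```shell
#!/bin/sh
# Print the id of every inconsistent pg, one per line, given
# `ceph health detail` output on stdin (lines of the form
# "pg 11.1f0 is active+clean+inconsistent, acting [35,28,4]").
list_inconsistent_pgs() {
    awk '/^pg / && /inconsistent/ { print $2 }'
}

# Each pg could then be repaired in turn, e.g.:
# ceph health detail | list_inconsistent_pgs | while read -r pgid; do
#     ceph pg repair "$pgid"
# done
```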
> And note that there is still nothing in the directory for that pg, as
> expected:
>
> [root@lxfsrc4906 ~]# ls -l //var/lib/ceph/osd/ceph-35/current/11.1f0_head/ -a
> total 20
> drwxr-xr-x.   2 root root     6 Apr 16 10:48 .
> drwxr-xr-x. 419 root root 12288 Apr 16 11:15 ..
>
> So my question is: can anyone explain what happened here? It seems
> that something deleted the objects from the .rgw.gc pool (as one would
> expect) but the pgs were left inconsistent afterwards.
>
> Best Regards,
> Dan van der Ster
> CERN IT

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
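For completeness, the empty-directory check done by hand with `ls -l ... -a` above can be scripted when many pg directories on many OSDs need inspecting. A sketch, with the helper name mine and the path pattern taken from the listings above:

```shell
#!/bin/sh
# Succeed (exit 0) only if the given pg head directory contains no
# entries at all, mirroring the manual `ls -l ... -a` check above.
pg_dir_empty() {
    [ -z "$(ls -A "$1")" ]
}

# Usage:
# pg_dir_empty /var/lib/ceph/osd/ceph-35/current/11.1f0_head/ && echo empty
```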