Re: Ceph space problem, garbage collector ?

A very simple test on a new pool "ssdtest", with 3 replicas, full SSD
(crush rule 3):

# rbd create ssdtest/test-mysql --size 102400
# rbd map ssdtest/test-mysql
# dd if=/dev/zero of=/dev/rbd/ssdtest/test-mysql bs=4M count=500
# ceph df | grep ssdtest
    ssdtest        10     2000M     0         502     

host1:# du -skc /var/lib/ceph/osd/ceph-*/*/10.* | tail -n1
3135780    total
host2:# du -skc /var/lib/ceph/osd/ceph-*/*/10.* | tail -n1
3028804    total
→ so about 6020MB on disk, which seems correct (and a find reports 739+767
files of 4MB, which is also consistent).
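
Just to make the arithmetic explicit (expected usage = data written × replica count; the du totals above are in kB):

# echo "expected: $((500*4*3)) MB ; on disk: $(( (3135780+3028804) / 1024 )) MB"
expected: 6000 MB ; on disk: 6020 MB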

========

First snapshot :

# rbd snap create ssdtest/test-mysql@s1
# dd if=/dev/zero of=/dev/rbd/ssdtest/test-mysql bs=4M count=250
# ceph df | grep ssdtest
    ssdtest        10     3000M     0         752     
(on both hosts) # du -skc /var/lib/ceph/osd/ceph-*/*/10.* | tail -n1
→ about 9024MB on disk in total, which is correct again.
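
That matches what I expect if every fully rewritten object keeps its old 4MB copy for the snapshot (my understanding of the copy-on-write behaviour): 500 head objects plus 250 clones per replica, times 3 replicas:

# echo "expected: $(( (500+250)*4*3 )) MB"
expected: 9000 MB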

========

Second snapshot :

# rbd snap create ssdtest/test-mysql@s2
Here I write only 4KB into each of 100 different rados objects:
# for I in '' 1 2 3 4 5 6 7 8 9 ; do for J in 0 1 2 3 4 5 6 7 8 9 ; do OFFSET=$I$J ; dd if=/dev/zero of=/dev/rbd/ssdtest/test-mysql bs=1k seek=$((OFFSET*4096)) count=4 ; done ; done
# ceph df | grep ssdtest
    ssdtest        10     3000M     0         852     

Here the "USED" column of "ceph df" is wrong. And on the disk I see
10226kB used.
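
My guess (an assumption on my part) is that each 4KB write after the snapshot forces a full 4MB clone of the object on every replica, which would explain the growth on disk while the "USED" column stays at 3000M:

# echo "extra clone data: ~$(( 100*4*3 )) MB ; new data actually written: $(( 100*4 )) KB"
extra clone data: ~1200 MB ; new data actually written: 400 KB

9024MB + ~1200MB ≈ 10226MB, which is roughly what du reports.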


So, for me the problem comes from "ceph df" (and "rados df"), which do not
correctly report the space used by partially written objects.

Or is this only related to XFS?
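
To see where the space goes, one could compare the "head" files with the snapshot clone files directly on the OSDs (a rough sketch, reusing the same glob as above; I assume the clone files are the ones without "head" in their name):

# find /var/lib/ceph/osd/ceph-*/*/10.* -type f -name '*head*' -print0 | xargs -r -0 du -skc | tail -n1
# find /var/lib/ceph/osd/ceph-*/*/10.* -type f ! -name '*head*' -print0 | xargs -r -0 du -skc | tail -n1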


On Wednesday, September 11, 2013 at 11:00 +0200, Olivier Bonvalet wrote:
> Hi,
> 
> do you need more information about that ?
> 
> thanks,
> Olivier
> 
> On Tuesday, September 10, 2013 at 11:19 -0700, Samuel Just wrote:
> > Can you post the rest of your crush map?
> > -Sam
> > 
> > On Tue, Sep 10, 2013 at 5:52 AM, Olivier Bonvalet <ceph.list@xxxxxxxxx> wrote:
> > > I also checked that all files in that PG still are on that PG :
> > >
> > > for IMG in `find . -type f -printf '%f\n' | awk -F '__' '{ print $1 }' | sort --unique` ; do echo -n "$IMG " ; ceph osd map ssd3copies $IMG | grep -v 6\\.31f ; echo ; done
> > >
> > > And all objects are referenced in rados (compared with "rados --pool
> > > ssd3copies ls rados.ssd3copies.dump").
> > >
> > >
> > >
> > > On Tuesday, September 10, 2013 at 13:46 +0200, Olivier Bonvalet wrote:
> > >> Some additional information: if I look at one PG only, for example
> > >> 6.31f, "ceph pg dump" reports a size of 616GB:
> > >>
> > >> # ceph pg dump | grep ^6\\. | awk '{ SUM+=($6/1024/1024) } END { print SUM }'
> > >> 631717
> > >>
> > >> But on disk, on the 3 replicas I have:
> > >> # du -sh  /var/lib/ceph/osd/ceph-50/current/6.31f_head/
> > >> 1,3G  /var/lib/ceph/osd/ceph-50/current/6.31f_head/
> > >>
> > >> Since I suspected a snapshot problem, I tried to count only the
> > >> "head" files:
> > >> # find /var/lib/ceph/osd/ceph-50/current/6.31f_head/ -type f -name '*head*' -print0 | xargs -r -0 du -hc | tail -n1
> > >> 448M  total
> > >>
> > >> and the content of the directory : http://pastebin.com/u73mTvjs
> > >>
> > >>
> > >> On Tuesday, September 10, 2013 at 10:31 +0200, Olivier Bonvalet wrote:
> > >> > Hi,
> > >> >
> > >> > I have a space problem on a production cluster, as if there is unused
> > >> > data that is not freed: "ceph df" and "rados df" report 613GB of data, and
> > >> > disk usage is 2640GB (with 3 replicas). It should be near 1839GB.
> > >> >
> > >> >
> > >> > I have 5 hosts, 3 with SAS storage and 2 with SSD storage. I use crush
> > >> > rules to put pools on SAS or on SSD.
> > >> >
> > >> > My pools :
> > >> > # ceph osd dump | grep ^pool
> > >> > pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 576 pgp_num 576 last_change 68315 owner 0 crash_replay_interval 45
> > >> > pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 576 pgp_num 576 last_change 68317 owner 0
> > >> > pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 576 pgp_num 576 last_change 68321 owner 0
> > >> > pool 3 'hdd3copies' rep size 3 min_size 1 crush_ruleset 4 object_hash rjenkins pg_num 200 pgp_num 200 last_change 172933 owner 0
> > >> > pool 6 'ssd3copies' rep size 3 min_size 1 crush_ruleset 7 object_hash rjenkins pg_num 800 pgp_num 800 last_change 172929 owner 0
> > >> > pool 9 'sas3copies' rep size 3 min_size 1 crush_ruleset 4 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 172935 owner 0
> > >> >
> > >> > Only hdd3copies, sas3copies and ssd3copies are really used :
> > >> > # ceph df
> > >> > GLOBAL:
> > >> >     SIZE       AVAIL      RAW USED     %RAW USED
> > >> >     76498G     51849G     24648G       32.22
> > >> >
> > >> > POOLS:
> > >> >     NAME           ID     USED      %USED     OBJECTS
> > >> >     data           0      46753     0         72
> > >> >     metadata       1      0         0         0
> > >> >     rbd            2      8         0         1
> > >> >     hdd3copies     3      2724G     3.56      5190954
> > >> >     ssd3copies     6      613G      0.80      347668
> > >> >     sas3copies     9      3692G     4.83      764394
> > >> >
> > >> >
> > >> > My CRUSH rules were:
> > >> >
> > >> > rule SASperHost {
> > >> >     ruleset 4
> > >> >     type replicated
> > >> >     min_size 1
> > >> >     max_size 10
> > >> >     step take SASroot
> > >> >     step chooseleaf firstn 0 type host
> > >> >     step emit
> > >> > }
> > >> >
> > >> > and :
> > >> >
> > >> > rule SSDperOSD {
> > >> >     ruleset 3
> > >> >     type replicated
> > >> >     min_size 1
> > >> >     max_size 10
> > >> >     step take SSDroot
> > >> >     step choose firstn 0 type osd
> > >> >     step emit
> > >> > }
> > >> >
> > >> >
> > >> > but, since the cluster was full because of that space problem, I switched to a different rule:
> > >> >
> > >> > rule SSDperOSDfirst {
> > >> >     ruleset 7
> > >> >     type replicated
> > >> >     min_size 1
> > >> >     max_size 10
> > >> >     step take SSDroot
> > >> >     step choose firstn 1 type osd
> > >> >     step emit
> > >> >     step take SASroot
> > >> >     step chooseleaf firstn -1 type net
> > >> >     step emit
> > >> > }
> > >> >
> > >> >
> > >> > So with that last rule, I should have only one replica on my SSD OSDs, so 613GB of space used. But if I check the OSDs, I see 1212GB really used.
> > >> >
> > >> > I also use snapshots; maybe snapshots are ignored by "ceph df" and "rados df"?
> > >> >
> > >> > Thanks for any help.
> > >> >
> > >> > Olivier
> > >> >

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




