Ceph with Cache pool - disk usage / cleanup

Sascha Vogt <sascha.vogt@xxxxxxxxx> · Wed, 28 Sep 2016 14:08:43 +0200

Hi all,

we currently experience a few "strange" things on our Ceph cluster and I
wanted to ask if anyone has recommendations for further tracking them
down (or maybe even an explanation already ;) )

Ceph version is 0.94.5 and we have a HDD based pool with a cache pool on
NVMe SSDs in front of it.

ceph df detail lists a "used" size on the ssd pool (the cache) of
currently 3815 GB. We have a replication size of 2, so effectively this
should take around 7630 GB on disk. Doing a df on all OSDs and summing
them up gives 8501 GB, which is 871 GB more than expected.
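
As a quick sanity check of that arithmetic, here is a minimal sketch using
only the figures quoted above (the variable names are just for
illustration):

```shell
# Figures from the paragraph above; variable names are illustrative.
pool_used_gb=3815      # "used" for the ssd pool per `ceph df detail`
replica_size=2         # replication size of the cache pool
osd_df_total_gb=8501   # sum of df over all OSDs

expected_gb=$((pool_used_gb * replica_size))
overhead_gb=$((osd_df_total_gb - expected_gb))
echo "expected=${expected_gb} GB, unexplained=${overhead_gb} GB"
```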

Last week the difference was around 840 GB, the week before that around
780 GB. So it looks like the difference is constantly growing.

Running

> for date in `ceph pg dump | grep active | awk '{print $20}'`; do date +%A -d $date; done | sort | uniq -c

returns:

2002 Tuesday
1390 Wednesday

So scrubbing and deep scrubbing are done regularly.

A thing I noticed which might or might not be related is the following:
The pool is used for OpenStack ephemeral disks and I had created a 1 TB
VM (1TB ephemeral, not a cinder volume ;) )

I looked up the RBD device and noted down the block prefix name.

> rbd info ephemeral-vms/0edd1080-9f84-48d2-8714-34b1cd7d50df_disk
> rbd image '0edd1080-9f84-48d2-8714-34b1cd7d50df_disk':
>         size 1024 GB in 262144 objects
>         order 22 (4096 kB objects)
>         block_name_prefix: rbd_data.2c383a0238e1f29
>         format: 2
>         features: layering
>         flags:

After I had deleted the VM I regularly checked the number of objects in
rados via "rados -p ephemeral-vms ls | grep rbd_data.2c383a0238e1f29 |
wc -l"

and it still returns a large number of objects:

> Mon Sep 19 09:10:43 CEST 2016 - 138937
> Tue Sep 20 16:11:55 CEST 2016 - 135818
> Thu Sep 22 09:59:03 CEST 2016 - 135791
> Wed Sep 28 12:15:07 CEST 2016 - 133862
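
A minimal sketch of how such a timestamped count could be logged
repeatedly. The helper names count_prefix and log_count are made up for
this example; the helpers read object names on stdin, so the rados
listing is piped in. Using grep -F treats the prefix as a fixed string
rather than a regular expression.

```shell
# count_prefix PREFIX: read object names on stdin, print how many contain PREFIX.
# (hypothetical helper name; -F matches the prefix literally, dots included)
count_prefix() {
    grep -c -F -- "$1" || true   # grep -c exits non-zero when the count is 0
}

# log_count PREFIX: print "<current date> - <count>" for the names on stdin.
log_count() {
    printf '%s - %s\n' "$(date)" "$(count_prefix "$1")"
}

# Usage on the cluster, e.g. from cron:
#   rados -p ephemeral-vms ls | log_count rbd_data.2c383a0238e1f29
```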

I did a "stat" AND a "rm" on each and every one of those objects, but
they all returned:

>  rados -p ephemeral-vms stat rbd_data.2c383a0238e1f29.000000000000f8b8
>  error stat-ing ephemeral-vms/rbd_data.2c383a0238e1f29.000000000000f8b8: (2) No such file or directory

So why does rados still return those objects via an ls?
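
To tally how many of the listed objects actually fail a stat, a loop
along these lines could be used. This is only a sketch: stat_summary and
the RADOS variable are names invented here, and the only rados call it
relies on is the "rados -p <pool> stat <object>" shown above.

```shell
RADOS=${RADOS:-rados}   # overridable, so the loop can be tried without a cluster

# stat_summary POOL: read object names on stdin, stat each one in POOL,
# and print "present=N missing=M". (hypothetical helper name)
stat_summary() {
    pool=$1
    present=0
    missing=0
    while IFS= read -r obj; do
        # </dev/null keeps the stat call from consuming the object list
        if "$RADOS" -p "$pool" stat "$obj" </dev/null >/dev/null 2>&1; then
            present=$((present + 1))
        else
            missing=$((missing + 1))
        fi
    done
    echo "present=$present missing=$missing"
}

# Usage on the cluster:
#   rados -p ephemeral-vms ls | grep -F rbd_data.2c383a0238e1f29 | stat_summary ephemeral-vms
```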

Even worse, counting the objects on the ssd pool I get:
rados -p ssd ls | grep rbd_data.2c383a0238e1f29 | wc -l
Wed Sep 28 12:54:07 CEST 2016 - 246681

I did a find on one of the OSDs data dir:
> find . -name "*data.2c383a0238e1f29*" | wc -l
> 33060

I checked a few; all of them were 0-byte files

e.g.
> ls -lha ./11.1d_head/DIR_D/DIR_1/DIR_0/DIR_7/DIR_9/rbd\\udata.2c383a0238e1f29.0000000000019bf7__head_87C9701D__b
> -rw-r--r-- 1 root root 0 Sep  9 11:21 ./11.1d_head/DIR_D/DIR_1/DIR_0/DIR_7/DIR_9/rbd\udata.2c383a0238e1f29.0000000000019bf7__head_87C9701D__b

But even a 0-byte file takes some space on the disk. Might those be the
reason for the difference?
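
A back-of-the-envelope sketch for that question. The file count is the
one from the find above (a single OSD); the per-file cost is purely an
assumed placeholder, not something measured on these OSDs, and directory
overhead is ignored.

```shell
# Count from the find on one OSD (see above).
empty_files=33060
# ASSUMPTION: placeholder per-file metadata cost in bytes; check the real
# inode size of the OSD filesystem before trusting the result.
assumed_bytes_per_file=2048

total_bytes=$((empty_files * assumed_bytes_per_file))
echo "~$((total_bytes / 1024 / 1024)) MiB on this OSD under that assumption"
```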

Any feedback welcome.
Greetings
-Sascha-