On 16.04.2018 16:46, Wido den Hollander wrote:
>
> On 04/16/2018 03:04 PM, Rafał Wądołowski wrote:
>> Hi,
>>
>> We're using Ceph as object storage. Several days ago we noticed that
>> listing operations are very slow. The command ceph daemon osd.ID perf dump
>> showed us very high db.slow usage. I aggregated the output from the servers:
>>
>> SUM DB used: 217.29 GiB SUM SLOW used= 1.25 TiB SUM WAL used= 75.14 GiB
>> SUM DB used: 121.91 GiB SUM SLOW used= 1.12 TiB SUM WAL used= 54.18 GiB
>> SUM DB used: 121.84 GiB SUM SLOW used= 1.21 TiB SUM WAL used= 58.72 GiB
>> SUM DB used: 122.43 GiB SUM SLOW used= 1.01 TiB SUM WAL used= 40.67 GiB
>> SUM DB used: 123.22 GiB SUM SLOW used= 1.19 TiB SUM WAL used= 54.62 GiB
>> SUM DB used: 122.43 GiB SUM SLOW used= 1.01 TiB SUM WAL used= 33.62 GiB
>> SUM DB used: 126.79 GiB SUM SLOW used= 1.24 TiB SUM WAL used= 72.45 GiB
>> SUM DB used: 121.30 GiB SUM SLOW used= 1.08 TiB SUM WAL used= 52.59 GiB
>> SUM DB used: 115.57 GiB SUM SLOW used= 1.14 TiB SUM WAL used= 50.37 GiB
>> SUM DB used: 126.06 GiB SUM SLOW used= 1.23 TiB SUM WAL used= 60.08 GiB
>> SUM DB used: 121.28 GiB SUM SLOW used= 1.08 TiB SUM WAL used= 46.64 GiB
>> SUM DB used: 122.54 GiB SUM SLOW used= 1.09 TiB SUM WAL used= 47.87 GiB
>> SUM DB used: 122.04 GiB SUM SLOW used= 1.15 TiB SUM WAL used= 35.18 GiB
>> SUM DB used: 138.03 GiB SUM SLOW used= 1.04 TiB SUM WAL used= 36.01 GiB
>> SUM DB used: 138.72 GiB SUM SLOW used= 1.08 TiB SUM WAL used= 33.95 GiB
>> SUM DB used: 126.25 GiB SUM SLOW used= 1.15 TiB SUM WAL used= 43.55 GiB
>> SUM DB used: 119.74 GiB SUM SLOW used= 1.17 TiB SUM WAL used= 50.96 GiB
>> SUM DB used: 143.98 GiB SUM SLOW used= 1.01 TiB SUM WAL used= 34.37 GiB
>> SUM DB used: 135.29 GiB SUM SLOW used= 1.12 TiB SUM WAL used= 46.46 GiB
>>
>> We have about 500M objects in 75 buckets. I think that this value is
>> too big, am I correct? What data is stored in RocksDB that takes so
>> much space? Are there any parameters or triggers which will lower the used space?
>>
> It is a lot of data, yes, but BlueStore's RocksDB stores the pointers
> for each object, i.e. where an object is located on disk.
>
> Now, from my first tests I saw that an object in BlueStore roughly has
> a 22k overhead.
>
> You have 500M objects, so that means you would have ~10TB of overhead.
>
> Now, I see you use EC and that might complicate things. I haven't tested
> it yet, but your profile seems to be EC 4+2?
>
> My first idea is that you will have 6 chunks, each having ~22k overhead.
>
> 500M * 22k * 6 ≈ 66TB of metadata.
>
> Personally I think it's a lot of overhead, but for now this is what I
> have seen in my tests and experience.

Yes, we have EC 4+2. I am wondering why the overhead is so big... Maybe a
developer could clarify this? It's a lot of space and it generates costs.

>
>> We have 19 nodes + 3 (mgr+rgw+mon) nodes. Each OSD node has 34x8TB drives
>> and 2x480GB NVMe, where each OSD has 20GB for RocksDB and 4GB for WAL.
>> We're using Ceph 12.2.4 installed with ceph-ansible.
>>
> So to double check, you have 646 OSDs in total?

Yes, you're right! Maybe it is a good idea to create 3 new servers with
only SATA SSDs and move the index pool to them?
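If we go that route, I imagine it would look roughly like the sketch below
on Luminous with device classes. This is only a sketch: it assumes the new
OSDs report device class "ssd", and the rule name "rgw-index-ssd" is made up.

    # Create a replicated rule restricted to SSD-class OSDs, then point the
    # RGW index pool at it (pool name taken from the pool dump below).
    ceph osd crush rule create-replicated rgw-index-ssd default host ssd
    ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-ssd

The index pool data would then backfill onto the SSD OSDs once the new rule
is applied.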
>
> Wido
>
>> Our pools configuration:
>>
>> pool 1 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash
>> rjenkins pg_num 8 pgp_num 8 last_change 125197 owner
>> 18446744073709551615 flags hashpspool stripe_width 0 application rgw
>> pool 2 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 125197 owner
>> 18446744073709551615 flags hashpspool stripe_width 0 application rgw
>> pool 3 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 125197 owner
>> 18446744073709551615 flags hashpspool stripe_width 0 application rgw
>> pool 4 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 125197 owner
>> 18446744073709551615 flags hashpspool stripe_width 0 application rgw
>> pool 5 'default.rgw.buckets.data' erasure size 6 min_size 4 crush_rule 1
>> object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 125197 lfor
>> 0/79930 flags hashpspool stripe_width 16384 compression_algorithm snappy
>> compression_mode force application rgw
>> pool 6 'slow_drives' erasure size 6 min_size 4 crush_rule 2 object_hash
>> rjenkins pg_num 2048 pgp_num 2048 last_change 125197 lfor 0/2496 flags
>> hashpspool stripe_width 16384 compression_algorithm snappy
>> compression_mode force application rgw
>> pool 7 'default.rgw.buckets.index' replicated size 3 min_size 2
>> crush_rule 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change
>> 125197 lfor 0/115553 owner 18446744073709551615 flags hashpspool
>> stripe_width 0 application rgw
>> pool 8 'default.rgw.buckets.non-ec' replicated size 3 min_size 2
>> crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 125197
>> flags hashpspool stripe_width 0 application rgw
>>
>> Thank you for your help
>>
>> Cheers,
>> Rafal Wadolowski
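PS: the per-node DB/SLOW/WAL sums earlier in this thread can be reproduced
roughly like the sketch below. This is just an illustration, assuming it is
run on each OSD host, that the admin sockets are in the default location,
and that jq is installed.

    # Sum the BlueFS usage counters of all OSDs on this host, in GiB.
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        ceph --admin-daemon "$sock" perf dump
    done | jq -s '{
        db_used_gib:   ((map(.bluefs.db_used_bytes)   | add) / 1073741824),
        slow_used_gib: ((map(.bluefs.slow_used_bytes) | add) / 1073741824),
        wal_used_gib:  ((map(.bluefs.wal_used_bytes)  | add) / 1073741824)
    }'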