Re: Big usage of db.slow

On 04/16/2018 03:04 PM, Rafał Wądołowski wrote:
> Hi,
> 
> We're using ceph as object storage. Several days ago we noticed that
> listing operation is very slow. Command ceph daemon osd.ID perf dump
> showed us a very big usage of db.slow. I aggregate output from servers:
> 
> SUM DB used: 217.29 GiB SUM SLOW used= 1.25 TiB SUM WAL used= 75.14 GiB
> SUM DB used: 121.91 GiB SUM SLOW used= 1.12 TiB SUM WAL used= 54.18 GiB
> SUM DB used: 121.84 GiB SUM SLOW used= 1.21 TiB SUM WAL used= 58.72 GiB
> SUM DB used: 122.43 GiB SUM SLOW used= 1.01 TiB SUM WAL used= 40.67 GiB
> SUM DB used: 123.22 GiB SUM SLOW used= 1.19 TiB SUM WAL used= 54.62 GiB
> SUM DB used: 122.43 GiB SUM SLOW used= 1.01 TiB SUM WAL used= 33.62 GiB
> SUM DB used: 126.79 GiB SUM SLOW used= 1.24 TiB SUM WAL used= 72.45 GiB
> SUM DB used: 121.30 GiB SUM SLOW used= 1.08 TiB SUM WAL used= 52.59 GiB
> SUM DB used: 115.57 GiB SUM SLOW used= 1.14 TiB SUM WAL used= 50.37 GiB
> SUM DB used: 126.06 GiB SUM SLOW used= 1.23 TiB SUM WAL used= 60.08 GiB
> SUM DB used: 121.28 GiB SUM SLOW used= 1.08 TiB SUM WAL used= 46.64 GiB
> SUM DB used: 122.54 GiB SUM SLOW used= 1.09 TiB SUM WAL used= 47.87 GiB
> SUM DB used: 122.04 GiB SUM SLOW used= 1.15 TiB SUM WAL used= 35.18 GiB
> SUM DB used: 138.03 GiB SUM SLOW used= 1.04 TiB SUM WAL used= 36.01 GiB
> SUM DB used: 138.72 GiB SUM SLOW used= 1.08 TiB SUM WAL used= 33.95 GiB
> SUM DB used: 126.25 GiB SUM SLOW used= 1.15 TiB SUM WAL used= 43.55 GiB
> SUM DB used: 119.74 GiB SUM SLOW used= 1.17 TiB SUM WAL used= 50.96 GiB
> SUM DB used: 143.98 GiB SUM SLOW used= 1.01 TiB SUM WAL used= 34.37 GiB
> SUM DB used: 135.29 GiB SUM SLOW used= 1.12 TiB SUM WAL used= 46.46 GiB
> 
> We have about 500M objects in 75 buckets.  I think that this value is
> too big, am I correct? What data is stored in rocksdb, that takes so
> much space? Is there any parameters, triggers, which will lower used space?
> 

It is a lot of data, yes. BlueStore's RocksDB stores the metadata for
every object, including the pointers to where an object's data is
located on disk.
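
The aggregation described above can be sketched roughly like this: read the bluefs counters from `ceph daemon osd.<id> perf dump` for each OSD and sum them. The counter names below are the Luminous-era bluefs keys; if your build differs, check the actual `perf dump` output first.

```python
# Sketch: sum bluefs usage (DB / slow / WAL) across a set of local OSDs.
# Counter names assume the Luminous-era "bluefs" section of perf dump.
import json
import subprocess

def bluefs_usage(osd_id):
    """Return (db_used, slow_used, wal_used) in bytes for one OSD."""
    out = subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"])
    bluefs = json.loads(out)["bluefs"]
    return (bluefs["db_used_bytes"],
            bluefs["slow_used_bytes"],
            bluefs["wal_used_bytes"])

def sum_usage(per_osd):
    """Aggregate a list of (db, slow, wal) tuples into totals."""
    db, slow, wal = map(sum, zip(*per_osd))
    return db, slow, wal
```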

Now, from my first tests I have seen that an object in BlueStore
carries roughly 22k of metadata overhead.

You have 500M objects, so that alone would be roughly 11 TB (~10 TiB)
of overhead.

Now, I see you use EC and that might complicate things. I haven't tested
it yet, but your profile seems to be EC 4+2?

My first idea is that you will then have 6 chunks per object, each
carrying ~22k of overhead:

500M * 22k * 6 ≈ 66 TB of metadata.

Personally I think it's a lot of overhead, but for now this is what I
have seen in my tests and experience.
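
The arithmetic behind that estimate, as a quick sketch. The ~22k per-object figure is an observed value from my own tests, not a documented constant, and "22k" is treated here as 22,000 bytes:

```python
# Back-of-envelope estimate of BlueStore metadata overhead.
OBJECTS = 500_000_000        # RGW objects reported in the thread
OVERHEAD_PER_CHUNK = 22_000  # bytes per chunk (observed, not guaranteed)
EC_CHUNKS = 6                # EC 4+2 profile: k=4 data + m=2 coding chunks

replicated_estimate = OBJECTS * OVERHEAD_PER_CHUNK
ec_estimate = replicated_estimate * EC_CHUNKS

print(f"single-copy metadata: {replicated_estimate / 1e12:.1f} TB")  # 11.0 TB
print(f"EC 4+2 metadata:      {ec_estimate / 1e12:.1f} TB")          # 66.0 TB
```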

> We have 19 nodes + 3 (mgr+rgw+mon) nodes. Each osd node has 34x8TB drive
> and 2x480GB NVMe, where each osd has 20GB for rocksDB and 4GB for WAL.
> We're using Ceph 12.2.4 installed with ceph-ansible.
> 

So to double check, you have 646 OSDs in total?
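
For reference, the sizing math from the numbers above (values copied from the message). Note that the aggregate fast-DB space is what the per-OSD 20 GB NVMe partitions add up to; anything beyond that ends up in db.slow on the HDDs:

```python
# Sanity-check cluster sizing from the figures quoted in the thread.
OSD_NODES = 19
OSDS_PER_NODE = 34
DB_PARTITION_GB = 20   # NVMe RocksDB partition per OSD
WAL_PARTITION_GB = 4   # NVMe WAL partition per OSD

total_osds = OSD_NODES * OSDS_PER_NODE
total_db_gb = total_osds * DB_PARTITION_GB

print(f"total OSDs:          {total_osds}")                  # 646
print(f"total fast DB space: {total_db_gb / 1024:.1f} TiB")  # 12.6 TiB
```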

Wido

> Our pools configuration:
> 
> pool 1 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash
> rjenkins pg_num 8 pgp_num 8 last_change 125197 owner
> 18446744073709551615 flags hashpspool stripe_width 0 application rgw
> pool 2 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 125197 owner
> 18446744073709551615 flags hashpspool stripe_width 0 application rgw
> pool 3 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 125197 owner
> 18446744073709551615 flags hashpspool stripe_width 0 application rgw
> pool 4 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 125197 owner
> 18446744073709551615 flags hashpspool stripe_width 0 application rgw
> pool 5 'default.rgw.buckets.data' erasure size 6 min_size 4 crush_rule 1
> object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 125197 lfor
> 0/79930 flags hashpspool stripe_width 16384 compression_algorithm snappy
> compression_mode force application rgw
> pool 6 'slow_drives' erasure size 6 min_size 4 crush_rule 2 object_hash
> rjenkins pg_num 2048 pgp_num 2048 last_change 125197 lfor 0/2496 flags
> hashpspool stripe_width 16384 compression_algorithm snappy
> compression_mode force application rgw
> pool 7 'default.rgw.buckets.index' replicated size 3 min_size 2
> crush_rule 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change
> 125197 lfor 0/115553 owner 18446744073709551615 flags hashpspool
> stripe_width 0 application rgw
> pool 8 'default.rgw.buckets.non-ec' replicated size 3 min_size 2
> crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 125197
> flags hashpspool stripe_width 0 application rgw
> 
> 
> Thank you for your help
> 
> Cheers,
> Rafal Wadolowski
> 
> _______________________________________________
> Ceph-large mailing list
> Ceph-large@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com
> 
