On 16.04.2018 16:46, Wido den Hollander wrote:
>
> On 04/16/2018 03:04 PM, Rafał Wądołowski wrote:
>> Hi,
>>
>> We're using Ceph as object storage. Several days ago we noticed that
>> listing operations are very slow. The command ceph daemon osd.ID perf dump
>> showed us very high db.slow usage. I aggregated the output from the servers:
>>
>> SUM DB used: 217.29 GiB SUM SLOW used= 1.25 TiB SUM WAL used= 75.14 GiB
>> SUM DB used: 121.91 GiB SUM SLOW used= 1.12 TiB SUM WAL used= 54.18 GiB
>> SUM DB used: 121.84 GiB SUM SLOW used= 1.21 TiB SUM WAL used= 58.72 GiB
>> SUM DB used: 122.43 GiB SUM SLOW used= 1.01 TiB SUM WAL used= 40.67 GiB
>> SUM DB used: 123.22 GiB SUM SLOW used= 1.19 TiB SUM WAL used= 54.62 GiB
>> SUM DB used: 122.43 GiB SUM SLOW used= 1.01 TiB SUM WAL used= 33.62 GiB
>> SUM DB used: 126.79 GiB SUM SLOW used= 1.24 TiB SUM WAL used= 72.45 GiB
>> SUM DB used: 121.30 GiB SUM SLOW used= 1.08 TiB SUM WAL used= 52.59 GiB
>> SUM DB used: 115.57 GiB SUM SLOW used= 1.14 TiB SUM WAL used= 50.37 GiB
>> SUM DB used: 126.06 GiB SUM SLOW used= 1.23 TiB SUM WAL used= 60.08 GiB
>> SUM DB used: 121.28 GiB SUM SLOW used= 1.08 TiB SUM WAL used= 46.64 GiB
>> SUM DB used: 122.54 GiB SUM SLOW used= 1.09 TiB SUM WAL used= 47.87 GiB
>> SUM DB used: 122.04 GiB SUM SLOW used= 1.15 TiB SUM WAL used= 35.18 GiB
>> SUM DB used: 138.03 GiB SUM SLOW used= 1.04 TiB SUM WAL used= 36.01 GiB
>> SUM DB used: 138.72 GiB SUM SLOW used= 1.08 TiB SUM WAL used= 33.95 GiB
>> SUM DB used: 126.25 GiB SUM SLOW used= 1.15 TiB SUM WAL used= 43.55 GiB
>> SUM DB used: 119.74 GiB SUM SLOW used= 1.17 TiB SUM WAL used= 50.96 GiB
>> SUM DB used: 143.98 GiB SUM SLOW used= 1.01 TiB SUM WAL used= 34.37 GiB
>> SUM DB used: 135.29 GiB SUM SLOW used= 1.12 TiB SUM WAL used= 46.46 GiB
>>
>> We have about 500M objects in 75 buckets. I think that this value is
>> too big, am I correct? What data is stored in RocksDB that takes so
>> much space? Are there any parameters or triggers which will lower the used space?
>>
> It is a lot of data, yes, but BlueStore's RocksDB stores the pointers
> for each object, i.e. where an object is located on disk.
>
> Now, from my first tests I saw that an object in BlueStore roughly has
> a 22k overhead.
>
> You have 500M objects, so that means you would have ~10TB of overhead.
>
> Now, I see you use EC and that might complicate things. I haven't tested
> it yet, but your profile seems to be EC 4+2?
>
> My first idea is that you will have 6 chunks, each having ~22k overhead.
>
> 500M * 22k * 6 ≈ 66TB of metadata.
>
> Personally I think it's a lot of overhead, but for now this is what I
> have seen in my tests and experience.

Yes, we have EC 4+2. I am wondering why the overhead is so big... Maybe a
developer could clarify this? It's a lot of space and it generates costs.

>
>> We have 19 nodes + 3 (mgr+rgw+mon) nodes. Each OSD node has 34x8TB drives
>> and 2x480GB NVMe, where each OSD has 20GB for RocksDB and 4GB for WAL.
>> We're using Ceph 12.2.4 installed with ceph-ansible.
>>
> So to double check, you have 646 OSDs in total?

Yes, you're right! Maybe it is a good idea to create 3 new servers with
only SATA SSDs and move the index pool to them?
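If we go that route, I imagine it would look roughly like the sketch below
on Luminous with device classes. This is only a sketch: it assumes the new
OSDs report device class "ssd", and the rule name "rgw-index-ssd" is made up.

    # Create a replicated rule restricted to SSD-class OSDs, then point the
    # RGW index pool at it (pool name taken from the pool dump below).
    ceph osd crush rule create-replicated rgw-index-ssd default host ssd
    ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-ssd

The index pool data would then backfill onto the SSD OSDs once the new rule
is applied.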
>
> Wido
>
>> Our pools configuration:
>>
>> pool 1 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash
>> rjenkins pg_num 8 pgp_num 8 last_change 125197 owner
>> 18446744073709551615 flags hashpspool stripe_width 0 application rgw
>> pool 2 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 125197 owner
>> 18446744073709551615 flags hashpspool stripe_width 0 application rgw
>> pool 3 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 125197 owner
>> 18446744073709551615 flags hashpspool stripe_width 0 application rgw
>> pool 4 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 125197 owner
>> 18446744073709551615 flags hashpspool stripe_width 0 application rgw
>> pool 5 'default.rgw.buckets.data' erasure size 6 min_size 4 crush_rule 1
>> object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 125197 lfor
>> 0/79930 flags hashpspool stripe_width 16384 compression_algorithm snappy
>> compression_mode force application rgw
>> pool 6 'slow_drives' erasure size 6 min_size 4 crush_rule 2 object_hash
>> rjenkins pg_num 2048 pgp_num 2048 last_change 125197 lfor 0/2496 flags
>> hashpspool stripe_width 16384 compression_algorithm snappy
>> compression_mode force application rgw
>> pool 7 'default.rgw.buckets.index' replicated size 3 min_size 2
>> crush_rule 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change
>> 125197 lfor 0/115553 owner 18446744073709551615 flags hashpspool
>> stripe_width 0 application rgw
>> pool 8 'default.rgw.buckets.non-ec' replicated size 3 min_size 2
>> crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 125197
>> flags hashpspool stripe_width 0 application rgw
>>
>> Thank you for your help
>>
>> Cheers,
>> Rafal Wadolowski
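PS: the per-node DB/SLOW/WAL sums earlier in this thread can be reproduced
roughly like the sketch below. This is just an illustration, assuming it is
run on each OSD host, that the admin sockets are in the default location,
and that jq is installed.

    # Sum the BlueFS usage counters of all OSDs on this host, in GiB.
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        ceph --admin-daemon "$sock" perf dump
    done | jq -s '{
        db_used_gib:   ((map(.bluefs.db_used_bytes)   | add) / 1073741824),
        slow_used_gib: ((map(.bluefs.slow_used_bytes) | add) / 1073741824),
        wal_used_gib:  ((map(.bluefs.wal_used_bytes)  | add) / 1073741824)
    }'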