I am not sure whether this problem has been reported before. We debugged with the RocksDB tools and saw keys being repeated in a pattern; they turned out to be valid keys. Further observation on the RGW nodes showed resharding happening far too frequently on the same buckets, multiple times, without any I/O on them. The later steps followed once we had identified the problem. Our setup has close to a billion objects, and we were bitten badly by dynamic resharding and an aggressive balancer. We later reduced the balancer frequency to once every few hours and slowly recovered the cluster to a normal state.
Don't forget to run compaction, either through the admin socket or by setting a config option to compact on mount and restarting the OSDs. We saw some lookups hit the suicide timeout when reading a few keys, and again while the balancer was rebalancing.
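For the admin socket route, something like the following could be used to script the compaction. This is only a rough sketch, not what we actually ran: it assumes `ceph daemon osd.<id> compact` is available in your release and that the script runs on the host that owns those OSDs (the admin socket is local). The function names and the OSD ids are made up for illustration.

```python
import subprocess
from typing import Iterable, List


def compact_command(osd_id: int) -> List[str]:
    """Build the admin-socket compaction command for one OSD."""
    # Assumption: run on the host where osd.<id> lives, since the
    # admin socket is only reachable locally.
    return ["ceph", "daemon", f"osd.{osd_id}", "compact"]


def compact_osds(osd_ids: Iterable[int], dry_run: bool = True) -> List[List[str]]:
    """Compact OSDs one at a time; with dry_run=True only return the commands."""
    commands = []
    for osd_id in osd_ids:
        cmd = compact_command(osd_id)
        commands.append(cmd)
        if not dry_run:
            # One OSD at a time; check=True stops at the first failure.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical OSD ids; the default dry run only prints the commands.
    for cmd in compact_osds([0, 1, 2]):
        print(" ".join(cmd))
```

Going through the OSDs sequentially is deliberate here: compaction is I/O heavy, so running it on every index OSD at once may add to the latency problem instead of relieving it.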
On Wed, Jan 9, 2019 at 6:14 PM Thomas Bennett <thomas@xxxxxxxxx> wrote:
Awesome, thanks!

Is there an email thread that I can follow somewhere?

Regards,
Tom

On Wed, 09 Jan 2019 at 13:26, Varada Kari (System Engineer) <varadaraja.kari@xxxxxxxxxxxx> wrote:

Hi,

we also faced same situation as index pool getting full in Luminous (12.2.3) release. For us we have enabled bucket resharding which was running in a loop and filled up all the index osds and not deleting old reshard entries.

As a resolution, we have disabled bucket resharding to arrest the problem and compiled latest Luminous code (12.2.11, which not released yet) and deleted all the old reshard entries. New command options were added to delete/purge the old reshard entries. After this step ran manual compaction on all the osds to reduce the read latencies on BlueFS.

Regards,
Varada

On Wed, Jan 9, 2019 at 4:17 PM Thomas Bennett <thomas@xxxxxxxxx> wrote:

Hi Wido,

Thanks for your reply.

> Are you storing a lot of (small) objects in the buckets?

No. All objects are < 4MB, around 10MB.

> How much real data is there in the buckets data pool?

Only 7% used - 0.4 PB.

> With 51 PGs on the NVMe you are on the low side, you will want to have
> this hovering around 150 or even 200 on NVMe drives to get the best
> performance.

Thanks. Do you think this also relates to the large omaps?

Cheers,
Thomas Bennett
SARAO
Science Data Processing
_______________________________________________ Ceph-large mailing list Ceph-large@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com