Hi,

We have a big problem with RGW. I don't know what the initial trigger is, but I have a theory. 2-3 OSDs out of the 78 in the cluster (6480 PGs on the RGW pool) have about 3x higher RAM usage, many more operations in the journal, and much higher latency. When we PUT some objects, in some cases so many operations land on the triple replication of one PG that the three OSDs can't handle the load. The drives behind those OSDs get hammered, with very high iowait and response times. RGW waits for this PG and eventually blocks all other operations once 1024 operations are blocked in its queue. Then the whole cluster has problems and we have an outage.

When RGW blocks operations, there is only one PG with >1000 operations in its queue:

ceph pg map 3.9447554d
osdmap e11404 pg 3.9447554d (3.54d) -> up [53,45,23] acting [53,45,23]

Now, after setting the 0.5 ratio, the PG has migrated, but before it was:

ceph pg map 3.9447554d
osdmap e11404 pg 3.9447554d (3.54d) -> up [71,45,23] acting [71,45,23]

and those three OSDs are the ones with the problems. Behind these OSDs there are only three drives, one drive per OSD, which is why the impact is so big.

What I have done so far: I set a 50% smaller ratio in CRUSH for these OSDs, but the data just moved to other OSDs, and these OSDs now hold only half of their possible capacity. I don't think this will help in the long term, and it's not a real solution.

I have a second cluster, used only for replication, and it shows the same behaviour. The attachments explain everything: every metric on the bad OSD is much higher than on the others. There are 2-3 OSDs with such high counters.

Is this a bug? Or maybe the problem doesn't exist in bobtail? I can't switch to bobtail quickly, so I need some guidance on which way to go.

Best regards,
Slawomir Skowron
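
P.S. For reference, a rough sketch of the commands involved (assuming a standard argonaut-era ceph CLI; osd 71 and the 0.5 value are just the numbers from my case, and the reweight line is only my approximation of the "50% smaller ratio in CRUSH" step above, it could also be done with "ceph osd crush reweight"):

# show which OSDs the slow PG maps to
ceph pg map 3.9447554d
# show details of the health warnings, including slow/blocked requests
ceph health detail
# lower the override weight of the hot OSD to 0.5 so data migrates away
ceph osd reweight 71 0.5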
Attachment: bad_osd.png (PNG image)
Attachment: good_osds.png (PNG image)