Hi,

We are currently facing severe latency issues in our Ceph cluster, particularly affecting read and write operations. At times, write operations stall completely, leading to significant service degradation. Below is a detailed breakdown of the issue, our observations, and the mitigation steps we have taken so far. We would greatly appreciate any insights or suggestions.

Incident Timeline & Observations:
- Users reported extreme slowness in both write and read operations.
- Network issues: packet drops and a high number of TCP retransmits were identified.
- TCP connection delays: many TCP connections were stuck in the TIME_WAIT state.
- Initial suspicions: we first suspected a high volume of list_bucket requests on RGW, but log analysis disproved this.
- Resharding efforts: some high-latency buckets were resharded, but this had minimal impact.
- HTTP 503 errors: indicating request overload and service instability.

Earlier Related Incidents:
- Network congestion: routing issues had previously caused congestion in the Provision network.
- RGW downtime: several RGWs went down and had to be removed from HAProxy.
- NIC troubleshooting: a faulty NIC was replaced, but errors persisted.

Troubleshooting & Mitigation Attempts:

Network & traffic management:
- Investigated logs and blocked unnecessary list_bucket requests.
- Disabled HAProxy to isolate Ceph-internal traffic → Ceph internals worked fine, but external requests still caused issues.

RGW-specific troubleshooting:
- Restarted RGW instances → temporary improvement for about 20 minutes, then the latency returned.
- Found an RGW node with an incomplete configuration → fixed it, but the issue resurfaced.

TCP & network investigations:
- Adjusted tcp_mem settings → no substantial improvement.
- Packet analysis showed a high number of duplicate ACKs (DUP ACKs).
- Observed high TCP retransmit counts and many TIME_WAIT connections (rough measurement sketches for these checks are appended in the P.S. below).

Assistance Requested:
1. Could high TCP retransmits and TIME_WAIT connections indicate a deeper network issue affecting RGW writes?
2. Are there recommended debugging techniques for tracing TCP-related issues in a Ceph RGW environment?
3. Would tweaking Keepalived settings help in case of incorrect VIP failover behavior?
4. Could an underlying Ceph metadata issue be disproportionately affecting write operations?

Any guidance, recommendations, or debugging steps would be immensely helpful. Thank you for your time and support!

Best regards,
Ramin
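
P.S. To make a few of the observations above concrete, here are minimal sketches of the checks involved. They are illustrative only: the file paths are standard Linux /proc interfaces, but the script details, the capture file name, and the VIP address are placeholders rather than our exact production tooling.

First, quantifying the TIME_WAIT buildup and the retransmit rate over time. This samples the kernel counters every 10 seconds; state "06" in /proc/net/tcp is TIME_WAIT, and OutSegs/RetransSegs come from the "Tcp:" lines of /proc/net/snmp.

```python
#!/usr/bin/env python3
"""Sample TIME_WAIT counts and TCP retransmit rate on Linux (sketch)."""
import time

def count_time_wait() -> int:
    """Count sockets in TIME_WAIT across the IPv4 and IPv6 tables."""
    total = 0
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                next(f)  # skip the header row
                # 4th column ("st") is the hex socket state; 06 = TIME_WAIT
                total += sum(1 for line in f if line.split()[3] == "06")
        except FileNotFoundError:
            pass  # e.g. IPv6 disabled
    return total

def tcp_counters() -> dict:
    """Return the kernel's cumulative TCP counters (OutSegs, RetransSegs, ...)."""
    with open("/proc/net/snmp") as f:
        rows = [line.split() for line in f if line.startswith("Tcp:")]
    # First "Tcp:" line holds the field names, second holds the values.
    return dict(zip(rows[0][1:], map(int, rows[1][1:])))

if __name__ == "__main__":
    prev = tcp_counters()
    while True:
        time.sleep(10)
        cur = tcp_counters()
        sent = cur["OutSegs"] - prev["OutSegs"]
        retrans = cur["RetransSegs"] - prev["RetransSegs"]
        pct = 100.0 * retrans / sent if sent else 0.0
        print(f"TIME_WAIT={count_time_wait():6d}  segs_out={sent:8d}  "
              f"retrans={retrans:6d} ({pct:.2f}%)")
        prev = cur
```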
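
Second, a quick way to tell whether the tcp_mem adjustment could matter at all: compare the pages TCP is actually using (the "mem" field on the "TCP:" line of /proc/net/sockstat) against the three tcp_mem watermarks. If usage never gets near the pressure threshold, tcp_mem was probably not the bottleneck.

```python
#!/usr/bin/env python3
"""Compare actual TCP memory usage against the tcp_mem watermarks (sketch)."""
import os

PAGE = os.sysconf("SC_PAGE_SIZE")  # tcp_mem and sockstat both count pages

with open("/proc/sys/net/ipv4/tcp_mem") as f:
    low, pressure, high = (int(x) for x in f.read().split())

with open("/proc/net/sockstat") as f:
    for line in f:
        if line.startswith("TCP:"):
            fields = line.split()
            # format: TCP: inuse N orphan N tw N alloc N mem N
            mem_pages = int(fields[fields.index("mem") + 1])

print(f"tcp_mem watermarks (pages): low={low} pressure={pressure} high={high}")
print(f"TCP pages in use: {mem_pages} "
      f"({mem_pages * PAGE / 2**20:.1f} MiB, "
      f"{100.0 * mem_pages / high:.1f}% of the high watermark)")
```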
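
Third, a rough duplicate-ACK counter for a capture taken on an RGW node (assumes scapy is installed; the pcap name is a placeholder). The heuristic, a payload-less ACK that repeats the previous ACK number on the same flow, is similar in spirit to Wireshark's tcp.analysis.duplicate_ack.

```python
#!/usr/bin/env python3
"""Rough duplicate-ACK counter for a pcap (sketch; pip install scapy)."""
from collections import Counter
from scapy.all import rdpcap, IP, TCP

def dup_acks(pcap_path: str) -> Counter:
    last_ack = {}     # flow 4-tuple -> last ACK number seen
    dups = Counter()  # flow 4-tuple -> duplicate-ACK count
    for pkt in rdpcap(pcap_path):
        if not (IP in pkt and TCP in pkt):
            continue
        ip, tcp = pkt[IP], pkt[TCP]
        # Pure ACKs only: ACK set, no SYN/FIN/RST, no TCP payload.
        if not (tcp.flags & 0x10) or (tcp.flags & 0x07):
            continue
        if ip.len - ip.ihl * 4 - tcp.dataofs * 4 > 0:  # payload bytes
            continue
        flow = (ip.src, tcp.sport, ip.dst, tcp.dport)
        if last_ack.get(flow) == tcp.ack:
            dups[flow] += 1
        last_ack[flow] = tcp.ack
    return dups

if __name__ == "__main__":
    # "rgw.pcap" is a placeholder, e.g. from: tcpdump -w rgw.pcap port 8080
    for flow, n in dup_acks("rgw.pcap").most_common(10):
        print(f"{flow[0]}:{flow[1]} -> {flow[2]}:{flow[3]}  dup ACKs: {n}")
```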
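
Finally, on the Keepalived question: a small probe that could run on each HAProxy/Keepalived node to log VIP ownership transitions. Binding to an address succeeds only on the node that currently holds it (assuming net.ipv4.ip_nonlocal_bind=0), so frequent flaps in these logs would point at failover misbehavior rather than at Ceph itself. The VIP below is a placeholder.

```python
#!/usr/bin/env python3
"""Log when this node gains or loses the Keepalived VIP (sketch)."""
import datetime
import socket
import time

VIP = "192.0.2.10"  # placeholder: replace with the actual Keepalived VIP

def holds_vip() -> bool:
    """Binding succeeds only where the VIP is locally configured
    (assuming net.ipv4.ip_nonlocal_bind=0)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((VIP, 0))
        return True
    except OSError:
        return False
    finally:
        s.close()

if __name__ == "__main__":
    state = None
    while True:
        now = holds_vip()
        if now != state:  # log only the transitions
            ts = datetime.datetime.now().isoformat(timespec="seconds")
            print(f"{ts}  VIP {'acquired by' if now else 'lost from'} this node")
            state = now
        time.sleep(1)
```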