Hi,

We are currently facing severe latency issues in our Ceph cluster, particularly affecting read and write operations. At times, write operations stall completely, leading to significant service degradation. Below is a detailed breakdown of the issue, our observations, and the mitigation steps we have taken so far. We would greatly appreciate any insights or suggestions.

Incident Timeline & Observations:
- Users reported extreme slowness in both write and read operations.
- Network issues: packet drops and a high number of TCP retransmits were identified.
- TCP connection delays: many TCP connections were stuck in the TIME_WAIT state.
- Initial suspicions: we first suspected a high volume of list_bucket requests on RGW, but log analysis disproved this.
- Resharding efforts: some high-latency buckets were resharded, but this had minimal impact.
- HTTP 503 errors: indicating request overload and service instability.

Earlier Related Incidents:
- Network congestion: routing issues had previously caused congestion in the Provision network.
- RGW downtime: several RGWs went down and had to be removed from HAProxy.
- NIC troubleshooting: a faulty NIC was replaced, but errors persisted.

Troubleshooting & Mitigation Attempts:

Network & traffic management:
- Investigated logs and blocked unnecessary list_bucket requests.
- Disabled HAProxy to isolate Ceph-internal traffic → Ceph internals worked fine, but external requests still caused issues.

RGW-specific troubleshooting:
- Restarted RGW instances → temporary improvement for about 20 minutes, then the latency returned.
- Found an RGW node with an incomplete configuration → fixed it, but the issue resurfaced.

TCP & network investigations:
- Adjusted tcp_mem settings → no substantial improvement.
- Packet analysis showed a high number of duplicate ACKs (DUP ACKs).
- Observed high TCP retransmit counts and many TIME_WAIT connections (rough measurement sketches for these checks are appended in the P.S. below).

Assistance Requested:
1. Could high TCP retransmits and TIME_WAIT connections indicate a deeper network issue affecting RGW writes?
2. Are there recommended debugging techniques for tracing TCP-related issues in a Ceph RGW environment?
3. Would tweaking Keepalived settings help in case of incorrect VIP failover behavior?
4. Could an underlying Ceph metadata issue be disproportionately affecting write operations?

Any guidance, recommendations, or debugging steps would be immensely helpful. Thank you for your time and support!

Best regards,
Ramin
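
P.S. To make a few of the observations above concrete, here are minimal sketches of the checks involved. They are illustrative only: the file paths are standard Linux /proc interfaces, but the script details, the capture file name, and the VIP address are placeholders rather than our exact production tooling.

First, quantifying the TIME_WAIT buildup and the retransmit rate over time. This samples the kernel counters every 10 seconds; state "06" in /proc/net/tcp is TIME_WAIT, and OutSegs/RetransSegs come from the "Tcp:" lines of /proc/net/snmp.

```python
#!/usr/bin/env python3
"""Sample TIME_WAIT counts and TCP retransmit rate on Linux (sketch)."""
import time

def count_time_wait() -> int:
    """Count sockets in TIME_WAIT across the IPv4 and IPv6 tables."""
    total = 0
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                next(f)  # skip the header row
                # 4th column ("st") is the hex socket state; 06 = TIME_WAIT
                total += sum(1 for line in f if line.split()[3] == "06")
        except FileNotFoundError:
            pass  # e.g. IPv6 disabled
    return total

def tcp_counters() -> dict:
    """Return the kernel's cumulative TCP counters (OutSegs, RetransSegs, ...)."""
    with open("/proc/net/snmp") as f:
        rows = [line.split() for line in f if line.startswith("Tcp:")]
    # First "Tcp:" line holds the field names, second holds the values.
    return dict(zip(rows[0][1:], map(int, rows[1][1:])))

if __name__ == "__main__":
    prev = tcp_counters()
    while True:
        time.sleep(10)
        cur = tcp_counters()
        sent = cur["OutSegs"] - prev["OutSegs"]
        retrans = cur["RetransSegs"] - prev["RetransSegs"]
        pct = 100.0 * retrans / sent if sent else 0.0
        print(f"TIME_WAIT={count_time_wait():6d}  segs_out={sent:8d}  "
              f"retrans={retrans:6d} ({pct:.2f}%)")
        prev = cur
```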
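
Second, a quick way to tell whether the tcp_mem adjustment could matter at all: compare the pages TCP is actually using (the "mem" field on the "TCP:" line of /proc/net/sockstat) against the three tcp_mem watermarks. If usage never gets near the pressure threshold, tcp_mem was probably not the bottleneck.

```python
#!/usr/bin/env python3
"""Compare actual TCP memory usage against the tcp_mem watermarks (sketch)."""
import os

PAGE = os.sysconf("SC_PAGE_SIZE")  # tcp_mem and sockstat both count pages

with open("/proc/sys/net/ipv4/tcp_mem") as f:
    low, pressure, high = (int(x) for x in f.read().split())

with open("/proc/net/sockstat") as f:
    for line in f:
        if line.startswith("TCP:"):
            fields = line.split()
            # format: TCP: inuse N orphan N tw N alloc N mem N
            mem_pages = int(fields[fields.index("mem") + 1])

print(f"tcp_mem watermarks (pages): low={low} pressure={pressure} high={high}")
print(f"TCP pages in use: {mem_pages} "
      f"({mem_pages * PAGE / 2**20:.1f} MiB, "
      f"{100.0 * mem_pages / high:.1f}% of the high watermark)")
```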
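
Third, a rough duplicate-ACK counter for a capture taken on an RGW node (assumes scapy is installed; the pcap name is a placeholder). The heuristic, a payload-less ACK that repeats the previous ACK number on the same flow, is similar in spirit to Wireshark's tcp.analysis.duplicate_ack.

```python
#!/usr/bin/env python3
"""Rough duplicate-ACK counter for a pcap (sketch; pip install scapy)."""
from collections import Counter
from scapy.all import rdpcap, IP, TCP

def dup_acks(pcap_path: str) -> Counter:
    last_ack = {}     # flow 4-tuple -> last ACK number seen
    dups = Counter()  # flow 4-tuple -> duplicate-ACK count
    for pkt in rdpcap(pcap_path):
        if not (IP in pkt and TCP in pkt):
            continue
        ip, tcp = pkt[IP], pkt[TCP]
        # Pure ACKs only: ACK set, no SYN/FIN/RST, no TCP payload.
        if not (tcp.flags & 0x10) or (tcp.flags & 0x07):
            continue
        if ip.len - ip.ihl * 4 - tcp.dataofs * 4 > 0:  # payload bytes
            continue
        flow = (ip.src, tcp.sport, ip.dst, tcp.dport)
        if last_ack.get(flow) == tcp.ack:
            dups[flow] += 1
        last_ack[flow] = tcp.ack
    return dups

if __name__ == "__main__":
    # "rgw.pcap" is a placeholder, e.g. from: tcpdump -w rgw.pcap port 8080
    for flow, n in dup_acks("rgw.pcap").most_common(10):
        print(f"{flow[0]}:{flow[1]} -> {flow[2]}:{flow[3]}  dup ACKs: {n}")
```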
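
Finally, on the Keepalived question: a small probe that could run on each HAProxy/Keepalived node to log VIP ownership transitions. Binding to an address succeeds only on the node that currently holds it (assuming net.ipv4.ip_nonlocal_bind=0), so frequent flaps in these logs would point at failover misbehavior rather than at Ceph itself. The VIP below is a placeholder.

```python
#!/usr/bin/env python3
"""Log when this node gains or loses the Keepalived VIP (sketch)."""
import datetime
import socket
import time

VIP = "192.0.2.10"  # placeholder: replace with the actual Keepalived VIP

def holds_vip() -> bool:
    """Binding succeeds only where the VIP is locally configured
    (assuming net.ipv4.ip_nonlocal_bind=0)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((VIP, 0))
        return True
    except OSError:
        return False
    finally:
        s.close()

if __name__ == "__main__":
    state = None
    while True:
        now = holds_vip()
        if now != state:  # log only the transitions
            ts = datetime.datetime.now().isoformat(timespec="seconds")
            print(f"{ts}  VIP {'acquired by' if now else 'lost from'} this node")
            state = now
        time.sleep(1)
```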