Re: Severe Latency Issues in Ceph Cluster

Ramin Najjarbashi <ramin.najarbashi@xxxxxxxxx> · Mon, 3 Mar 2025 21:11:13 +0330

The Ceph version is 17.2.7.

• OSDs are a mix of SSD and HDD, with DB/WAL colocated on the same device as each OSD's data.

• SSDs are used for metadata and index pools with replication 3.

• HDDs store the data pool using EC 4+2.

Interestingly, the same issue has appeared on another cluster where DB/WAL
is placed on NVMe disks, but the pool distribution is the same: meta and
index on SSDs, and data on HDDs.

The problem seems to be network-related. I’ve checked the interfaces and
found no obvious hardware or connectivity issues, yet we’re still seeing
a high number of retransmissions and duplicate packets on the network.
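
For anyone who wants to put a number on "a high number of retransmissions", here is a minimal sketch, assuming Linux hosts where the kernel exposes TCP counters in /proc/net/snmp (the OutSegs and RetransSegs fields). The function names are hypothetical helpers written for illustration; they are not part of Ceph or any existing tool, and the result only approximates the host-wide retransmission ratio over the sampling interval.

```python
import time


def parse_tcp_counters(snmp_text):
    """Parse the 'Tcp:' header/value line pair from /proc/net/snmp content."""
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp header/value pair found")
    # First 'Tcp:' line holds field names, second holds the values.
    header, values = tcp_lines[0][1:], tcp_lines[1][1:]
    return dict(zip(header, (int(v) for v in values)))


def retrans_ratio(before, after):
    """Fraction of TCP segments sent between two samples that were retransmits."""
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    return retrans / sent if sent > 0 else 0.0


def measure(interval_s=10.0, path="/proc/net/snmp"):
    """Sample the counters twice, interval_s apart, and return the ratio."""
    with open(path) as f:
        before = parse_tcp_counters(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_tcp_counters(f.read())
    return retrans_ratio(before, after)
```

Calling measure() on an OSD host while the latency is occurring, and again when the cluster is quiet, would give two ratios to compare. The counters are host-wide, so they cannot tell which peer or interface the retransmissions belong to.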

Let me know if you have any insights or suggestions.

On Mon, Mar 3, 2025 at 12:36 Stefan Kooman <stefan@xxxxxx> wrote:

> On 01-03-2025 15:10, Ramin Najjarbashi wrote:
> > Hi
> > We are currently facing severe latency issues in our Ceph cluster,
> > particularly affecting read and write operations. At times, write
> > operations completely stall, leading to significant service degradation.
> > Below is a detailed breakdown of the issue, our observations, and the
> > mitigation steps we have taken so far. We would greatly appreciate any
> > insights or suggestions.
>
> What ceph version?
>
> How are OSDs provisioned (WAL+DB, single OSD, etc.). Type of disks.
>
> Gr. Stefan
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx