Re: octopus (15.2.16) OSDs crash or don't answer heartbeats (and get marked as down)


 



Can you see anything in the pg dump, like ops waiting for reading or something like that? How much time does it spend in each step?
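One way to check the per-step timings (a sketch, assuming access to the OSD admin socket on the host where the OSD runs):

  # each op is listed with its events and timestamps,
  # so you can see how long it spent in each step
  ceph daemon osd.NN dump_ops_in_flight
  ceph daemon osd.NN dump_historic_ops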

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

From: Boris Behrens <bb@xxxxxxxxx>
Sent: Wednesday, March 23, 2022 1:29 PM
To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
Cc: ceph-users@xxxxxxx
Subject: Re: Re: octopus (15.2.16) OSDs crash or don't answer heartbeats (and get marked as down)

Good morning Istvan,
those are rotating disks, and we don't use EC. Splitting the 16TB disks into two 8TB partitions and running two OSDs on one disk also sounds interesting, but would it solve the problem?

I also thought about raising the PG count for the data pool from 4096 to 8192, but I am not sure whether this would solve the problem or make it worse.
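If I go that route, the change itself would be roughly this (a sketch; the pool name is a placeholder, and since Nautilus the mons ramp pgp_num up gradually on their own):

  ceph osd pool get <data-pool> pg_num
  ceph osd pool set <data-pool> pg_num 8192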

So far, nothing I've tried has worked.

On Wed, 23 Mar 2022 at 05:10, Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx> wrote:
Hi,

I think you are having a similar issue to the one I had in the past.

I have 1.6B objects on a cluster (40k average), and all my OSDs had spilled over.

Also slow ops, OSDs wrongly marked down…

My OSDs are 15.3TB SSDs, so my solution was to store block+db together on the SSDs, put 4 OSDs per SSD, and go up to 100 PGs per OSD, so one disk holds approximately 400 PGs.
Also turned on the balancer with upmap and max deviation 1.
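For reference, that setup in commands looks roughly like this (a sketch; the device path is a placeholder, and ceph-volume wipes the device it deploys to):

  # 4 OSDs per SSD, block and block.db colocated
  ceph-volume lvm batch --osds-per-device 4 /dev/sdX

  # balancer in upmap mode with max deviation 1
  ceph balancer mode upmap
  ceph config set mgr mgr/balancer/upmap_max_deviation 1
  ceph balancer on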

I’m using EC 4:2, let’s see how long it lasts. My bottleneck is always the PG count: too few PGs for too many objects.

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------


On 22 Mar 2022, at 23:34, Boris Behrens <bb@xxxxxxxxx> wrote:

The figure of 180 PGs comes from the 16TB disks. 3/4 of our OSDs have cache
SSDs (not NVMe though, and most of them share one SSD among 10 OSDs), but this
problem only appeared with Octopus.

We also thought this might be the DB compaction, but it doesn't quite match up.
It might happen when the compaction runs, but it also looks like it happens
during other operations like table_file_deletion, and it happens on OSDs
that have SSD-backed block.db devices (e.g. 5 OSDs share one SAMSUNG
MZ7KM1T9HAJM-00005, and the IOPS/throughput on that SSD is not huge:
100 r/s, 300 w/s when compacting an OSD on it, and around 50MB/s r/w
throughput).
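To watch the SSD while an OSD on it compacts, something like this works (a sketch, assuming the sysstat package is installed; /dev/sdX is a placeholder for the shared SSD):

  # extended device stats in MB, refreshed every second
  iostat -xm 1 /dev/sdX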

I also cannot reproduce it via "ceph tell osd.NN compact", so I am not
100% sure it is the compaction.

What do you mean by "grep for latency strings"?

Cheers
Boris

On Tue, 22 Mar 2022 at 15:53, Konstantin Shalygin <k0ste@xxxxxxxx> wrote:


180 PGs per OSD is usually overhead; also, 40k objects per PG is not much, but I
don't think this will work without block.db on NVMe. I think your "wrong out
marks" occur at the time of RocksDB compaction. With the default log settings you
can try to grep for 'latency' strings
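For example (a sketch, assuming the default log location on the OSD host):

  grep -i latency /var/log/ceph/ceph-osd.*.log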

Also, https://tracker.ceph.com/issues/50297


k
Sent from my iPhone

On 22 Mar 2022, at 14:29, Boris Behrens <bb@xxxxxxxxx> wrote:

* the 8TB disks hold around 80-90 PGs (the 16TB ones around 160-180)
* per PG we have around 40k objects; 170M objects in 1.2PiB of storage overall
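Both figures can be read off the cluster, e.g. (a sketch):

  ceph osd df tree   # the PGS column shows PGs per OSD
  ceph df detail     # objects and bytes stored per pool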



--
The self-help group "UTF-8 problems" will meet in the large hall this time, as an exception.



--
The self-help group "UTF-8 problems" will meet in the large hall this time, as an exception.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



