Re: octopus (15.2.16) OSDs crash or don't answer heartbeats (and get marked as down)

Hi,

I think you are having a similar issue to the one I had in the past.

I have 1.6B objects on a cluster, averaging 40k, and all my OSDs had spilled over.
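(For anyone checking for the same condition: DB spillover onto the slow device shows up as a BLUEFS_SPILLOVER health warning, e.g. something like:

  ceph health detail | grep -i spillover
)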

Also slow ops and OSDs wrongly marked down…

My OSDs are 15.3TB SSDs, so my solution was to store block+db together on the SSDs, put 4 OSDs per SSD, and go up to 100 PGs/OSD, so one disk holds approximately 400 PGs.
I also turned on the balancer with upmap and max deviation 1.
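For reference, roughly what that looks like in commands (the device path is just an example):

  ceph-volume lvm batch --osds-per-device 4 /dev/sdX   # split each SSD into 4 OSDs
  ceph balancer mode upmap
  ceph config set mgr mgr/balancer/upmap_max_deviation 1
  ceph balancer on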

I’m using EC 4:2; let’s see how long it lasts. My bottleneck is always the PG count: too few PGs for too many objects.
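If the PG count has to go up again, it’s just something like this (pool name and target value are examples):

  ceph osd pool set <data-pool> pg_num 4096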

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

On 22 Mar 2022, at 23:34, Boris Behrens <bb@xxxxxxxxx> wrote:


The number of 180 PGs is because of the 16TB disks. 3/4 of our OSDs had cache
SSDs (not NVMe though, and most of them share one SSD across 10 OSDs), but this
problem only came in with Octopus.

We also thought this might be the db compaction, but it doesn't quite match up.
It might happen when the compaction runs, but it also looks like it happens
when there are other operations like table_file_deletion, and it happens on
OSDs that have SSD-backed block.db devices (e.g. 5 OSDs share one SAMSUNG
MZ7KM1T9HAJM-00005, and the IOPS/throughput on the SSD is not huge: 100 IOPS
r/s, 300 IOPS w/s when compacting an OSD on it, and around 50 MB/s r/w throughput).

I also cannot reproduce it via "ceph tell osd.NN compact", so I am not
100% sure it is the compaction.
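Roughly what I did to test (OSD id and device name are just examples) was to trigger a compaction on one OSD and watch the shared block.db SSD in parallel:

  ceph tell osd.12 compact
  iostat -x sdX 1   # watch the block.db SSD while the compaction runs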

What do you mean by "grep for the latency string"?

Cheers
Boris

On Tue, 22 Mar 2022 at 15:53, Konstantin Shalygin <k0ste@xxxxxxxx> wrote:

180 PGs per OSD is usually too much overhead; also, 40k objects per PG is not
much, but I don't think this will work without a block.db on NVMe. I think your
"wrong out marks" show up at the time of RocksDB compaction. With default log
settings you can try to grep for 'latency' strings.
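For example, something like this (default log location):

  grep -i 'latency' /var/log/ceph/ceph-osd.*.log | tail -n 50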

Also, https://tracker.ceph.com/issues/50297


k
Sent from my iPhone

On 22 Mar 2022, at 14:29, Boris Behrens <bb@xxxxxxxxx> wrote:

* the 8TB disks hold around 80-90 PGs (the 16TB ones around 160-180)
* per PG we have around 40k objects; 170M objects in 1.2 PiB of storage



--
This time the self-help group "UTF-8 problems" will, as an exception, meet in
the big hall.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



