You mean in the OSD logfiles?

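I don't think a plain pg dump breaks a slow request down per step, but the
per-op dumps from the OSD admin socket do show the time spent between the
individual events. This is what I am looking at on the affected OSDs; osd.12
is just an example ID, and the commands have to be run on the node that
hosts that OSD:

  # ops that are currently stuck, including their event history
  ceph daemon osd.12 dump_ops_in_flight

  # recently finished slow ops, with timestamps for each step
  ceph daemon osd.12 dump_historic_ops
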
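Regarding Konstantin's 'latency' hint further down: this is how I read it,
assuming the default (non-containerized) log location and again osd.12 as an
example:

  # look for latency warnings and line them up with RocksDB compactions
  grep -i 'latency' /var/log/ceph/ceph-osd.12.log
  grep -i 'compaction' /var/log/ceph/ceph-osd.12.log

  # trigger a manual compaction for comparison (this is what I already tried)
  ceph tell osd.12 compact
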
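And for the changes discussed further down in the thread, this is the rough
plan I have in mind but have not run yet. Pool name and device are
placeholders, so please double-check before copying anything:

  # raise the PG count of the data pool from 4096 to 8192
  # (on octopus, pgp_num should follow along automatically)
  ceph osd pool set <data-pool> pg_num 8192

  # balancer with upmap and max deviation 1, as Istvan suggested
  ceph balancer mode upmap
  ceph config set mgr mgr/balancer/upmap_max_deviation 1
  ceph balancer on

  # two OSDs per 16TB spinner would mean redeploying the disk, e.g.
  ceph-volume lvm batch --osds-per-device 2 /dev/sdX
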
On Wed, 23 Mar 2022 at 08:23, Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx> wrote:

> Can you see anything in the pg dump, like waiting for read or something
> like that? How much time does it spend in each step?
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---------------------------------------------------
> Agoda Services Co., Ltd.
> e: istvan.szabo@xxxxxxxxx
> ---------------------------------------------------
>
> *From:* Boris Behrens <bb@xxxxxxxxx>
> *Sent:* Wednesday, March 23, 2022 1:29 PM
> *To:* Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
> *Cc:* ceph-users@xxxxxxx
> *Subject:* Re: Re: octopus (15.2.16) OSDs crash or don't answer
> heartbeats (and get marked as down)
>
> Good morning Istvan,
>
> those are rotating disks and we don't use EC. Splitting the 16TB disks
> into two 8TB partitions and running two OSDs per disk also sounds
> interesting, but would it solve the problem?
>
> I also thought about raising the PGs for the data pool from 4096 to 8192,
> but I am not sure whether that would solve the problem or make it worse.
>
> So far, nothing I have tried has worked.
>
> On Wed, 23 Mar 2022 at 05:10, Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx> wrote:
>
> Hi,
>
> I think you are having a similar issue to the one I had in the past.
>
> I have 1.6B objects on a cluster (average object size around 40k) and all
> my OSDs had spilled over.
>
> Also slow ops, OSDs wrongly marked down…
>
> My OSDs are 15.3TB SSDs, so my solution was to store block and block.db
> together on the SSDs, put 4 OSDs per SSD and go up to 100 PGs per OSD, so
> one disk holds roughly 400 PGs.
>
> I also turned on the balancer with upmap and max deviation 1.
>
> I'm using EC 4:2, let's see how long it lasts. My bottleneck is always
> the PG count: too few PGs for too many objects.
>
> On 2022. Mar 22., at 23:34, Boris Behrens <bb@xxxxxxxxx> wrote:
>
> The 180 PGs are because of the 16TB disks. 3/4 of our OSDs had cache SSDs
> (not NVMe though, and most of them share one SSD across 10 OSDs), but
> this problem only came in with octopus.
>
> We also thought this might be the db compaction, but it doesn't match up.
> It might happen while a compaction runs, but it also looks like it
> happens when there are other operations like table_file_deletion, and it
> happens on OSDs that have SSD-backed block.db devices (e.g. 5 OSDs share
> one SAMSUNG MZ7KM1T9HAJM-00005, and the IOPS/throughput on that SSD is
> not huge: around 100 r/s and 300 w/s while an OSD on it compacts, and
> around 50 MB/s r/w throughput).
>
> I also cannot reproduce it via "ceph tell osd.NN compact", so I am not
> 100% sure it is the compaction.
>
> What do you mean by "grep for the latency string"?
>
> Cheers
> Boris
>
> On Tue, 22 Mar 2022 at 15:53, Konstantin Shalygin <k0ste@xxxxxxxx> wrote:
>
> > 180 PGs per OSD is usually overhead, and 40k objects per PG is not
> > much, but I don't think this will work without a block.db on NVMe. I
> > think your wrong "out" marks happen at the time of a RocksDB
> > compaction. With the default log settings you can try to grep for
> > 'latency' strings.
> >
> > Also, https://tracker.ceph.com/issues/50297
> >
> > k
> > Sent from my iPhone
> >
> > On 22 Mar 2022, at 14:29, Boris Behrens <bb@xxxxxxxxx> wrote:
> >
> > * the 8TB disks hold around 80-90 PGs (the 16TB ones around 160-180)
> > * per PG we have around 40k objects (170M objects in 1.2 PiB of storage)

-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx