Re: octopus (15.2.16) OSDs crash or don't answer heartbeats (and get marked as down)

I would not host multiple OSDs on a spinning drive (unless it's one of those
Seagate MACH.2 drives that have two independent heads) - head seek time
will most likely kill performance. The main reason to host multiple OSDs on
a single SSD or NVMe is typically to make use of the large IOPS capacity,
which Ceph can struggle to fully utilize on a single drive. With spinners
you usually don't have that "problem" (quite the opposite, usually).
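
For what it's worth, if you do go the multiple-OSDs-per-flash-device route,
a minimal sketch with ceph-volume would be something like this (device path
is just an example, adjust to your hardware):

    # carve one NVMe device into 4 OSDs
    ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1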

On Wed, 23 Mar 2022 at 19:29, Boris Behrens <bb@xxxxxxxxx> wrote:

> Good morning Istvan,
> those are rotating disks and we don't use EC. Splitting up the 16TB disks
> into two 8TB partitions and having two OSDs on one disk also sounds
> interesting, but would it solve the problem?
>
> I also thought about adjusting the PG count for the data pool from 4096 to
> 8192, but I am not sure whether this will solve the problem or make it worse.
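>
> For reference, the split itself would just be something along these lines
> (pool name is a placeholder; since Nautilus, pgp_num follows pg_num
> automatically, so the split proceeds gradually):
>
>     ceph osd pool set data-pool pg_num 8192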
>
> So far, nothing I've tried has worked.
>
> On Wed, 23 Mar 2022 at 05:10, Szabo, Istvan (Agoda) <
> Istvan.Szabo@xxxxxxxxx> wrote:
>
> > Hi,
> >
> > I think you are having similar issue as me in the past.
> >
> > I have 1.6B objects on a cluster, 40k average, and all my OSDs had
> > spilled over.
> >
> > Also slow ops, wrongly marked down…
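> >
> > (Spillover shows up in "ceph health detail" as a BLUEFS_SPILLOVER
> > warning, so a quick check on any cluster is:
> >
> >     ceph health detail | grep -i spillover
> > )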
> >
> > My osds are 15.3TB ssds, so my solution was to store block+db together
> > on the ssds, put 4 osd/ssd and go up to 100pg/osd, so 1 disk holds
> > approx. 400pg. Also turned on balancer with upmap and max deviation 1.
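> >
> > (Sketched out, that balancer setup is roughly:
> >
> >     ceph balancer mode upmap
> >     ceph config set mgr mgr/balancer/upmap_max_deviation 1
> >     ceph balancer on
> > )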
> >
> > I’m using ec 4:2, let’s see how long it lasts. My bottleneck is always
> > the pg number: too small a pg number for too many objects.
> >
> > Istvan Szabo
> > Senior Infrastructure Engineer
> > ---------------------------------------------------
> > Agoda Services Co., Ltd.
> > e: istvan.szabo@xxxxxxxxx
> > ---------------------------------------------------
> >
> > On 22 Mar 2022, at 23:34, Boris Behrens <bb@xxxxxxxxx> wrote:
> >
> > The number 180 PGs is because of the 16TB disks. 3/4 of our OSDs had
> > cache SSDs (not NVMe though, and most of them are 10 OSDs per SSD), but
> > this problem only came in with Octopus.
> >
> > We also thought this might be the db compaction, but it doesn't match up.
> > It might happen when the compaction runs, but it looks like it also
> > happens when there are other operations like table_file_deletion, and it
> > happens on OSDs that have SSD-backed block.db devices (e.g. 5 OSDs share
> > one SAMSUNG MZ7KM1T9HAJM-00005, and the IOPS/throughput on the SSD is not
> > huge: 100 IOPS r/s, 300 IOPS w/s when compacting an OSD on it, and around
> > 50 MB/s r/w throughput).
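> >
> > (A quick way to watch what the shared db SSD is actually doing while
> > this happens is plain iostat plus the OSD's rocksdb perf counters;
> > device name and OSD id are placeholders:
> >
> >     iostat -x /dev/sdX 1
> >     ceph daemon osd.NN perf dump rocksdb
> > )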
> >
> > I also cannot reproduce it via "ceph tell osd.NN compact", so I am not
> > 100% sure it is the compaction.
> >
> > What do you mean by "grep for latency string"?
> >
> > Cheers
> > Boris
> >
> > On Tue, 22 Mar 2022 at 15:53, Konstantin Shalygin <
> > k0ste@xxxxxxxx> wrote:
> >
> > 180 PGs per OSD usually adds overhead; also, 40k objects per PG is not
> > much, but I don't think this will work without block.db on NVMe. I
> > think your "wrongly marked out" events happen at the time of rocksdb
> > compaction. With default log settings you can try to grep for 'latency'
> > strings.
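> >
> > i.e. something along these lines on the OSD host (log path is the stock
> > default):
> >
> >     grep -i latency /var/log/ceph/ceph-osd.*.log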
> >
> >
> > Also, https://tracker.ceph.com/issues/50297
> >
> >
> >
> > k
> >
> > Sent from my iPhone
> >
> >
> > On 22 Mar 2022, at 14:29, Boris Behrens <bb@xxxxxxxxx> wrote:
> >
> >
> > * the 8TB disks hold around 80-90 PGs (16TB around 160-180)
> > * per PG we have around 40k objects; 170M objects in 1.2 PiB of storage
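> >
> > (sanity check: ~170M objects / 4096 PGs ≈ 41.5k objects per PG, so the
> > numbers line up)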
> >
> >
> >
> >
> > --
> > This time, as an exception, the "UTF-8 problems" self-help group meets
> > in the groüen hall.
> >
> >
>
>
> --
> This time, as an exception, the "UTF-8 problems" self-help group meets in
> the groüen hall.
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



