Re: Does Replica Count Affect Tell Bench Result or Not?

Hi,

Just to add to the previous discussion, consumer SSDs like these can unfortunately be significantly *slower* than plain old HDDs for Ceph. This is because Ceph always uses SYNC writes to guarantee that data is on disk before returning.

Unfortunately, NAND writes are intrinsically quite slow, and triple/quad-level-cell (TLC/QLC) SSDs are the worst of them all. Enterprise SSDs solve this with power-loss-protection capacitors, which means they can safely acknowledge the data as written the second it is in the fast RAM on the device.

Cheap consumer SSDs fall into one of two categories:

1. The drive lies: when a SYNC write is requested, it acknowledges the data as written as soon as it is in the volatile write cache. This gives seemingly great performance ... until you have a power loss and your data is corrupted. Thankfully, very few drives do this today.
2. The drive handles the SYNC write correctly, which means it cannot acknowledge until the request has been moved from the cache to actual NAND, which is (very) slow.


The short story is that all drives without power-loss-protection should likely be avoided, because if the performance looks great, it might mean the drive falls into category #1 rather than being a magical & cheap solution.
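
If you want to check this yourself before trusting a drive, the classic test is to time small synchronous writes at queue depth 1, which is roughly the pattern Ceph's WAL/journal produces. Below is a minimal sketch in Python (the path, block size and write count are just illustrative assumptions; adjust them for the drive you want to test):

#!/usr/bin/env python3
# Rough sync-write latency check: time 4 KiB writes that must reach
# stable storage (O_DSYNC) at queue depth 1. Drives with power-loss
# protection typically finish these in well under a millisecond;
# consumer TLC/QLC drives are often 10-100x slower once the cache is gone.
import os
import time

TEST_FILE = "/mnt/scratch/synctest.bin"   # put this on the drive under test
BLOCK = b"\0" * 4096
COUNT = 1000

fd = os.open(TEST_FILE, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
latencies = []
try:
    for i in range(COUNT):
        t0 = time.perf_counter()
        os.pwrite(fd, BLOCK, i * 4096)    # each write must hit stable media
        latencies.append(time.perf_counter() - t0)
finally:
    os.close(fd)

latencies.sort()
print(f"avg {sum(latencies) / COUNT * 1000:.2f} ms, "
      f"p99 {latencies[int(COUNT * 0.99)] * 1000:.2f} ms, "
      f"~{COUNT / sum(latencies):.0f} IOPS at queue depth 1")

fio with direct, sync 4 KiB writes at iodepth=1 will give you cleaner numbers, but even this rough loop is enough to spot a drive that collapses to double-digit IOPS.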

There is unfortunately no inherent "best" SSD; it depends on your usage. For instance, for our large data partitions we need a lot of space and high read performance, but we don't store/update the data that frequently, so we opted for Samsung PM883 drives that are only rated for 0.8 DWPD (drive writes per day). In contrast, for metadata drives where we have more writes (but don't need a ton of storage), we use drives that can handle 3 DWPD, like the Samsung SM883.

Virtually all vendors have similarly differentiated product lines, so you will need to start by estimating how much data you expect to write per day relative to the total storage volume and pick drives accordingly.
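
As a rough back-of-the-envelope (all numbers below are made up for illustration): take the client writes you expect per day, multiply by the replication factor plus whatever write-amplification headroom you want to budget, and divide by the total raw capacity of the drives:

# Back-of-the-envelope DWPD sizing (illustrative numbers only).
client_writes_per_day_tb = 2.0    # what your clients actually write per day
replication_factor = 3            # each client write lands on 3 OSDs
wa_headroom = 2.0                 # extra budget for compaction, rebalancing, etc.
num_drives = 12
drive_capacity_tb = 3.84

raw_writes_per_day_tb = client_writes_per_day_tb * replication_factor * wa_headroom
dwpd_needed = raw_writes_per_day_tb / (num_drives * drive_capacity_tb)
print(f"~{dwpd_needed:.2f} DWPD needed")  # ~0.26 here, so a 0.8 DWPD drive has margin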

If you are operating a very read/write-intensive cluster with hundreds of operations in parallel, you will benefit a lot from higher-IOPS drives, but be aware that the theoretical numbers on the spec sheets are typically only achieved at very large queue depths (i.e., always having 32-64 operations running in parallel).
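
If you want to see the queue-depth effect on your own hardware, you can run the same kind of sync-write loop with several writers in parallel and compare against the single-writer result. This is only a crude stand-in for a proper fio run (again, the test file path and runtime are just assumptions):

# Crude illustration of IOPS vs. parallelism: N threads each doing O_DSYNC
# 4 KiB writes into their own region of a test file for a few seconds.
# Spec-sheet IOPS figures usually require the high-parallelism case.
import os
import threading
import time

TEST_FILE = "/mnt/scratch/qdtest.bin"     # on the drive under test
BLOCK = b"\0" * 4096
RUNTIME = 5.0

def measure(parallel):
    fd = os.open(TEST_FILE, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    counts = [0] * parallel
    deadline = time.perf_counter() + RUNTIME

    def worker(idx):
        base = idx * 256 * 1024 * 1024    # 256 MiB region per worker
        while time.perf_counter() < deadline:
            os.pwrite(fd, BLOCK, base + (counts[idx] % 65536) * 4096)
            counts[idx] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(parallel)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    os.close(fd)
    return sum(counts) / RUNTIME

for depth in (1, 4, 32):
    print(f"{depth:2d} parallel writers: ~{measure(depth):.0f} IOPS")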

Since you are currently using consumer SSDs (which definitely don't have the endurance to handle intensive IO anyway), my guess is that you have a fairly low-end setup, in which case good performance depends more on consistently low latency for all operations (including to/from the network cards).

If I were to invest in new servers today, I would likely go with NVMe, mostly because it's the future and not *that* much more expensive. For old servers, though, almost any enterprise-class SSD with power-loss-protection from the major vendors should be fine - you just need to analyse whether you need write-intensive disks or not.


Cheers,

Erik

--
Erik Lindahl <erik.lindahl@xxxxxxxxx>
On 28 Dec 2022 at 08:44 +0100, hosseinz8050@xxxxxxxxx <hosseinz8050@xxxxxxxxx>, wrote:
> Thanks. I am planning to change all of my disks. But do you know which enterprise SSD is best in the trade-off between cost & IOPS performance? Which model and brand? Thanks in advance.
> On Wednesday, December 28, 2022 at 08:44:34 AM GMT+3:30, Konstantin Shalygin <k0ste@xxxxxxxx> wrote:
>
> Hi,
>
> The cache was exhausted and the drive's optimization is now in progress. This is not an enterprise device; you should never use it with Ceph 🙂
>
>
> k
> Sent from my iPhone
>
> > On 27 Dec 2022, at 16:41, hosseinz8050@xxxxxxxxx wrote:
> >
> > Thanks Anthony. I have a cluster with QLC SSD disks (Samsung QVO 860). The cluster has been running for 2 years. Now all OSDs return 12 IOPS when running tell bench, which is very slow. But I bought new QVO disks yesterday and added one as a new OSD to the cluster. For the first hour, I got 100 IOPS from this new OSD. But after 1 hour, this new disk (OSD) dropped back to 12 IOPS, the same as the other old OSDs. I cannot imagine what is happening?!!
> >     On Tuesday, December 27, 2022 at 12:18:07 AM GMT+3:30, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
> >
> > My understanding is that when you ask an OSD to bench (via the admin socket), only that OSD executes the benchmark; there is no replication. Replication is a function of PGs.
> >
> > Thus, this is a narrowly-focused tool with both unique advantages and disadvantages.
> >
> >
> >
> > > > On Dec 26, 2022, at 12:47 PM, hosseinz8050@xxxxxxxxx wrote:
> > > >
> > > > Hi experts, I want to know: when I execute the ceph tell osd.x bench command, is replica 3 considered in the bench or not? I mean, for example in the case of replica 3, when I execute the tell bench command, does replica 1 of the bench data get written to osd.x, replica 2 to osd.y and replica 3 to osd.z? If this is true, it means that I cannot benchmark only one of my OSDs in the cluster, because the IOPS and throughput of the 2 other (for example, slow) OSDs will affect the result of the tell bench command for my target OSD. Is that true?
> > > > Thanks in advance.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx





