Re: Choosing suitable SSD for Ceph cluster

Paul Emmerich <paul.emmerich@xxxxxxxx> · Fri, 25 Oct 2019 13:07:21 +0200

Disabling the write cache helps with the 970 Pro, but it still performs
poorly. I've worked on a setup with heavy metadata requirements (gigantic
S3 buckets being listed) that unfortunately had all of that metadata
stored on 970 Pros, and it never really worked out.
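
To see what the kernel currently believes about a drive's write cache, a
minimal sketch along these lines may help. It assumes a Linux host that
exposes the queue/write_cache attribute in sysfs; the function name and the
device name "nvme0n1" are placeholders, and it only reports the setting, it
does not change anything on the drive.

```python
from pathlib import Path


def write_cache_mode(device, sysfs_root="/sys/block"):
    """Return the kernel's view of the device write cache, or None if unknown.

    On Linux the attribute reads "write back" (volatile cache in use)
    or "write through".
    """
    attr = Path(sysfs_root) / device / "queue" / "write_cache"
    try:
        return attr.read_text().strip()
    except OSError:
        # Attribute missing (older kernel, wrong device name, non-Linux host).
        return None


if __name__ == "__main__":
    # "nvme0n1" is a placeholder device name, not taken from this thread.
    print(write_cache_mode("nvme0n1"))
```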

Just get a proper SSD like the Samsung 883, 983, or 1725 series. The
(tiny) savings from consumer disks just aren't worth the hassle and the
problems you are going to run into.

Paul

On Thu, Oct 24, 2019 at 9:08 PM Hermann Himmelbauer <hermann@xxxxxxx> wrote:
>
> Hi,
> I am running a nice Ceph cluster (Proxmox 4 / Debian 8 / Ceph 0.94.3) on
> 3 nodes (Supermicro X8DTT-HIBQF) with 2 OSDs each (2 TB SATA hard disks),
> interconnected via 40 Gbit/s InfiniBand.
>
> The problem is that Ceph performance is quite bad (approx. 30 MiB/s
> reading, 3-4 MiB/s writing), so I thought about plugging a PCIe to
> NVMe/M.2 adapter into each node and installing SSDs. The idea is to
> have faster Ceph storage and also some additional capacity.
>
> The question now is which SSDs I should use. If I understand correctly,
> not every SSD is suitable for Ceph, as noted at the links below:
>
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> or here:
> https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
>
> The first link lists the Samsung SSD 950 PRO 512GB NVMe as a fast SSD
> for Ceph. As the 950 is no longer available, I ordered a Samsung 970
> 1TB for testing; unfortunately, I got the "EVO" instead of the PRO.
>
> Before equipping all nodes with these SSDs, I did some tests with "fio"
> as recommended, e.g. like this:
>
> fio --filename=/dev/DEVICE --direct=1 --sync=1 --rw=write --bs=4k \
>     --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
>     --name=journal-test
>
> The results are as follows:
>
> -----------------------
> 1) Samsung 970 EVO NVMe M.2 with PCIe adapter
> Jobs: 1:
> read : io=26706MB, bw=445MiB/s, iops=113945, runt= 60001msec
> write: io=252576KB, bw=4.1MiB/s, iops=1052, runt= 60001msec
>
> Jobs: 4:
> read : io=21805MB, bw=432.7MiB/s, iops=93034, runt= 60001msec
> write: io=422204KB, bw=6.8MiB/s, iops=1759, runt= 60002msec
>
> Jobs: 10:
> read : io=26921MB, bw=448MiB/s, iops=114859, runt= 60001msec
> write: io=435644KB, bw=7MiB/s, iops=1815, runt= 60004msec
> -----------------------
>
> So the read speed is impressive, but the write speed is really bad.
>
> Therefore I ordered the Samsung 970 PRO (1TB), as it has faster NAND
> chips (MLC instead of TLC). The results, however, are even worse for writing:
>
> -----------------------
> 2) Samsung 970 PRO NVMe M.2 with PCIe adapter
> Jobs: 1:
> read : io=15570MB, bw=259.4MiB/s, iops=66430, runt= 60001msec
> write: io=199436KB, bw=3.2MiB/s, iops=830, runt= 60001msec
>
> Jobs: 4:
> read : io=48982MB, bw=816.3MiB/s, iops=208986, runt= 60001msec
> write: io=327800KB, bw=5.3MiB/s, iops=1365, runt= 60002msec
>
> Jobs: 10:
> read : io=91753MB, bw=1529.3MiB/s, iops=391474, runt= 60001msec
> write: io=343368KB, bw=5.6MiB/s, iops=1430, runt= 60005msec
> -----------------------
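
A quick way to read the write numbers above: with --bs=4k, the bandwidth is
just IOPS times 4 KiB, and with one job at queue depth 1 the mean latency
per write is roughly 1/IOPS. The sketch below only redoes that arithmetic
for the single-job write results quoted above; the function names are made
up for illustration.

```python
BLOCK_SIZE = 4096  # bytes, matches --bs=4k in the fio command


def bandwidth_mib_s(iops, block_size=BLOCK_SIZE):
    """Bandwidth in MiB/s implied by an IOPS figure at a fixed block size."""
    return iops * block_size / 2**20


def mean_latency_ms(iops):
    """Approximate per-write latency for a single job at queue depth 1."""
    return 1000.0 / iops


if __name__ == "__main__":
    # Single-job write IOPS copied from the fio results quoted above.
    for name, iops in [("970 EVO", 1052), ("970 PRO", 830)]:
        print(f"{name}: {bandwidth_mib_s(iops):.1f} MiB/s, "
              f"~{mean_latency_ms(iops):.2f} ms per write")
```

In other words, each sync write on these drives takes on the order of a
millisecond, which is why adding jobs barely moves the write figures.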
>
> I did some research and found out that the "--sync" flag sets the
> "O_DSYNC" flag, which seems to disable the SSD cache and leads to these
> horrid write speeds.
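
To make the effect of that flag concrete, here is a rough sketch of what a
sync write loop looks like at the system call level. It assumes a Linux or
other Unix host and uses O_DSYNC as described above (as far as I know, fio's
documentation describes sync=1 as O_SYNC, which behaves similarly for this
purpose). Unlike the fio command it goes through a scratch file on a
filesystem and does not use O_DIRECT, so treat the number as an indication,
not as a replacement for the fio test. The function name and parameters are
illustrative.

```python
import os
import tempfile
import time


def sync_write_iops(path, block_size=4096, seconds=1.0):
    """Issue sequential block_size writes with O_DSYNC and return writes/second.

    Each os.write() returns only after the data has reached stable storage,
    so the drive cannot acknowledge the write from a volatile cache.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_DSYNC
    buf = b"\0" * block_size
    fd = os.open(path, flags, 0o600)
    try:
        writes = 0
        start = time.monotonic()
        while time.monotonic() - start < seconds:
            os.write(fd, buf)
            writes += 1
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    return writes / elapsed


if __name__ == "__main__":
    # Scratch file in the current directory; put it on the disk under test.
    with tempfile.TemporaryDirectory(dir=".") as scratch:
        iops = sync_write_iops(os.path.join(scratch, "sync-test.bin"))
        print(f"{iops:.0f} sync 4k writes per second")
```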
>
> This seems to come down to the fact that the write cache remains
> effective only on SSDs that implement some kind of battery-backed buffer
> that guarantees the data is flushed to flash in case of a power loss.
>
> However, it seems impossible to find out which SSDs have this power-loss
> protection. Moreover, these enterprise SSDs are crazy expensive
> compared to the SSDs above, and it's unclear whether power-loss
> protection is even available in the NVMe form factor. So building a
> 1 or 2 TB cluster does not seem affordable or viable.
>
> So, can anyone please give me hints on what to do? Is there some way to
> ensure that the write cache is not disabled? (My server is located in a
> data center, so there will probably never be a loss of power.)
>
> Or is the link above already outdated as newer ceph releases somehow
> deal with this problem? Or maybe a later Debian release (10) will handle
> the O_DSYNC flag differently?
>
> Perhaps I should simply invest in faster (and bigger) hard disks and
> forget the SSD cluster idea?
>
> Thank you in advance for any help,
>
> Best Regards,
> Hermann
>
>
> --
> hermann@xxxxxxx
> PGP/GPG: 299893C7 (on keyservers)
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx