Hi,
the Samsung PM1725b is definitely a good choice when it comes to "lower"-priced enterprise SSDs. It costs pretty much the same as the Samsung PRO SSDs but offers a much higher DWPD rating and power-loss protection.
My benchmarks of the 3.2TB version in a PCIe 2.0 slot (the card itself is PCIe 3.0!):
fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k --numjobs=10 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
write: IOPS=154k, BW=601MiB/s (630MB/s)(35.2GiB/60003msec)
fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4M --numjobs=5 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
write: IOPS=679, BW=2717MiB/s (2849MB/s)(159GiB/60005msec)
Regards,
Georg
On 24.10.19 21:21, Martin Verges wrote:
Hello,
think about migrating to a much faster and better Ceph version, and towards BlueStore, to increase the performance of the existing hardware.
If you want to go with a PCIe card, the Samsung PM1725b can provide quite good speeds, but at a much higher cost than the EVO. If you want to check drives, take a look at the uncached write latency: the lower the value, the better the drive.
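A minimal sketch of such a check, assuming the drive under test shows up as /dev/nvme0n1 and does not hold any data you care about (fio writes to the raw device):
fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=sync-write-latency
The completion latency ("clat") numbers in the fio output are the ones to compare between drives.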
On Thu, 24 Oct 2019 at 21:09, Hermann Himmelbauer <hermann@xxxxxxx> wrote:
Hi,
I am running a nice Ceph (Proxmox 4 / Debian 8 / Ceph 0.94.3) cluster on 3 nodes (Supermicro X8DTT-HIBQF), 2 OSDs each (2TB SATA hard disks), interconnected via 40 Gbit InfiniBand.
The problem is that the Ceph performance is quite bad (approx. 30 MiB/s reading, 3-4 MiB/s writing), so I thought about plugging a PCIe-to-NVMe/M.2 adapter into each node and installing SSDs. The idea is to get faster Ceph storage and also some extra capacity.
The question now is which SSDs I should use. If I understand it correctly, not every SSD is suitable for Ceph, as noted in the links below:
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
or here:
https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
In the first link, the Samsung SSD 950 PRO 512GB NVMe is listed as a fast SSD for Ceph. As the 950 is no longer available, I ordered a Samsung 970 1TB for testing; unfortunately, the "EVO" instead of the PRO.
Before equipping all nodes with these SSDs, I did some tests with "fio" as recommended, e.g. like this:
fio --filename=/dev/DEVICE --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
The results are as follows:
-----------------------
1) Samsung 970 EVO NVMe M.2 with PCIe adapter
Jobs: 1:
read : io=26706MB, bw=445MiB/s, iops=113945, runt= 60001msec
write: io=252576KB, bw=4.1MiB/s, iops=1052, runt= 60001msec
Jobs: 4:
read : io=21805MB, bw=432.7MiB/s, iops=93034, runt= 60001msec
write: io=422204KB, bw=6.8MiB/s, iops=1759, runt= 60002msec
Jobs: 10:
read : io=26921MB, bw=448MiB/s, iops=114859, runt= 60001msec
write: io=435644KB, bw=7MiB/s, iops=1815, runt= 60004msec
-----------------------
So the read speed is impressive, but the write speed is really bad. Therefore I ordered the Samsung 970 PRO (1TB), as it has faster NAND chips (MLC instead of TLC). The results are, however, even worse for writing:
-----------------------
2) Samsung 970 PRO NVMe M.2 with PCIe adapter
Jobs: 1:
read : io=15570MB, bw=259.4MiB/s, iops=66430, runt= 60001msec
write: io=199436KB, bw=3.2MiB/s, iops=830, runt= 60001msec
Jobs: 4:
read : io=48982MB, bw=816.3MiB/s, iops=208986, runt= 60001msec
write: io=327800KB, bw=5.3MiB/s, iops=1365, runt= 60002msec
Jobs: 10:
read : io=91753MB, bw=1529.3MiB/s, iops=391474, runt= 60001msec
write: io=343368KB, bw=5.6MiB/s, iops=1430, runt= 60005msec
-----------------------
I did some research and found out that the "--sync" flag sets the O_DSYNC flag, which seems to disable the SSD's write cache and leads to these horrid write speeds. It seems this relates to the fact that the write cache is only left enabled for SSDs that implement some kind of battery/capacitor buffer which guarantees that cached data is flushed to flash in case of a power loss.
However, it seems impossible to find out which SSDs have this power-loss protection; moreover, these enterprise SSDs are crazy expensive compared to the SSDs above, and it's unclear whether power-loss protection is even available in the NVMe form factor. So building a 1 or 2 TB cluster does not really seem affordable/viable.
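One thing that can at least be queried directly is whether a controller reports a volatile write cache; a minimal sketch, assuming nvme-cli is installed and the controller shows up as /dev/nvme0:
nvme id-ctrl /dev/nvme0 | grep -i vwc    # VWC field of the Identify Controller data
nvme get-feature /dev/nvme0 -f 0x06      # feature 0x06 = Volatile Write Cache, current enable/disable state
As far as I know this only says whether the drive reports a volatile write cache, not whether that cache is protected against power loss; for the latter, the vendor datasheet (look for "power loss protection" or "PLP") seems to be the only reliable source.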
So, can anyone please give me hints on what to do? Is it possible to ensure that the write cache is not disabled in some way (my server is situated in a data center, so there will probably never be a loss of power)?
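As far as I can tell, the kernel keeps its own view of a device's write cache in sysfs, and overriding it is supposed to stop the kernel from issuing cache flushes at all; a sketch, assuming the device shows up as nvme0n1, and with the obvious caveat that it trades crash safety for speed:
cat /sys/block/nvme0n1/queue/write_cache                       # typically reports "write back" or "write through"
echo "write through" > /sys/block/nvme0n1/queue/write_cache    # kernel stops sending flushes; changes only the kernel's view, not the drive itself
But I am not sure whether that is a sane thing to do.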
Or is the link above already outdated, as newer Ceph releases somehow deal with this problem? Or maybe a later Debian release (10) will handle the O_DSYNC flag differently?
Perhaps I should simply invest in faster (and bigger) hard disks and forget the SSD-cluster idea?
Thank you in advance for any help,
Best Regards,
Hermann
--
hermann@xxxxxxx
PGP/GPG: 299893C7 (on keyservers)
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx