I certainly would, particularly on your SSDs. I'm not familiar with those Toshibas, but disabling the on-disk cache has improved performance on my clusters and for others on this list.

Does the LSI controller you're using provide read/write cache, and do you have it enabled? 7.2k spinners are likely to see a huge performance gain from controller cache, especially with regard to latency. Only enable caching if the controller has a battery, and make sure the controller is set to force write-through if the battery fails. If your controller doesn't have cache, you may want to seriously consider upgrading to controllers that do; otherwise those 7.2k disks are going to be a major limiting factor in terms of performance.

Regarding your db partition, the latest advice seems to be that your db should be 2x the biggest RocksDB level (at least 60GB) to avoid spillover to the OSD's slow device during compaction. See: https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg54628.html. With 7.2k disks you'll want to avoid small writes hitting them directly if possible, especially if you have no controller cache.

It would be useful to see iowait on your cluster: run iostat -x 2 and let it run for a few cycles while the cluster is busy. If there's high iowait on your SSDs, disabling the disk cache may show an improvement. If there's high iowait on the HDDs, controller cache and/or increasing your db size may help. I've added a rough sketch of the relevant commands below your quoted mail.

John Petrini
Platforms Engineer
215.297.4400 x 232
www.coredial.com
751 Arbor Way, Hillcrest I, Suite 150, Blue Bell, PA 19422

The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer.

On Tue, Jun 11, 2019 at 3:35 AM Stolte, Felix <f.stolte@xxxxxxxxxxxxx> wrote:
>
> Hi John,
>
> I have 9 HDDs and 3 SSDs behind a SAS3008 PCI-Express Fusion-MPT SAS-3 from LSI. The HDDs are HGST HUH721008AL (8TB, 7200 rpm), the SSDs are Toshiba PX05SMB040 (400GB). The OSDs are BlueStore; each group of 3 HDDs has its WAL and DB on one SSD (DB size 50GB, WAL 10GB). I did not change any cache settings.
>
> I disabled cstates, which improved performance slightly. Do you suggest turning off caching on the disks?
>
> Regards
> Felix
>
> -------------------------------------------------------------------------------------
> Forschungszentrum Juelich GmbH
> 52425 Juelich
> Registered office: Juelich
> Registered in the commercial register of the Amtsgericht Dueren, No. HR B 3498
> Chairman of the Supervisory Board: MinDir Dr. Karl Eugen Huthmacher
> Management: Prof. Dr.-Ing. Wolfgang Marquardt (Chairman),
> Karsten Beneke (Deputy Chairman), Prof. Dr.-Ing. Harald Bolt,
> Prof. Dr. Sebastian M. Schmidt
> -------------------------------------------------------------------------------------
>
> From: John Petrini <jpetrini@xxxxxxxxxxxx>
> Date: Friday, 7 June 2019 at 15:49
> To: "Stolte, Felix" <f.stolte@xxxxxxxxxxxxx>
> Cc: Sinan Polat <sinan@xxxxxxxx>, ceph-users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Expected IO in luminous Ceph Cluster
>
> How's iowait look on your disks?
>
> How have you configured your disks and what are your cache settings?
>
> Did you disable cstates?
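A rough sketch of the commands behind the advice above. The device name /dev/sdX, the OSD id osd.0 and the choice of sdparm (SAS) versus hdparm (SATA) are only placeholders, so adapt them to your actual hardware and tooling:

# show / disable the volatile write cache on a SAS drive (hdparm -W /dev/sdX is the SATA equivalent)
sdparm --get=WCE /dev/sdX
sdparm --clear=WCE /dev/sdX    # disable; --set=WCE turns it back on, add --save to persist it

# watch per-device latency for a few cycles while the cluster is busy
iostat -x 2 10                 # high await/%util on the SSDs or HDDs shows where the wait is

# check whether an OSD's RocksDB has spilled over onto the slow device
# (run on the OSD host via the admin socket; non-zero slow_used_bytes indicates spillover)
ceph daemon osd.0 perf dump | grep -E '"(db|slow)_(total|used)_bytes"'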
> > On Friday, June 7, 2019, Stolte, Felix <f.stolte@xxxxxxxxxxxxx> wrote:
> > Hi Sinan,
> >
> > thanks for the numbers. I am a little bit surprised that your SSD pool has nearly the same stats as your SAS pool.
> >
> > Nevertheless I would expect our pools to perform like your SAS pool, at least with regard to writes, since all our write ops should be placed on our SSDs. But since I only achieve 10% of your numbers, I need to figure out my bottleneck. For now I have no clue. According to our monitoring system, neither network bandwidth, RAM nor CPU usage is anywhere close to being saturated.
> >
> > Could someone advise me on where to look?
> >
> > Regards
> > Felix
> >
> > On 07.06.19 at 13:33, "Sinan Polat" <sinan@xxxxxxxx> wrote:
> >
> > Hi Felix,
> >
> > I have 2 pools, an SSD-only and a SAS-only pool.
> >
> > The SSD pool is spread over 12 OSD servers.
> > The SAS pool is spread over 6 OSD servers.
> >
> > See results (SSD Only Pool):
> >
> > # sysbench --file-fsync-freq=1 --threads=16 fileio --file-total-size=1G --file-test-mode=rndrw --file-rw-ratio=2 run
> > sysbench 1.0.17 (using system LuaJIT 2.0.4)
> >
> > Running the test with following options:
> > Number of threads: 16
> > Initializing random number generator from current time
> >
> > Extra file open flags: (none)
> > 128 files, 8MiB each
> > 1GiB total file size
> > Block size 16KiB
> > Number of IO requests: 0
> > Read/Write ratio for combined random IO test: 2.00
> > Periodic FSYNC enabled, calling fsync() each 1 requests.
> > Calling fsync() at the end of test, Enabled.
> > Using synchronous I/O mode
> > Doing random r/w test
> > Initializing worker threads...
> >
> > Threads started!
> >
> > File operations:
> >     reads/s:   508.38
> >     writes/s:  254.19
> >     fsyncs/s:  32735.14
> >
> > Throughput:
> >     read, MiB/s:    7.94
> >     written, MiB/s: 3.97
> >
> > General statistics:
> >     total time:              10.0103s
> >     total number of events:  333336
> >
> > Latency (ms):
> >     min:              0.00
> >     avg:              0.48
> >     max:              10.18
> >     95th percentile:  2.11
> >     sum:              159830.07
> >
> > Threads fairness:
> >     events (avg/stddev):          20833.5000/335.70
> >     execution time (avg/stddev):  9.9894/0.00
> > #
> >
> > See results (SAS Only Pool):
> >
> > # sysbench --file-fsync-freq=1 --threads=16 fileio --file-total-size=1G --file-test-mode=rndrw --file-rw-ratio=2 run
> > sysbench 1.0.17 (using system LuaJIT 2.0.4)
> >
> > Running the test with following options:
> > Number of threads: 16
> > Initializing random number generator from current time
> >
> > Extra file open flags: (none)
> > 128 files, 8MiB each
> > 1GiB total file size
> > Block size 16KiB
> > Number of IO requests: 0
> > Read/Write ratio for combined random IO test: 2.00
> > Periodic FSYNC enabled, calling fsync() each 1 requests.
> > Calling fsync() at the end of test, Enabled.
> > Using synchronous I/O mode
> > Doing random r/w test
> > Initializing worker threads...
> >
> > Threads started!
> >
> > File operations:
> >     reads/s:   490.11
> >     writes/s:  245.10
> >     fsyncs/s:  31565.00
> >
> > Throughput:
> >     read, MiB/s:    7.66
> >     written, MiB/s: 3.83
> >
> > General statistics:
> >     total time:              10.0143s
> >     total number of events:  321477
> >
> > Latency (ms):
> >     min:              0.00
> >     avg:              0.50
> >     max:              20.50
> >     95th percentile:  2.30
> >     sum:              159830.82
> >
> > Threads fairness:
> >     events (avg/stddev):          20092.3125/186.66
> >     execution time (avg/stddev):  9.9894/0.00
> > #
> >
> > Kind regards,
> > Sinan Polat
> >
> > > On 7 June 2019 at 12:47, "Stolte, Felix" <f.stolte@xxxxxxxxxxxxx> wrote:
> > >
> > > Hi Sinan,
> > >
> > > that would be great. The numbers should differ a lot, since you have an all-flash pool, but it would be interesting what we could expect from such a configuration.
> > >
> > > Regards
> > > Felix
> > >
> > > On 07.06.19 at 12:02, "Sinan Polat" <sinan@xxxxxxxx> wrote:
> > >
> > > Hi Felix,
> > >
> > > I can run your commands inside an OpenStack VM. The storage cluster consists of 12 OSD servers, each holding 8x 960GB SSDs. Luminous FileStore, replicated with size 3.
> > >
> > > Would it help you if I run your command on my cluster?
> > >
> > > Sinan
> > >
> > > > On 7 Jun 2019 at 08:52, Stolte, Felix <f.stolte@xxxxxxxxxxxxx> wrote:
> > > >
> > > > I have no performance data from before we migrated to bluestore. You should start a separate topic regarding your question.
> > > >
> > > > Could anyone with a more or less equally sized cluster post the output of a sysbench with the following parameters (either from inside an openstack vm or a mounted rbd)?
> > > >
> > > > sysbench --file-fsync-freq=1 --threads=16 fileio --file-total-size=1G --file-test-mode=rndrw --file-rw-ratio=2 prepare
> > > >
> > > > sysbench --file-fsync-freq=1 --threads=16 fileio --file-total-size=1G --file-test-mode=rndrw --file-rw-ratio=2 run
> > > >
> > > > Thanks in advance.
> > > >
> > > > Regards
> > > > Felix
> > > >
> > > > On 06.06.19 at 15:09, "Marc Roos" <M.Roos@xxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > I am also thinking of moving the wal/db of the SATA HDDs to SSD. Did you do tests before and after this change, and do you know what the difference is in IOPS? And is the advantage more or less when your SATA HDDs are slower?
> > > >
> > > > -----Original Message-----
> > > > From: Stolte, Felix [mailto:f.stolte@xxxxxxxxxxxxx]
> > > > Sent: Thursday, 6 June 2019 10:47
> > > > To: ceph-users
> > > > Subject: Expected IO in luminous Ceph Cluster
> > > >
> > > > Hello folks,
> > > >
> > > > we are running a ceph cluster on Luminous consisting of 21 OSD nodes with 9x 8TB SATA drives and 3 Intel 3700 SSDs for BlueStore WAL and DB (1 SSD : 3 HDDs). OSD nodes have 10Gb links for the public and cluster networks. The cluster has been running stable for over a year. We didn't have a closer look at IO until one of our customers started to complain about a VM we migrated from VMware with NetApp storage to our OpenStack cloud with Ceph storage. He sent us a sysbench report from the machine, which I could reproduce on other VMs as well as on a mounted RBD on physical hardware:
> > > >
> > > > sysbench --file-fsync-freq=1 --threads=16 fileio --file-total-size=1G --file-test-mode=rndrw --file-rw-ratio=2 run
> > > > sysbench 1.0.11 (using system LuaJIT 2.1.0-beta3)
> > > >
> > > > Running the test with following options:
> > > > Number of threads: 16
> > > > Initializing random number generator from current time
> > > >
> > > > Extra file open flags: 0
> > > > 128 files, 8MiB each
> > > > 1GiB total file size
> > > > Block size 16KiB
> > > > Number of IO requests: 0
> > > > Read/Write ratio for combined random IO test: 2.00
> > > > Periodic FSYNC enabled, calling fsync() each 1 requests.
> > > > Calling fsync() at the end of test, Enabled.
> > > > Using synchronous I/O mode
> > > > Doing random r/w test
> > > >
> > > > File operations:
> > > >     reads/s:   36.36
> > > >     writes/s:  18.18
> > > >     fsyncs/s:  2318.59
> > > >
> > > > Throughput:
> > > >     read, MiB/s:    0.57
> > > >     written, MiB/s: 0.28
> > > >
> > > > General statistics:
> > > >     total time:              10.0071s
> > > >     total number of events:  23755
> > > >
> > > > Latency (ms):
> > > >     min:              0.01
> > > >     avg:              6.74
> > > >     max:              1112.58
> > > >     95th percentile:  26.68
> > > >     sum:              160022.67
> > > >
> > > > Threads fairness:
> > > >     events (avg/stddev):          1484.6875/52.59
> > > >     execution time (avg/stddev):  10.0014/0.00
> > > >
> > > > Are these numbers reasonable for a cluster of our size?
> > > >
> > > > Best regards
> > > > Felix
> > > > IT-Services
> > > > Telefon 02461 61-9243
> > > > E-Mail: f.stolte@xxxxxxxxxxxxx

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com