Re: CEPH 16.2.x: disappointing I/O performance

Hm, generally Ceph is mostly latency-sensitive, which translates more into
IOPS limits than into bandwidth limits. In a single-threaded write scenario
your throughput is limited by the latency of the write path, which is
roughly network + OSD write path + disk. People have managed to get write
latencies under 1 ms on all-flash setups, but around 0.8 ms seems to be the
best you can achieve, which puts an upper limit of roughly 1,200 IOPS on a
single-threaded client doing direct, synchronized I/O. But there shouldn't
really be much in the path that artificially limits bandwidth.
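
As a rough back-of-the-envelope illustration (the 0.8 ms is an assumed
best-case figure, not a measurement of your cluster):

    1 s / 0.0008 s per write  ~= 1,250 writes/s (IOPS)
    1,250 IOPS * 4 KiB        ~= 5 MiB/s of single-threaded 4k bandwidth

If you want to see what your actual per-write latency looks like from a
client, something along these lines should do it (adjust the target path
and size to taste) - then look at the completion latency (clat) that fio
reports:

    fio --name=writelat --rw=write --bs=4k --numjobs=1 --iodepth=1 \
        --direct=1 --fsync=1 --size=1g --runtime=60 --time_based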

BlueStore defers only small writes - those are the ones that hit the WAL;
writes larger than the threshold go straight to the backing store (i.e. the
HDD). I think the default threshold is 32 KiB, but I could be wrong.
Obviously even the small writes eventually have to be flushed out of the
WAL, so your longer-term performance is still bounded by your HDD speed.
That might be why throughput suffers at larger block sizes: they hit the
drives directly.
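
If you want to check which threshold is actually in effect on your OSDs, I
believe the relevant option is bluestore_prefer_deferred_size_hdd (name from
memory, so double-check it against the docs for your release), e.g.:

    ceph config get osd.0 bluestore_prefer_deferred_size_hdd
    ceph config show osd.0 | grep deferred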

It's been pointed out in the past that disabling the HDD write cache can
actually improve latency quite substantially (e.g.
https://ceph-users.ceph.narkive.com/UU9QMu9W/disabling-write-cache-on-sata-hdds-reduces-write-latency-7-times)
- it might be worth a try.
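
On SATA drives that's usually something like "hdparm -W 0 /dev/sdX"; for SAS
drives like the Toshibas here, sdparm should do it - roughly as follows
(device names are placeholders, check the man page, and note the --save flag
if you want the setting to survive a power cycle):

    sdparm --clear WCE /dev/sdX     # disable the volatile write cache
    sdparm --get WCE /dev/sdX       # verify the current setting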


On Wed, 6 Oct 2021 at 10:07, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:

> I'm not sure - fio might be showing some bogus values in the summary; I'll
> check the readings again tomorrow.
>
> Another thing I noticed is that writes seem bandwidth-limited and don't
> scale well with block size and/or number of threads, i.e. one client
> writes at about the same speed regardless of the benchmark settings. A
> person on reddit, where I asked this question as well, suggested that in a
> replicated pool writes and reads are handled by the primary PG, which would
> explain this write bandwidth limit.
>
> /Z
>
> On Tue, 5 Oct 2021, 22:31 Christian Wuerdig, <christian.wuerdig@xxxxxxxxx>
> wrote:
>
>> Maybe some info is missing, but 7k write IOPS at a 4k block size seems
>> fairly decent (as you also state) - the bandwidth follows directly from
>> that, so I'm not sure what you're expecting?
>> I am a bit puzzled, though - by my math, 7k IOPS at 4k should only be about
>> 27 MiB/s, so I'm not sure how the 120 MiB/s was achieved.
>> The read benchmark seems in line with that: 13k IOPS at 4k makes around
>> 52 MiB/s of bandwidth, which again is expected.
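>>
>> (For reference, the arithmetic is just IOPS x block size, so roughly:
>>    7,000 IOPS x 4 KiB ~= 27 MiB/s
>>   13,000 IOPS x 4 KiB ~= 51 MiB/s
>> which is why the 120 MiB/s write figure looks odd for a pure 4k workload.)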
>>
>>
>> On Wed, 6 Oct 2021 at 04:08, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
>>
>>> Hi,
>>>
>>> I built a CEPH 16.2.x cluster with relatively fast and modern hardware,
>>> and
>>> its performance is kind of disappointing. I would very much appreciate any
>>> advice and/or pointers :-)
>>>
>>> The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
>>>
>>> 2 x Intel(R) Xeon(R) Gold 5220R CPUs
>>> 384 GB RAM
>>> 2 x boot drives
>>> 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
>>> 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
>>> 9 x Toshiba MG06SCA10TE 10 TB HDDs, write cache off (storage tier)
>>> 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
>>>
>>> All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel;
>>> AppArmor is disabled and energy-saving features are disabled. The network
>>> between the CEPH nodes is 40G, the CEPH access network is 40G, and the
>>> average latencies are < 0.15 ms. I've personally tested the network for
>>> throughput, latency and loss, and can tell that it's operating as expected
>>> and doesn't exhibit any issues at idle or under load.
>>>
>>> The CEPH cluster is set up with 2 storage classes, NVME and HDD, with the
>>> 2 smaller NVME drives in each node used for DB/WAL and each HDD's DB/WAL
>>> placed on one of them. ceph osd tree output:
>>>
>>> ID   CLASS  WEIGHT     TYPE NAME                STATUS  REWEIGHT  PRI-AFF
>>>  -1         288.37488  root default
>>> -13         288.37488      datacenter ste
>>> -14         288.37488          rack rack01
>>>  -7          96.12495              host ceph01
>>>   0    hdd    9.38680                  osd.0        up   1.00000  1.00000
>>>   1    hdd    9.38680                  osd.1        up   1.00000  1.00000
>>>   2    hdd    9.38680                  osd.2        up   1.00000  1.00000
>>>   3    hdd    9.38680                  osd.3        up   1.00000  1.00000
>>>   4    hdd    9.38680                  osd.4        up   1.00000  1.00000
>>>   5    hdd    9.38680                  osd.5        up   1.00000  1.00000
>>>   6    hdd    9.38680                  osd.6        up   1.00000  1.00000
>>>   7    hdd    9.38680                  osd.7        up   1.00000  1.00000
>>>   8    hdd    9.38680                  osd.8        up   1.00000  1.00000
>>>   9   nvme    5.82190                  osd.9        up   1.00000  1.00000
>>>  10   nvme    5.82190                  osd.10       up   1.00000  1.00000
>>> -10          96.12495              host ceph02
>>>  11    hdd    9.38680                  osd.11       up   1.00000  1.00000
>>>  12    hdd    9.38680                  osd.12       up   1.00000  1.00000
>>>  13    hdd    9.38680                  osd.13       up   1.00000  1.00000
>>>  14    hdd    9.38680                  osd.14       up   1.00000  1.00000
>>>  15    hdd    9.38680                  osd.15       up   1.00000  1.00000
>>>  16    hdd    9.38680                  osd.16       up   1.00000  1.00000
>>>  17    hdd    9.38680                  osd.17       up   1.00000  1.00000
>>>  18    hdd    9.38680                  osd.18       up   1.00000  1.00000
>>>  19    hdd    9.38680                  osd.19       up   1.00000  1.00000
>>>  20   nvme    5.82190                  osd.20       up   1.00000  1.00000
>>>  21   nvme    5.82190                  osd.21       up   1.00000  1.00000
>>>  -3          96.12495              host ceph03
>>>  22    hdd    9.38680                  osd.22       up   1.00000  1.00000
>>>  23    hdd    9.38680                  osd.23       up   1.00000  1.00000
>>>  24    hdd    9.38680                  osd.24       up   1.00000  1.00000
>>>  25    hdd    9.38680                  osd.25       up   1.00000  1.00000
>>>  26    hdd    9.38680                  osd.26       up   1.00000  1.00000
>>>  27    hdd    9.38680                  osd.27       up   1.00000  1.00000
>>>  28    hdd    9.38680                  osd.28       up   1.00000  1.00000
>>>  29    hdd    9.38680                  osd.29       up   1.00000  1.00000
>>>  30    hdd    9.38680                  osd.30       up   1.00000  1.00000
>>>  31   nvme    5.82190                  osd.31       up   1.00000  1.00000
>>>  32   nvme    5.82190                  osd.32       up   1.00000  1.00000
>>>
>>> ceph df:
>>>
>>> --- RAW STORAGE ---
>>> CLASS     SIZE    AVAIL    USED  RAW USED  %RAW USED
>>> hdd    253 TiB  241 TiB  13 TiB    13 TiB       5.00
>>> nvme    35 TiB   35 TiB  82 GiB    82 GiB       0.23
>>> TOTAL  288 TiB  276 TiB  13 TiB    13 TiB       4.42
>>>
>>> --- POOLS ---
>>> POOL                   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
>>> images                 12  256   24 GiB    3.15k   73 GiB   0.03     76 TiB
>>> volumes                13  256  839 GiB  232.16k  2.5 TiB   1.07     76 TiB
>>> backups                14  256   31 GiB    8.56k   94 GiB   0.04     76 TiB
>>> vms                    15  256  752 GiB  198.80k  2.2 TiB   0.96     76 TiB
>>> device_health_metrics  16   32   35 MiB       39  106 MiB      0     76 TiB
>>> volumes-nvme           17  256   28 GiB    7.21k   81 GiB   0.24     11 TiB
>>> ec-volumes-meta        18  256   27 KiB        4   92 KiB      0     76 TiB
>>> ec-volumes-data        19  256    8 KiB        1   12 KiB      0    152 TiB
>>>
>>> Please disregard the ec-pools, as they're not currently in use. All other
>>> pools are configured with min_size=2, size=3. All pools are bound to HDD
>>> storage except for 'volumes-nvme', which is bound to NVME. The number of
>>> PGs was increased recently, because the autoscaler was giving me a very
>>> uneven PG distribution across devices, and we're expecting to add 3 more
>>> nodes of exactly the same configuration in the coming weeks. I have to
>>> emphasize that I tested different PG counts and they didn't have a
>>> noticeable impact on the cluster performance.
>>>
>>> The main issue is that this beautiful cluster isn't very fast. When I test
>>> against the 'volumes' pool, which resides on the HDD storage class (HDDs
>>> with DB/WAL on NVME), I get unexpectedly low throughput numbers:
>>>
>>> > rados -p volumes bench 30 write --no-cleanup
>>> ...
>>> Total time run:         30.3078
>>> Total writes made:      3731
>>> Write size:             4194304
>>> Object size:            4194304
>>> Bandwidth (MB/sec):     492.415
>>> Stddev Bandwidth:       161.777
>>> Max bandwidth (MB/sec): 820
>>> Min bandwidth (MB/sec): 204
>>> Average IOPS:           123
>>> Stddev IOPS:            40.4442
>>> Max IOPS:               205
>>> Min IOPS:               51
>>> Average Latency(s):     0.129115
>>> Stddev Latency(s):      0.143881
>>> Max latency(s):         1.35669
>>> Min latency(s):         0.0228179
>>>
>>> > rados -p volumes bench 30 seq --no-cleanup
>>> ...
>>> Total time run:       14.7272
>>> Total reads made:     3731
>>> Read size:            4194304
>>> Object size:          4194304
>>> Bandwidth (MB/sec):   1013.36
>>> Average IOPS:         253
>>> Stddev IOPS:          63.8709
>>> Max IOPS:             323
>>> Min IOPS:             91
>>> Average Latency(s):   0.0625202
>>> Max latency(s):       0.551629
>>> Min latency(s):       0.010683
>>>
>>> On average, I get around 550 MB/s writes and 800 MB/s reads with 16 threads
>>> and 4MB blocks. The numbers don't look fantastic for this hardware; I can
>>> actually push over 8 GB/s of throughput with fio, 16 threads and 4MB blocks
>>> from an RBD client (a KVM Linux VM) connected over a low-latency 40G
>>> network, probably hitting some OSD caches there:
>>>
>>>    READ: bw=8525MiB/s (8939MB/s), 58.8MiB/s-1009MiB/s (61.7MB/s-1058MB/s),
>>>          io=501GiB (538GB), run=60001-60153msec
>>> Disk stats (read/write):
>>>   vdc: ios=48163/0, merge=6027/0, ticks=1400509/0, in_queue=1305092,
>>> util=99.48%
>>>
>>> The issue manifests when the same client does something closer to real-life
>>> usage, like a single-threaded write or read with 4KB blocks, as it would
>>> with, for example, an ext4 file system:
>>>
>>> > fio --name=ttt --ioengine=posixaio --rw=write --bs=4k --numjobs=1
>>> --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
>>> ...
>>> Run status group 0 (all jobs):
>>>   WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s),
>>> io=7694MiB (8067MB), run=64079-64079msec
>>> Disk stats (read/write):
>>>   vdc: ios=0/6985, merge=0/406, ticks=0/3062535, in_queue=3048216,
>>> util=77.31%
>>>
>>> > fio --name=ttt --ioengine=posixaio --rw=read --bs=4k --numjobs=1
>>> --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
>>> ...
>>> Run status group 0 (all jobs):
>>>    READ: bw=54.0MiB/s (56.7MB/s), 54.0MiB/s-54.0MiB/s (56.7MB/s-56.7MB/s),
>>>          io=3242MiB (3399MB), run=60001-60001msec
>>> Disk stats (read/write):
>>>   vdc: ios=12952/3, merge=0/1, ticks=81706/1, in_queue=56336, util=99.13%
>>>
>>> And this is a total disaster: the IOPS look decent, but the bandwidth is
>>> unexpectedly low. I just don't understand why a single RBD client writes
>>> at only 120 MB/s (sometimes slower), and the 50 MB/s reads look like a bad
>>> joke ¯\_(ツ)_/¯
>>>
>>> When I run these benchmarks, nothing seems to be overloaded: CPU and
>>> network are barely utilized, and OSD latencies don't show anything unusual.
>>> Thus I am puzzled by these results, as in my opinion SAS HDDs with DB/WAL
>>> on NVME drives should produce better I/O bandwidth, both for writes and
>>> reads. I mean, I can easily get much better performance from a single HDD
>>> shared over the network via NFS or iSCSI.
>>>
>>> I am open to suggestions and would very much appreciate comments and/or
>>> advice on how to improve the cluster performance.
>>>
>>> Best regards,
>>> Zakhar
>>>
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



