Re: Small HDD cluster, switch from Bluestore to Filestore

Hi.

Ceph is not to blame here!

Linux does not support cached (buffered) asynchronous I/O, except with the new io_uring. That is, it accepts aio calls, but they simply block when issued against an FD opened without O_DIRECT.

So what actually happens when you benchmark with -ioengine=libaio -direct=0 is that the I/O becomes SINGLE-THREADED: each submission blocks until it completes.

Of course the single-threaded performance is worse.
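You can see the effect yourself by running the same libaio job with and without O_DIRECT. This is just a sketch; the device name, sizes and queue depth below are placeholders, adjust for your setup:

  # Buffered (direct=0): each io_submit() blocks until the request completes,
  # so iodepth=16 degrades to an effective queue depth of 1
  fio --name=buffered --ioengine=libaio --direct=0 --rw=read --bs=4M --iodepth=16 --size=8G --filename=/dev/vdb

  # O_DIRECT (direct=1): submissions are genuinely asynchronous and the queue depth is honored
  fio --name=odirect --ioengine=libaio --direct=1 --rw=read --bs=4M --iodepth=16 --size=8G --filename=/dev/vdb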

Hi Everyone,
There have been a few threads recently about small HDD (spinning disk)
clusters and their performance on Bluestore.
One from Christian
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036385.html)
was particularly interesting to us, as we have a very similar setup to
Christian's and see similar performance.

We have a 6-node cluster, each node with 12x 4TB SATA HDDs behind an
IT-mode LSI 3008, and wal/db on 33GB NVMe partitions. Each node has a
single Xeon Gold 6132 CPU @ 2.60GHz and dual 10Gb networking.
We also use bcache, with one 180GB NVMe partition shared between 6 OSDs.
The workload comes in via KVM (Proxmox).
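(For anyone unfamiliar with bcache: one cache set can serve several backing devices, which is how a single NVMe partition fronts 6 OSDs. A minimal sketch of that layout, with placeholder device names rather than our actual commands:)

  # Format the NVMe partition as a cache device and one HDD as a backing device
  make-bcache -C /dev/nvme0n1p4
  make-bcache -B /dev/sda
  # Attach the backing device to the cache set; repeat for each HDD.
  # The cset UUID comes from 'bcache-super-show /dev/nvme0n1p4'.
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach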

I ran the same fio benchmarks as Christian. Here are the results (M
for me, C for Christian):

read, direct=0
==============
M -- read : io=6008.0MB, bw=203264KB/s, iops=49, runt= 30267msec
C -- read: IOPS=40, BW=163MiB/s (171MB/s)(7556MiB/46320msec)

read, direct=1
==============
M -- read : io=32768MB, bw=1991.4MB/s, iops=497, runt= 16455msec
C -- read: IOPS=314, BW=1257MiB/s (1318MB/s)(32.0GiB/26063msec)

write, direct=0
===============
M -- write: io=32768MB, bw=471105KB/s, iops=115, runt= 71225msec
C -- write: IOPS=119, BW=479MiB/s (503MB/s)(32.0GiB/68348msec)

write, direct=1
===============
M -- write: io=32768MB, bw=479829KB/s, iops=117, runt= 69930msec
C -- write: IOPS=139, BW=560MiB/s (587MB/s)(32.0GiB/58519msec)

I should probably mention that there was also some active workload on
the cluster at the time, around 500 IOPS of writes and 100MB/s of
throughput.
The main problem we're having with this cluster is how easily it hits
slow requests; we have one particular VM that ends up doing SCSI
resets because of the latency.
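(For reference, slow requests can usually be pinned down to specific OSDs and operations; osd.12 below is a placeholder id:)

  ceph health detail                    # names the OSDs currently reporting slow requests
  ceph daemon osd.12 dump_historic_ops  # on that OSD's host: slowest recent ops with per-stage timings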

So we're considering switching these OSDs to Filestore.
We have two other clusters using Filestore/bcache/SSD journals and,
allowing for the different cluster sizes, performance seems to be much
better on those.
What are people's thoughts on a cluster of this size? Is it just not a
good fit for Bluestore and our type of workload?
Also, does anyone have any knowledge of future support for Filestore?
I'm concerned that we may have to migrate our other clusters off
Filestore sometime in the future, and that'll hurt us given the
current performance.

Rich
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


