On 12/03/2021 07:05, Philip Brown wrote:
I'm running some tests with mixed storage device types, on Octopus.
8 nodes, each with 2 SSDs and 8 HDDs.
The SSDs are relatively small: around 100 GB each.
I'm mapping 8 RBDs, striping them together, and running fio on the result for testing.
# fio --filename=/...../fio.testfile --size=120GB --rw=randrw --bs=8k --direct=1 --ioengine=libaio --iodepth=64 --numjobs=4 --time_based --group_reporting --name=readwritelatency-test-job --runtime=120 --eta-newline=1
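(For what it's worth, the mapping/striping setup is roughly along the lines below -- LVM striping is just one way to glue the 8 devices together, and the pool, image, and volume-group names are only placeholders:)
# for i in $(seq 0 7); do rbd map testpool/testimg$i; done
# vgcreate vg_bench /dev/rbd{0..7}                                  # one VG over the 8 mapped devices
# lvcreate --stripes 8 --stripesize 64k --extents 100%FREE --name lv_bench vg_bench
# mkfs.xfs /dev/vg_bench/lv_bench && mount /dev/vg_bench/lv_bench /mnt/bench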
Trouble is, I'm seeing sporadic delays on IOs.
When I test ZFS, for example, it has a neat latency-histogram status check:
zpool iostat -w 20
and it shows me that some write IOs are taking over 4 seconds to complete; many are taking 1s or 2s.
This kind of thing has sort of happened before (though previously, I think I was using SSDs exclusively). When I emailed the list, people suggested turning off the RBD cache, which worked great in that situation.
This time, I have already done that (I believe), but still see this behavior.
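(For reference, disabling it comes down to something like the following -- shown here via the config database; the equivalent ceph.conf [client] setting is the other route:)
# ceph config set client rbd_cache false
or in ceph.conf on the client:
[client]
rbd cache = false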
Would folks have any further suggestions to smooth performance out?
The odd thing is, I read that BlueStore is supposed to smooth things out and provide consistent response times, but that doesn't seem to be the case here.
Sample output from the zpool iostat below:
twelve      total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
.. snip ...
1ms         1.01K      0  1.00K      0      0      0      0      0      0      0
2ms            29      0     29     18      0      0      0      1      0      0
4ms            23      3     23     14      0      0      0      3      0      0
8ms            64      6     64      9      0      0      0      7      0      0
16ms           74     10     74     59      0      0      0     11      0      0
33ms           24     17     24    154      0      0      0     19      0      0
67ms            7     25      7    100      0      0      0     26      0      0
134ms           3     40      3     36      0      0      0     36      0      0
268ms           1     59      1     18      0      0      0     59      0      0
536ms           0    116      0      3      0      0      0    113      0      0
1s              0    109      0      0      0      0      0     98      0      0
2s              0     24      0      0      0      0      0     20      0      0
4s              0      2      0      0      0      0      0      1      0      0
8s              0      0      0      0      0      0      0      0      0      0
17s             0      0      0      0      0      0      0      0      0      0
--
Philip Brown | Sr. Linux System Administrator | Medata, Inc.
5 Peters Canyon Rd Suite 250
Irvine CA 92606
Office 714.918.1310 | Fax 714.918.1325
pbrown@xxxxxxxxxx | www.medata.com
First, it is not a good idea to mix SSD and HDD OSDs in the same pool; in a real deployment you would create separate pools for each device type. In your case I do not think the 2 SSDs being mixed in are affecting the outcome; it is probably better to think of them as not being there. What I believe you are seeing is an inappropriate workload being tested against a cluster of 64 HDDs. I am pretty sure the disks have reached 100% saturation (%busy).
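If you do split the cluster by device class later, a rough sketch (rule names, pool names, and PG counts below are just examples):
# ceph osd crush rule create-replicated rule_hdd default host hdd
# ceph osd crush rule create-replicated rule_ssd default host ssd
# ceph osd pool create rbd_hdd 512 512 replicated rule_hdd
# ceph osd pool create rbd_ssd 128 128 replicated rule_ssd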
Your fio run is queue depth 64 x 4 jobs = 256 outstanding IOs. Each client write op is multiplied by the number of replicas (default x3), and then by the write amplification of the OSD itself, which adds read and write ops to the RocksDB metadata store on top of the data write (an external WAL/DB on SSD will help here). Depending on how you are striping your RBD devices together, the write streams may be magnified further at the device-mapper layer. You are pushing your HDDs to their limit.
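Back-of-the-envelope, ignoring the RocksDB and device-mapper overheads:
  64 QD x 4 jobs   = 256 client IOs in flight
  256 x 3 replicas = 768 backend writes in flight
  768 / 64 HDDs    = ~12 concurrent 8k random writes per spindle
With each spindle good for very roughly 100-200 random IOPS, queues build up quickly and the latency tail grows.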
If you need to support a workload like this, use an SSD pool. If you must use HDDs, an external SSD WAL/DB is essential for even moderate workloads with small random block sizes.
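For reference, an OSD with its DB (and WAL) on a separate SSD device is created roughly like this (device paths are just examples):
# ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1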
The irregular "tail" latency you are getting is because the HDDs are saturated by the load. On the storage side you can protect against too much load (relative to your hardware) by limiting the queue depth at map time:
rbd map --options queue_depth=XX
This will effectively override the depths you specify in your fio test, so the numbers may not mean much as a benchmark, but it should produce less irregular latencies.
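For example (the image name and depth value below are only illustrative; pick something your spindle count can actually sustain):
# rbd map --options queue_depth=32 testpool/testimg0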
/maged