Re: Question about delayed write IOs, octopus, mixed storage


 




On 12/03/2021 07:05, Philip Brown wrote:
I'm running some tests with mixed storage units and Octopus:
8 nodes, each with 2 SSDs and 8 HDDs.
The SSDs are relatively small, around 100 GB each.

I'm mapping 8 RBDs, striping them together, and running fio on them for testing.

# fio --filename=/...../fio.testfile --size=120GB --rw=randrw --bs=8k --direct=1 --ioengine=libaio  --iodepth=64 --numjobs=4 --time_based --group_reporting --name=readwritelatency-test-job --runtime=120 --eta-newline=1


Trouble is, I'm seeing sporadic delays of IOs.

When I test ZFS, for example, it has this neat wait clumping status check:
zpool iostat -w 20

and it shows me that some write IOs are taking over 4 seconds to complete. Many are taking 1s or 2s.


This kind of thing has sort of happened before (but previously, I think I was using SSDs exclusively). When I emailed the list, people suggested turning off RBD cache, which worked great in that situation.

This time, I have already done that (I believe), but I still see this behavior.
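(For reference, a rough sketch of how I checked the setting; the "[client]" scope is my assumption, and as I understand it krbd-mapped devices bypass the librbd cache anyway:)

# in ceph.conf, under [client]:
#   rbd cache = false
# or verify what is set in the mon config database:
ceph config dump | grep rbd_cache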

Would folks have any further suggestions to smooth performance out?
The odd thing is, I read that BlueStore is supposed to smooth things out and provide consistent response times, but that doesn't seem to be the case.



Sample output from the zpool iostat below:

twelve       total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
  .. snip ...

1ms         1.01K      0  1.00K      0      0      0      0      0      0      0
2ms            29      0     29     18      0      0      0      1      0      0
4ms            23      3     23     14      0      0      0      3      0      0
8ms            64      6     64      9      0      0      0      7      0      0
16ms           74     10     74     59      0      0      0     11      0      0
33ms           24     17     24    154      0      0      0     19      0      0
67ms            7     25      7    100      0      0      0     26      0      0
134ms           3     40      3     36      0      0      0     36      0      0
268ms           1     59      1     18      0      0      0     59      0      0
536ms           0    116      0      3      0      0      0    113      0      0
1s              0    109      0      0      0      0      0     98      0      0
2s              0     24      0      0      0      0      0     20      0      0
4s              0      2      0      0      0      0      0      1      0      0
8s              0      0      0      0      0      0      0      0      0      0
17s             0      0      0      0      0      0      0      0      0      0







--
Philip Brown| Sr. Linux System Administrator | Medata, Inc.
5 Peters Canyon Rd Suite 250
Irvine CA 92606
Office 714.918.1310| Fax 714.918.1325
pbrown@xxxxxxxxxx| www.medata.com

First, it is not a good idea to mix SSD and HDD OSDs in the same pool; in a real deployment you would create a separate pool for each device type. In your case I do not think the 2 SSDs being mixed in are affecting the outcome, so it is probably better to assume they are not there. I think what you are seeing is an inappropriate workload being tested against a cluster of 64 HDDs. I am pretty sure the disks have reached 100% saturation (% busy).
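For example, something along these lines keeps the two device types in separate pools (the rule and pool names below are just placeholders):

# one CRUSH rule per device class, then point each pool at the matching rule
ceph osd crush rule create-replicated replicated-hdd default host hdd
ceph osd crush rule create-replicated replicated-ssd default host ssd
ceph osd pool set <your-hdd-pool> crush_rule replicated-hdd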

Your fio is queue depth 64 x 4 jobs = 256 in-flight ops. Each client write op is multiplied by the number of replicas (default x3), then by the amplification factor of the OSD, which adds read and write ops to the RocksDB metadata store on top of the data write itself (an external WAL/DB on SSD will help). And depending on how you are striping/mirroring your RBD devices together, the write streams can be further magnified at the device-mapper layer. You are pushing your HDDs to their limit.
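As a rough back-of-envelope (the per-disk IOPS figure is an assumption, typical of 7.2K spindles):

  4 jobs x iodepth 64                 =  256 client writes in flight
  x 3 replicas                        ~  768 backend writes in flight
  + RocksDB metadata I/O per write    =  well over 1000 disk ops outstanding
  vs. 64 HDDs x ~100-200 random IOPS  ~  6K-13K IOPS the spindles can sustain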

If you need to support a similar workload, use an SSD pool. If you do need to use HDDs, an external SSD WAL/DB is a must for even moderate workloads with small random block sizes.
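Something along these lines when (re)creating an OSD, with the device paths as placeholders:

# HDD holds the data, an SSD partition/LV holds the RocksDB metadata (and WAL)
ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/sdY1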

The irregular "tail" latency you are seeing is because the HDDs are saturated due to load. From the storage side you can protect against too much load (relative to your hardware) by limiting the queue depth:
rbd map --options queue_depth=XX

This will effectively override the depths you specify in your fio test, so it may not make much sense for your benchmark, but it should produce less irregular latencies.
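If you want to confirm the effective depth on a mapped device (rbd0 below is just an example), the block layer exposes it in sysfs:

cat /sys/block/rbd0/queue/nr_requests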

/maged
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx





