Re: Question about delayed write IOs, octopus, mixed storage


 




On 12/03/2021 07:05, Philip Brown wrote:
I'm running some tests with mixed storage units and Octopus:
8 nodes, each with 2 SSDs and 8 HDDs.
The SSDs are relatively small, around 100 GB each.

I'm mapping 8 RBDs, striping them together, and running fio on them for testing.

# fio --filename=/...../fio.testfile --size=120GB --rw=randrw --bs=8k --direct=1 --ioengine=libaio  --iodepth=64 --numjobs=4 --time_based --group_reporting --name=readwritelatency-test-job --runtime=120 --eta-newline=1


Trouble is, I'm seeing sporadic delays of IOs.

When I test ZFS, for example, it has this neat wait clumping status check:
zpool iostat -w 20

and it shows me that some write IOs are taking over 4 seconds to complete. Many are taking 1s or 2s.


This kind of thing has sort of happened before (but previously, I think I was using SSDs exclusively). When I emailed the list, people suggested turning off RBD cache, which worked great in that situation.

This time, I have already done that (I believe), but I still see this behavior.
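(For reference, a rough sketch of how I checked the setting; the "[client]" scope is my assumption, and as I understand it krbd-mapped devices bypass the librbd cache anyway:)

# in ceph.conf, under [client]:
#   rbd cache = false
# or verify what is set in the mon config database:
ceph config dump | grep rbd_cache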

Would folks have any further suggestions to smooth performance out?
The odd thing is, I read that BlueStore is supposed to smooth things out and provide consistent response times, but that doesn't seem to be the case.



Sample output from the zpool iostat below:

twelve       total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
  .. snip ...

1ms         1.01K      0  1.00K      0      0      0      0      0      0      0
2ms            29      0     29     18      0      0      0      1      0      0
4ms            23      3     23     14      0      0      0      3      0      0
8ms            64      6     64      9      0      0      0      7      0      0
16ms           74     10     74     59      0      0      0     11      0      0
33ms           24     17     24    154      0      0      0     19      0      0
67ms            7     25      7    100      0      0      0     26      0      0
134ms           3     40      3     36      0      0      0     36      0      0
268ms           1     59      1     18      0      0      0     59      0      0
536ms           0    116      0      3      0      0      0    113      0      0
1s              0    109      0      0      0      0      0     98      0      0
2s              0     24      0      0      0      0      0     20      0      0
4s              0      2      0      0      0      0      0      1      0      0
8s              0      0      0      0      0      0      0      0      0      0
17s             0      0      0      0      0      0      0      0      0      0







--
Philip Brown| Sr. Linux System Administrator | Medata, Inc.
5 Peters Canyon Rd Suite 250
Irvine CA 92606
Office 714.918.1310| Fax 714.918.1325
pbrown@xxxxxxxxxx| www.medata.com

First, it is not a good idea to mix SSD and HDD OSDs in the same pool; in a real deployment you would create a separate pool for each device type. In your case I do not think the 2 SSDs being mixed in are affecting the outcome, so it is probably better to assume they are not there. I think what you are seeing is an inappropriate workload being tested against a cluster of 64 HDDs. I am pretty sure the disks have reached 100% saturation (% busy).
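For example, something along these lines keeps the two device types in separate pools (the rule and pool names below are just placeholders):

# one CRUSH rule per device class, then point each pool at the matching rule
ceph osd crush rule create-replicated replicated-hdd default host hdd
ceph osd crush rule create-replicated replicated-ssd default host ssd
ceph osd pool set <your-hdd-pool> crush_rule replicated-hdd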

Your fio is queue depth 64 x 4 jobs = 256 in-flight ops. Each client write op is multiplied by the number of replicas (default x3), then by the amplification factor of the OSD, which adds read and write ops to the RocksDB metadata store on top of the data write itself (an external WAL/DB on SSD will help). And depending on how you are striping/mirroring your RBD devices together, the write streams can be further magnified at the device-mapper layer. You are pushing your HDDs to their limit.
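As a rough back-of-envelope (the per-disk IOPS figure is an assumption, typical of 7.2K spindles):

  4 jobs x iodepth 64                 =  256 client writes in flight
  x 3 replicas                        ~  768 backend writes in flight
  + RocksDB metadata I/O per write    =  well over 1000 disk ops outstanding
  vs. 64 HDDs x ~100-200 random IOPS  ~  6K-13K IOPS the spindles can sustain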

If you need to support a similar workload, use an SSD pool. If you do need to use HDDs, an external SSD WAL/DB is a must for even moderate workloads with small random block sizes.
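Something along these lines when (re)creating an OSD, with the device paths as placeholders:

# HDD holds the data, an SSD partition/LV holds the RocksDB metadata (and WAL)
ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/sdY1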

The irregular "tail" latency you are seeing is because the HDDs are saturated due to load. From the storage side you can protect against too much load (relative to your hardware) by limiting the queue depth:
rbd map --options queue_depth=XX

This will effectively override the depths you specify in your fio test, so it may not make much sense for your benchmark, but it should produce less irregular latencies.
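If you want to confirm the effective depth on a mapped device (rbd0 below is just an example), the block layer exposes it in sysfs:

cat /sys/block/rbd0/queue/nr_requests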

/maged
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx





