Dear all,

I found a partial solution to the problem and also repeated a bit of testing, see below.

# Client-side solution, works for single-client IO

The hard solution is to mount cephfs with the option "sync". This translates all IO to direct IO and successfully throttles clients no matter how they perform IO. It even works in multi-client set-ups.

A somewhat less restrictive option is to set low values for vm.dirty_[background_]bytes to allow some buffered IO for small bursts. I tried the restrictive settings

vm.dirty_background_bytes = 524288
vm.dirty_bytes = 1048576

and the less restrictive settings

vm.dirty_background_bytes = 2097152
vm.dirty_bytes = 67108864

(without the sync mount option), and this seems to have the desired effect: it is possible to obtain good throughput for large IO sizes while limiting IOPs for small IO sizes to a healthy level. Of course, this does not address destructive multi-client IO patterns, which must be handled on the server side. Example commands for both client-side variants are appended at the end of this message.

# Test observations

Today I repeated a shorter test to avoid crashing the cluster badly. We are in production and I don't have a test cluster. Therefore, if anyone could run this on a test cluster and check whether the observations can be confirmed, that would be great. Here is a one-line command:

fio -name=rand-write -directory=/mnt/cephfs/home/frans/fio -filename_format=tmp/fio-\$jobname-\$jobnum-\$filenum -rw=randwrite -bs=4K -numjobs=4 -time_based=1 -runtime=5 -filesize=100G -ioengine=sync -direct=0 -iodepth=1

Adjust runtime and numjobs to increasingly higher values to increase the stress. In my original tests I observed OSD outages already with numjobs=4 and runtime=30. Note that these outages occur several minutes after the fio command completes.

Here are today's observations with "osd_op_queue=wpq", "osd_op_queue_cut_off=high" and a 5 second run time:

- High IOPs (>4kops) on the data pool come in two waves.
- The first wave does not cause slow ops.
- There is a phase of low activity.
- A second wave starts, and now slow meta data ops are reported by the MDS. The health level becomes WARN.
- The cluster crunches through the meta data ops for a minute or so and then settles. This is quite a long time considering a 5 second burst.
- OSDs did not go out, but this could be because the test did not run long enough.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
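
As promised above, here is a minimal sketch of how the client-side settings could be applied on a client. The dirty-page limits are set with sysctl (put them in a file under /etc/sysctl.d/ to make them persistent); the mount line shows the hard "sync" variant for a kernel cephfs client. The monitor address, client name and secret file are placeholders and not from my setup, so adjust them to your environment:

# less restrictive dirty-page limits (values from the test above)
sysctl -w vm.dirty_background_bytes=2097152
sysctl -w vm.dirty_bytes=67108864

# alternative hard throttle: kernel cephfs mount with the generic "sync" option
# (mon address, name and secretfile are placeholders)
mount -t ceph 192.168.1.1:6789:/ /mnt/cephfs -o name=fsuser,secretfile=/etc/ceph/ceph.client.fsuser.secret,sync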
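
The fio one-liner can also be kept as a job file, which makes it a bit easier to bump runtime and numjobs between runs. This is just the same parameters rewritten, nothing new; save it e.g. as rand-write.fio (the file name is arbitrary), and note that the $-variables do not need shell escaping in a job file:

[global]
directory=/mnt/cephfs/home/frans/fio
filename_format=tmp/fio-$jobname-$jobnum-$filenum
rw=randwrite
bs=4K
time_based
runtime=5
filesize=100G
ioengine=sync
direct=0
iodepth=1

[rand-write]
numjobs=4

Run it with "fio rand-write.fio".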
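
Finally, for anyone repeating the test: the OSD settings mentioned above can be checked and changed centrally via the config database on releases that have the "ceph config" commands. This is a generic sketch, not output from my cluster; as far as I know, osd_op_queue and osd_op_queue_cut_off only take effect after an OSD restart:

# check the current values
ceph config get osd osd_op_queue
ceph config get osd osd_op_queue_cut_off

# set wpq/high (requires an OSD restart to take effect)
ceph config set osd osd_op_queue wpq
ceph config set osd osd_op_queue_cut_off high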