Dear all,

I found a partial solution to the problem and also repeated a bit of testing, see below.

# Client-side solution, works for single-client IO

The hard solution is to mount cephfs with the option "sync". This translates all IO to direct IO and successfully throttles clients no matter how they perform IO. It even works in multi-client set-ups.

A somewhat less restrictive option is to set low values for vm.dirty_[background_]bytes to allow some buffered IO for small bursts. I tried the restrictive settings

vm.dirty_background_bytes = 524288
vm.dirty_bytes = 1048576

and the less restrictive settings

vm.dirty_background_bytes = 2097152
vm.dirty_bytes = 67108864

(without the sync mount option), and this seems to have the desired effect: it is possible to obtain good throughput for large IO sizes while limiting IOPs for small IO sizes to a healthy level. Of course, this does not address destructive multi-client IO patterns, which must be handled on the server side. Example commands for both client-side variants are appended at the end of this message.

# Test observations

Today I repeated a shorter test to avoid crashing the cluster badly. We are in production and I don't have a test cluster. Therefore, if anyone could run this on a test cluster and check whether the observations can be confirmed, that would be great. Here is a one-line command:

fio -name=rand-write -directory=/mnt/cephfs/home/frans/fio -filename_format=tmp/fio-\$jobname-\$jobnum-\$filenum -rw=randwrite -bs=4K -numjobs=4 -time_based=1 -runtime=5 -filesize=100G -ioengine=sync -direct=0 -iodepth=1

Adjust runtime and numjobs to increasingly higher values to increase the stress. In my original tests I observed OSD outages already with numjobs=4 and runtime=30. Note that these outages occur several minutes after the fio command completes.

Here are today's observations with "osd_op_queue=wpq", "osd_op_queue_cut_off=high" and a 5 second run time:

- High IOPs (>4kops) on the data pool come in two waves.
- The first wave does not cause slow ops.
- There is a phase of low activity.
- A second wave starts, and now slow meta data ops are reported by the MDS. The health level becomes WARN.
- The cluster crunches through the meta data ops for a minute or so and then settles. This is quite a long time considering a 5 second burst.
- OSDs did not go out, but this could be because the test did not run long enough.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
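
As promised above, here is a minimal sketch of how the client-side settings could be applied on a client. The dirty-page limits are set with sysctl (put them in a file under /etc/sysctl.d/ to make them persistent); the mount line shows the hard "sync" variant for a kernel cephfs client. The monitor address, client name and secret file are placeholders and not from my setup, so adjust them to your environment:

# less restrictive dirty-page limits (values from the test above)
sysctl -w vm.dirty_background_bytes=2097152
sysctl -w vm.dirty_bytes=67108864

# alternative hard throttle: kernel cephfs mount with the generic "sync" option
# (mon address, name and secretfile are placeholders)
mount -t ceph 192.168.1.1:6789:/ /mnt/cephfs -o name=fsuser,secretfile=/etc/ceph/ceph.client.fsuser.secret,sync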
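
The fio one-liner can also be kept as a job file, which makes it a bit easier to bump runtime and numjobs between runs. This is just the same parameters rewritten, nothing new; save it e.g. as rand-write.fio (the file name is arbitrary), and note that the $-variables do not need shell escaping in a job file:

[global]
directory=/mnt/cephfs/home/frans/fio
filename_format=tmp/fio-$jobname-$jobnum-$filenum
rw=randwrite
bs=4K
time_based
runtime=5
filesize=100G
ioengine=sync
direct=0
iodepth=1

[rand-write]
numjobs=4

Run it with "fio rand-write.fio".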
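
Finally, for anyone repeating the test: the OSD settings mentioned above can be checked and changed centrally via the config database on releases that have the "ceph config" commands. This is a generic sketch, not output from my cluster; as far as I know, osd_op_queue and osd_op_queue_cut_off only take effect after an OSD restart:

# check the current values
ceph config get osd osd_op_queue
ceph config get osd osd_op_queue_cut_off

# set wpq/high (requires an OSD restart to take effect)
ceph config set osd osd_op_queue wpq
ceph config set osd osd_op_queue_cut_off high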