Hello,

First of all, allow me to thank the FIO developers for providing this very complete tool for benchmarking storage setups.

In the context of my work, I'm trying to compare two storage setups using FIO, to prepare for a hardware evolution of one of our services. As the use-case is well understood, I tried to reproduce it in the FIO configuration file that you'll find later in this email.

To give you a bit more context, we use this hardware to run a home-written Content Delivery Cache software, which handles multiple layers of cache to distribute pieces of data whose sizes range from 16k to 10MB. Since we have metrics on the actual usage of the current software, we know how accesses are spread across the various size ranges, and we know that the multi-threaded service relies on a huge number of files. As the pieces of data can live a long time on this service and are immutable, I'm aiming for a WORM-style workload with FIO.

With this information in mind, I built the following FIO configuration file:

>>>>
[global]
# File-related config
directory=/mnt/test-mountpoint
nrfiles=3000
file_service_type=random
create_on_open=1
allow_file_create=1
filesize=16k-10m

# IO type config
rw=randrw
unified_rw_reporting=0
randrepeat=0
fallocate=none
end_fsync=0
overwrite=0
fsync_on_close=1
rwmixread=90
# In an attempt to reproduce a usage skew similar to our service's,
# spread IOs unevenly, skewed toward a part of the dataset:
# - 60% of IOs on 20% of data,
# - 20% of IOs on 30% of data,
# - 20% of IOs on 50% of data
random_distribution=zoned:60/20:20/30:20/50
# 100% random reads, 0% random writes (thus sequential)
percentage_random=100,0
# Likewise, configure different blocksizes for seq (write) & random (read) ops
bs_is_seq_rand=1
blocksize_range=128k-10m,
# Here are the blocksize distributions retrieved from our metrics over 3 hours.
# Ideally it would be random within ranges, but this mode only uses
# fixed-size blocks, so we'll consider it good enough.
bssplit=,8k/10:16k/7:32k/9:64k/22:128k/21:256k/12:512k/14:1m/3:10m/2

# Threads/processes/job sync settings
thread=1

# IO/data verify options
verify=null # Don't consume CPU, please!

# Measurements and reporting settings
#per_job_logs=1
disk_util=1

# IO engine config
ioengine=libaio

[cache-layer2]
# Job settings
time_based=1
runtime=60
numjobs=175
size=200M
<<<<<

With this configuration, I'm forced to use the CLI option "--alloc-size=256M", otherwise the preparatory memory allocation fails and FIO aborts. Despite this setting, I still hit the following issues, which I don't understand well enough to fix without your kind help:

 - OOM messages once the run starts.
 - Setting "norandommap" does not seem to help, although I thought the memory issue came from the "randommap" kept for my many files & workers.
 - It's impossible to increase the number of jobs/threads or files: as soon as I do, I'm back to the memory pre-allocation failure, and no amount of memory seems to fix it (1g, 10g, etc.).
 - With these blockers, it seems impossible to push my current FIO workload to the point of saturating my hardware (which is my aim).
 - I observe that if I increase "size", "numjobs" or "--alloc-size", the READ throughput measured by FIO goes down, while the WRITE throughput increases. I understand that increasing the size of a sequential write workload increases its throughput, but I'm at a loss in front of the READ throughput behavior.
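For completeness, the full invocation looks like this (the job file name below is just a placeholder, not meaningful):

    fio --alloc-size=256M cdn-worm.fio    # "cdn-worm.fio" stands for my actual job file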
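And in case the scale is relevant, a quick back-of-the-envelope count of the files involved, assuming I read my own configuration correctly:

    numjobs (175) x nrfiles (3000) = 525,000 files in total

so I wouldn't be surprised if some per-file bookkeeping is where the memory goes, but I don't know FIO's internals well enough to confirm that.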
Do you have any advice on the configuration parameters I'm using, to help me push my hardware further towards its limits? Is there some mechanism within FIO that I'm misunderstanding, which is causing these difficulties?

Thank you in advance for your kind advice and help,

--
David Pineau