Hello everybody,
Before anything else, I want to say thanks to Jens and everybody who has
put their time into this great project :)
I recently started using fio to benchmark hard disks and possible storage
setups, and to learn more about the workings of the layers involved. Now I
have run into a question.
I'm sorry if I'm missing something here - my knowledge of the topic is
certainly lacking... However, any help is appreciated.
On an ext4 file system, two files cannot share the same block, so reading
or writing a lot of small files introduces considerable overhead relative
to the actual amount of useful data retrieved from disk.
I want to take this into account when modelling workloads with fio, but
issuing a command like
fio --name=test --rw=randread --blocksize=4096 --size=1m --nrfiles=100
--directory=path/on/ext4-formatted/hard-disk/
will make fio read only 800 KiB, or 200 blocks. Each file occupies roughly
2.5 blocks, so the partially filled last block of each file is simply
ignored.
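To spell out the arithmetic (assuming fio divides --size evenly across
--nrfiles, which is its default behaviour as far as I can tell):

1 MiB / 100 files ~= 10,486 bytes per file ~= 2.56 blocks of 4,096 bytes
only the 2 full blocks of each file are read:
2 blocks * 100 files * 4 KiB = 800 KiB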
Looking into a possible solution, I figured I could issue the parameter

bssplit=2048/n:4096/(1-n)

with

n = 1 / ceil((file size) / (block size))

(n expressed as a percentage; 2048 is the mean payload of a partial block)
to simulate the overhead for reads/writes of partially occupied blocks and
get a meaningful throughput number for this kind of scenario (assuming
that in reality, (file size) mod (block size) is uniformly distributed).
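To make the idea concrete, here is a small Python sketch that derives the
split for the example job above (the variable names are mine; please treat
it as an illustration of the calculation rather than a tested recipe):

import math

block_size = 4096
file_size = (1 * 1024 * 1024) // 100        # ~10,485 bytes per file, as above

# Each file ends in one partial block, so the fraction n of per-file
# blocks that are partial is 1 / ceil(file_size / block_size).
blocks_per_file = math.ceil(file_size / block_size)   # 3
n = 1.0 / blocks_per_file                             # 1/3

# bssplit takes blocksize/percentage pairs separated by colons.
# block_size / 2 = 2048 is the mean payload of a partial block if
# (file size) mod (block size) is uniformly distributed.
partial_pct = round(n * 100)                          # 33
full_pct = 100 - partial_pct                          # 67
print("--bssplit=%d/%d:%d/%d"
      % (block_size // 2, partial_pct, block_size, full_pct))

For this job it prints --bssplit=2048/33:4096/67.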
Does this make sense? Are there maybe better solutions?
Thanks
gw