Slow random write access

Andrew Murray <amurray@xxxxxxxxxxxxxxxxxxxx> · Tue, 6 Jun 2023 12:37:49 +0100

Hello,

I've been working with an embedded video recording device that writes
data to the exfat filesystem of an SD card. 4 files (streams of video)
are written simultaneously by processes usually running on different
cores.

On a newly formatted SD card, exfat clusters appear to be allocated
sequentially as data is written (by userspace), so the clusters end up
interleaved across the 4 files. When writeback occurs, you see a run
of sequential writes for each file, but with gaps where clusters are
used by other files (as if writeback is happening per process at
different times). This results in a random access pattern. Ideally an
IO scheduler would recognise that these are cooperating processes and
merge these 4 runs of activity into a sequential write pattern made up
of fewer, larger writes.
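
To make the pattern concrete, here is a toy model of it (not exfat or
kernel code). It assumes strict round-robin allocation, one cluster
per file per turn, which is a simplification of what happens on the
device; the function names are made up for illustration.

```python
def allocate_interleaved(num_files, clusters_per_file):
    """Hand out cluster numbers sequentially, round-robin across files."""
    layout = {f: [] for f in range(num_files)}
    next_cluster = 0
    for _ in range(clusters_per_file):
        for f in range(num_files):
            layout[f].append(next_cluster)
            next_cluster += 1
    return layout


def per_file_writeback(layout):
    """Order in which clusters reach the device if each file is flushed in turn."""
    order = []
    for f in sorted(layout):
        order.extend(layout[f])
    return order


def count_jumps(order):
    """Count transitions where the next cluster is not directly adjacent."""
    return sum(1 for a, b in zip(order, order[1:]) if b != a + 1)


layout = allocate_interleaved(num_files=4, clusters_per_file=3)
flushed = per_file_writeback(layout)
print(flushed)                       # per-file order, with gaps
print(count_jumps(flushed))          # jumps when flushing file by file
print(count_jumps(sorted(flushed)))  # jumps if the same writes were sorted
```

In this model every transition in the per-file order is a jump, while
the same set of writes sorted by cluster number has none.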

This is important for SD cards, because their write throughput is very
dependent on access pattern and write request size. For example, with
my current SD card the above access pattern (with writes averaging
60KB) results in a write throughput of less than a few MB/s for a
fully utilised device. This may seem contrary to the performance
claims of SD card manufacturers, but those claims are typically made
for sequential access with 512KB writes. Further, the claims made for
the UHS speed grades (e.g. U3) and the video class grades (e.g. V90)
also assume that specific SD card commands are used to enter a
specific speed grade mode (which doesn't appear to be supported in
Linux). In other words, larger writes and more sequential access
patterns will increase the available bandwidth. (The only exception
appears to be the application classes of SD cards, which are optimised
for random access at 4KB.)

I've explored the various mq schedulers (I'm still learning). I
understand that, to prevent software being a bottleneck for fast
devices, each core (or is that process?) has its own queue. As a result
schedulers can't make decisions across those queues (as that would
defeat the point of mq). This implies that in my use-case, where the
"cooperating processes" are on separate cores, the scheduler has no
way to combine the interleaved writes (I notice that bfq has logic to
detect this, though I'm not sure if it's for reads or rotational
devices).
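
For anyone wanting to reproduce this, the active and available
schedulers for a device can be read from sysfs. A minimal sketch: the
device name "mmcblk0" is an assumption (substitute your own), and the
helper names are made up for illustration.

```python
from pathlib import Path


def parse_scheduler_line(line):
    """Split e.g. '[mq-deadline] kyber bfq none' into (active, available)."""
    active = None
    available = []
    for token in line.split():
        # The kernel marks the active scheduler with square brackets.
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def read_scheduler(device="mmcblk0"):
    """Read the scheduler list for a block device (device name is assumed)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_scheduler_line(path.read_text())
```

Calling read_scheduler() on the target returns the active scheduler
and the list of those available. Writing one of the listed names back
to the same file (as root) switches scheduler.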

I've seen that mq-deadline appears to sort its dispatch queue by
sector (I understand this is a single queue for the device, so this is
where those software queues join). Combined with the write_expire and
fifo_batch tunables, mq-deadline appears to do a good job of turning
interleaved writes into sequential writes (even across processes
running on different cores). However, it doesn't appear to combine
writes, which would greatly help.
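
To illustrate what I mean by combining, here is a rough userspace
sketch of "sort by sector, then merge back-to-back requests". It is
not how mq-deadline is implemented; the (start_sector, num_sectors)
tuples and the max_sectors cap are assumptions made for the example.

```python
def sort_and_merge(requests, max_sectors=1024):
    """Sort (start_sector, num_sectors) writes and merge contiguous ones.

    max_sectors caps the size of a merged request; 1024 sectors of
    512 bytes is 512KB (an illustrative value, not a kernel default).
    """
    merged = []
    for start, length in sorted(requests):
        if merged:
            prev_start, prev_len = merged[-1]
            contiguous = prev_start + prev_len == start
            if contiguous and prev_len + length <= max_sectors:
                # Extend the previous request instead of issuing a new one.
                merged[-1] = (prev_start, prev_len + length)
                continue
        merged.append((start, length))
    return merged


# Four 64KB writes (128 sectors each) arriving out of order.
pending = [(128, 128), (0, 128), (384, 128), (256, 128)]
print(sort_and_merge(pending))
```

With the cap at 1024 sectors the four writes collapse into a single
256KB request. Writes separated by a gap are left as separate
requests.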

Many of the schedulers aim to enforce a maximum latency, but it feels
like for slow devices a minimum write latency, plus the ability to
reorder and combine those writes across cores, would be beneficial.

I'm keen to understand if there is anything I've missed. Perhaps there
are tunables or a specific scheduler that fits this purpose? Are my
assumptions about the mq layer correct?

Does it make sense to add merging in the dispatch queue (within
mq-deadline)? Is this the right approach?

Thanks,

Andrew Murray