Re: Slow random write access

Andrew Murray <amurray@xxxxxxxxxxxxxxxxxxxx> · Wed, 7 Jun 2023 10:03:52 +0100

On Wed, 7 Jun 2023 at 03:54, Damien Le Moal <dlemoal@xxxxxxxxxx> wrote:
>
> On 6/6/23 20:37, Andrew Murray wrote:
> > Hello,
> >
> > I've been working with an embedded video recording device that writes
> > data to the exfat filesystem of an SD card. 4 files (streams of video)
> > are written simultaneously by processes usually running on different
> > cores.
> >
> > If you assume a newly formatted SD card, then because exfat clusters
> > appear to be sequentially allocated as needed when data is written (in
> > userspace), those clusters are interleaved across the 4 files.
> > When writeback occurs, you see a cluster of sequential writes for each
> > file but with gaps where clusters are used by other files (as if
> > writeback is happening per process at different times). This results
> > in a random access pattern. Ideally an IO scheduler would recognise
> > these are cooperating processes, merge these 4 clusters of activities
> > to result in a sequential write pattern and combine into fewer larger
> > writes.
> >
> > This is important for SD cards, because their write throughput is very
> > dependent on access patterns and size of write request. For example my
> > current SD card and above access pattern (with writes averaging 60KB)
> > results in a write throughput for a fully utilised device of less than
> > a few MB/s. This may seem contrary to the performance claims of SD
> > card manufacturers, but those claims are typically made for sequential
> > access with 512KB writes. Further, the claims made for the UHS speed
> > grades, e.g. U3 and the video class grades, e.g. V90 also assume that
> > specific SD card commands are used to enter a specific speed grade
> > mode (which isn't supported in Linux it seems). In other words larger
> > write accesses and more sequential access patterns will increase the
> > available bandwidth. (The only exception appears to be for the
> > application classes of SD cards which are optimised for random access
> > at 4KB).
> >
> > I've explored the various mq schedulers (I'm still learning) - though
> > I understand that to prevent software being a bottleneck for fast
> > devices each core (or is that process?) has its own queue. As a result
> > schedulers can't make decisions across those queues (as that defeats
> > the point of mq). This implies that in my use-case, where
> > "cooperating processes" are on separate cores, there is no
> > capability for the scheduler to combine the interleaved writes (I
> > notice that bfq has logic to detect this, though not sure if it's for
> > reads or rotational devices).
> >
> > I've seen that mq-deadline appears to sort its dispatch queue (I
> > understand a single queue for the device - so this is where those
> > software queues join) by sector - combined with the write_expire and
> > fifo_depth tunables - then it appears that mq-deadline does a good job
> > of turning interleaved writes to sequential writes (even across
> > processes running on different cores). However it doesn't appear to
> > combine writes which would greatly help.
> >
> > Many of the schedulers aim to achieve a maximum latency, but it feels
>
> maximum throughput... Maximizing latency is not something that anyone wants :)

I didn't phrase that well. I was referring to the ability to specify a
worst-case latency, for example the write_expire tunable in
mq-deadline and the fifo_expire_sync tunable in bfq. I guess there is
often a tradeoff between throughput and latency (for example the
low_latency mode of bfq) - if you can hold on to IO requests for
longer, then you may have a better ability to reorder and combine them
which can improve throughput on devices that are sensitive to IO size
and ordering.
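
To illustrate that tradeoff, here is a toy model (not kernel code, and
not how any real scheduler is implemented). It assumes a 32 KiB cluster
(64 sectors of 512 bytes), that cluster k of file f lands at on-disk
cluster index k * 4 + f, that each file is flushed in turn, and that
there is no upper limit on request size. The names Request,
interleaved_writeback, sort_and_merge and dispatch are made up for this
sketch. The "window" stands in for how long requests may be held before
dispatch.

```python
from dataclasses import dataclass

CLUSTER_SECTORS = 64  # assumption: 32 KiB clusters, 512-byte sectors


@dataclass
class Request:
    sector: int  # start sector of the write
    nsect: int   # length of the write in sectors


def interleaved_writeback(n_files=4, clusters_per_file=8):
    """Arrival order when each file is flushed in turn and cluster k of
    file f sits at on-disk cluster index k * n_files + f (assumed layout)."""
    reqs = []
    for f in range(n_files):
        for k in range(clusters_per_file):
            start = (k * n_files + f) * CLUSTER_SECTORS
            reqs.append(Request(start, CLUSTER_SECTORS))
    return reqs


def sort_and_merge(reqs):
    """Sort by start sector, then coalesce requests that are contiguous."""
    out = []
    for r in sorted(reqs, key=lambda r: r.sector):
        if out and out[-1].sector + out[-1].nsect == r.sector:
            out[-1] = Request(out[-1].sector, out[-1].nsect + r.nsect)
        else:
            out.append(Request(r.sector, r.nsect))
    return out


def dispatch(reqs, window):
    """Hold up to `window` requests, then sort, merge and dispatch them."""
    dispatched = []
    for i in range(0, len(reqs), window):
        dispatched.extend(sort_and_merge(reqs[i:i + window]))
    return dispatched
```

With these assumptions, dispatch(interleaved_writeback(), 8) still
issues 32 separate 32 KiB writes, because each window only ever sees one
file's clusters. A window of 16 halves that to 16 writes, and a window
of 32 collapses everything into a single contiguous write. Holding
requests for longer is what gives the sort-and-merge step something to
work with.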

>
> > like for slow devices, then a minimum write latency and ability to
> > reorder and combine those writes across cores would be beneficial.
> > I'm keen to understand if there is anything I've missed? Perhaps there
> > are tuneables or a specific scheduler that fits this purpose? Are my
> > assumptions about the mq layer correct?
> >
> > Does it make sense to add merging in the dispatch queue (within
> > mq-deadline), is this the right approach?
>
> Try with "none" scheduler, that is no scheduler at all.

This won't improve throughput for the use-case I described.

As I understand it, with the "none" scheduler the software queues
(associated with each core) get dispatched directly to the single
(in my case) queue for the device. The block layer may perform very
simple merges on entry to the queues, but without an IO scheduler
nothing more advanced will be performed. As writes are interleaved
across queues, this merging is ineffective. This results in small
interleaved writes. I'm looking for a way to coax Linux to reorder
(which I can see the mq-deadline scheduler does in the context of a
single hardware queue) and combine the reordered requests, with the
view that this will result in more sequential writes of a larger block
size (which most SD cards are happier with).
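
When comparing schedulers like this, it helps to confirm which one is
actually active for the device. A minimal sketch, assuming the card
shows up as mmcblk0 (a guess, adjust for your system); the helper names
parse_scheduler and read_scheduler are made up for this example. The
sysfs file lists the available schedulers with the active one in square
brackets.

```python
from pathlib import Path


def parse_scheduler(text):
    """Split the contents of queue/scheduler into (active, available).

    The file looks like "[mq-deadline] kyber bfq none": every scheduler
    the kernel offers, with the active one in square brackets.
    """
    names = text.split()
    active = next(
        (n[1:-1] for n in names if n.startswith("[") and n.endswith("]")),
        None,
    )
    available = [n.strip("[]") for n in names]
    return active, available


def read_scheduler(dev="mmcblk0"):
    """Read the scheduler setting for `dev` (device name is an assumption)."""
    path = Path(f"/sys/block/{dev}/queue/scheduler")
    return parse_scheduler(path.read_text())
```

The same directory holds the per-scheduler tunables under
queue/iosched/ (for example write_expire when mq-deadline is active), so
it is worth recording both alongside any throughput measurement.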

Thanks,

Andrew Murray

>
> >
> > Thanks,
> >
> > Andrew Murray
>
> --
> Damien Le Moal
> Western Digital Research
>