On 6/6/23 20:37, Andrew Murray wrote:
> Hello,
>
> I've been working with an embedded video recording device that writes
> data to the exfat filesystem of an SD card. 4 files (streams of video)
> are written simultaneously by processes usually running on different
> cores.
>
> If you assume a newly formatted SD card, then because exfat clusters
> appear to be sequentially allocated as needed when data is written (in
> userspace), those clusters are interleaved across the 4 files. When
> writeback occurs, you see a cluster of sequential writes for each
> file, but with gaps where clusters are used by other files (as if
> writeback is happening per process at different times). This results
> in a random access pattern. Ideally an IO scheduler would recognise
> that these are cooperating processes, merge these 4 clusters of
> activity to produce a sequential write pattern, and combine them into
> fewer, larger writes.
>
> This is important for SD cards because their write throughput is very
> dependent on access patterns and the size of write requests. For
> example, my current SD card with the above access pattern (writes
> averaging 60KB) achieves a write throughput, for a fully utilised
> device, of less than a few MB/s. This may seem contrary to the
> performance claims of SD card manufacturers, but those claims are
> typically made for sequential access with 512KB writes. Further, the
> claims made for the UHS speed grades (e.g. U3) and the video class
> grades (e.g. V90) also assume that specific SD card commands are used
> to enter a specific speed grade mode (which doesn't seem to be
> supported in Linux). In other words, larger write accesses and more
> sequential access patterns will increase the available bandwidth. (The
> only exception appears to be the application classes of SD cards,
> which are optimised for random access at 4KB.)
>
> I've explored the various mq schedulers (I'm still learning) - though
> I understand that, to prevent software being a bottleneck for fast
> devices, each core (or is that process?) has its own queue. As a
> result, schedulers can't make decisions across those queues (as that
> would defeat the point of mq). This implies that in my use case, where
> "cooperating processes" are on separate cores, there is no capability
> for the scheduler to combine the interleaved writes (I notice that bfq
> has logic to detect this, though I'm not sure if it's for reads or
> rotational devices).
>
> I've seen that mq-deadline appears to sort its dispatch queue (as I
> understand it, a single queue for the device - so this is where those
> software queues join) by sector. Combined with the write_expire and
> fifo_batch tunables, mq-deadline appears to do a good job of turning
> interleaved writes into sequential writes (even across processes
> running on different cores). However, it doesn't appear to combine
> writes, which would greatly help.
>
> Many of the schedulers aim to achieve a maximum latency, but it feels

maximum throughput... Maximizing latency is not something that anyone
wants :)

> like for slow devices, a minimum write latency and the ability to
> reorder and combine those writes across cores would be beneficial.
> I'm keen to understand if there is anything I've missed. Perhaps there
> are tunables or a specific scheduler that fits this purpose? Are my
> assumptions about the mq layer correct?
>
> Does it make sense to add merging in the dispatch queue (within
> mq-deadline)? Is this the right approach?

Try with the "none" scheduler, that is, no scheduler at all.
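For example, something like the following sketch (the device name
/dev/mmcblk0 and root privileges are assumptions; adjust for your
system) switches the scheduler through sysfs:

  #!/usr/bin/env python3
  # Minimal sketch: inspect and change a block device's I/O scheduler
  # through sysfs. Assumes the SD card is /dev/mmcblk0 and that this
  # runs as root; adjust the device name for your setup.
  from pathlib import Path

  queue = Path("/sys/block/mmcblk0/queue")

  # The active scheduler is shown in brackets, e.g.
  # "[mq-deadline] kyber bfq none".
  print((queue / "scheduler").read_text().strip())

  # Select "none", i.e. no scheduler at all.
  (queue / "scheduler").write_text("none")

  # If you switch back to mq-deadline, its tunables appear under
  # iosched/, e.g. write_expire (in ms) and fifo_batch:
  #   (queue / "iosched" / "write_expire").write_text("5000")

Since "none" does no sorting or merging at all, comparing it against
mq-deadline isolates how much the scheduler is actually contributing
on your card.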
>
> Thanks,
>
> Andrew Murray

--
Damien Le Moal
Western Digital Research