On 6/6/23 20:37, Andrew Murray wrote:
> Hello,
>
> I've been working with an embedded video recording device that writes
> data to the exfat filesystem of an SD card. 4 files (streams of video)
> are written simultaneously by processes usually running on different
> cores.
>
> If you assume a newly formatted SD card, then because exfat clusters
> appear to be sequentially allocated as needed when data is written (in
> userspace), those clusters are interleaved across the 4 files. When
> writeback occurs, you see a cluster of sequential writes for each
> file, but with gaps where clusters are used by other files (as if
> writeback is happening per process at different times). This results
> in a random access pattern. Ideally an IO scheduler would recognise
> that these are cooperating processes, merge these 4 clusters of
> activity to produce a sequential write pattern, and combine them into
> fewer, larger writes.
>
> This is important for SD cards because their write throughput is very
> dependent on access patterns and the size of write requests. For
> example, my current SD card with the above access pattern (writes
> averaging 60KB) achieves a write throughput, for a fully utilised
> device, of less than a few MB/s. This may seem contrary to the
> performance claims of SD card manufacturers, but those claims are
> typically made for sequential access with 512KB writes. Further, the
> claims made for the UHS speed grades (e.g. U3) and the video class
> grades (e.g. V90) also assume that specific SD card commands are used
> to enter a specific speed grade mode (which doesn't seem to be
> supported in Linux). In other words, larger write accesses and more
> sequential access patterns will increase the available bandwidth. (The
> only exception appears to be the application classes of SD cards,
> which are optimised for random access at 4KB.)
>
> I've explored the various mq schedulers (I'm still learning) - though
> I understand that, to prevent software being a bottleneck for fast
> devices, each core (or is that process?) has its own queue. As a
> result, schedulers can't make decisions across those queues (as that
> would defeat the point of mq). This implies that in my use case, where
> "cooperating processes" are on separate cores, there is no capability
> for the scheduler to combine the interleaved writes (I notice that bfq
> has logic to detect this, though I'm not sure if it's for reads or
> rotational devices).
>
> I've seen that mq-deadline appears to sort its dispatch queue (as I
> understand it, a single queue for the device - so this is where those
> software queues join) by sector. Combined with the write_expire and
> fifo_batch tunables, mq-deadline appears to do a good job of turning
> interleaved writes into sequential writes (even across processes
> running on different cores). However, it doesn't appear to combine
> writes, which would greatly help.
>
> Many of the schedulers aim to achieve a maximum latency, but it feels

maximum throughput... Maximizing latency is not something that anyone
wants :)

> like for slow devices, a minimum write latency and the ability to
> reorder and combine those writes across cores would be beneficial.
> I'm keen to understand if there is anything I've missed. Perhaps there
> are tunables or a specific scheduler that fits this purpose? Are my
> assumptions about the mq layer correct?
>
> Does it make sense to add merging in the dispatch queue (within
> mq-deadline)? Is this the right approach?

Try with the "none" scheduler, that is, no scheduler at all.
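For example, something like the following sketch (the device name
/dev/mmcblk0 and root privileges are assumptions; adjust for your
system) switches the scheduler through sysfs:

  #!/usr/bin/env python3
  # Minimal sketch: inspect and change a block device's I/O scheduler
  # through sysfs. Assumes the SD card is /dev/mmcblk0 and that this
  # runs as root; adjust the device name for your setup.
  from pathlib import Path

  queue = Path("/sys/block/mmcblk0/queue")

  # The active scheduler is shown in brackets, e.g.
  # "[mq-deadline] kyber bfq none".
  print((queue / "scheduler").read_text().strip())

  # Select "none", i.e. no scheduler at all.
  (queue / "scheduler").write_text("none")

  # If you switch back to mq-deadline, its tunables appear under
  # iosched/, e.g. write_expire (in ms) and fifo_batch:
  #   (queue / "iosched" / "write_expire").write_text("5000")

Since "none" does no sorting or merging at all, comparing it against
mq-deadline isolates how much the scheduler is actually contributing
on your card.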
>
> Thanks,
>
> Andrew Murray

--
Damien Le Moal
Western Digital Research