Hello,

I've been working with an embedded video recording device that writes data to the exFAT filesystem of an SD card. 4 files (streams of video) are written simultaneously by processes usually running on different cores.

If you assume a newly formatted SD card, then because exFAT clusters appear to be allocated sequentially as needed when data is written (in userspace), those clusters are interleaved across the 4 files. When writeback occurs, you see a cluster of sequential writes for each file, but with gaps where clusters are used by other files (as if writeback is happening per process at different times). This results in a random access pattern. Ideally an IO scheduler would recognise that these are cooperating processes, merge these 4 clusters of activity into a sequential write pattern, and combine them into fewer, larger writes.

This is important for SD cards, because their write throughput is very dependent on the access pattern and the size of each write request. For example, my current SD card with the above access pattern (writes averaging 60KB) gives a write throughput, for a fully utilised device, of less than a few MB/s. This may seem contrary to the performance claims of SD card manufacturers, but those claims are typically made for sequential access with 512KB writes. Further, the claims made for the UHS speed grades (e.g. U3) and the video class grades (e.g. V90) also assume that specific SD card commands are used to enter a specific speed grade mode (which isn't supported in Linux, it seems). In other words, larger writes and more sequential access patterns will increase the available bandwidth. (The only exception appears to be the application classes of SD cards, which are optimised for random access at 4KB.)

I've explored the various mq schedulers (I'm still learning) - though I understand that, to prevent software being a bottleneck for fast devices, each core (or is that process?) has its own queue.
As a result, schedulers can't make decisions across those queues (as that would defeat the point of mq). This implies that in my use-case, where "cooperating processes" are on separate cores, the scheduler has no capability to combine the interleaved writes (I notice that bfq has logic to detect this, though I'm not sure if it's for reads or rotational devices).

I've seen that mq-deadline appears to sort its dispatch queue by sector (I understand this is a single queue for the device, so this is where those software queues join). Combined with the write_expire and fifo_batch tunables, mq-deadline appears to do a good job of turning interleaved writes into sequential writes (even across processes running on different cores). However, it doesn't appear to combine writes, which would greatly help.

Many of the schedulers aim to achieve a maximum latency, but it feels like for slow devices a minimum write latency, and the ability to reorder and combine those writes across cores, would be beneficial.

I'm keen to understand if there is anything I've missed. Perhaps there are tunables or a specific scheduler that fits this purpose? Are my assumptions about the mq layer correct?

Does it make sense to add merging in the dispatch queue (within mq-deadline)? Is this the right approach?

Thanks,

Andrew Murray
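
P.S. To make the access pattern described above concrete, here is a minimal Python sketch. It is a toy model built on simplifying assumptions (strict round-robin cluster allocation across the 4 files, each file flushed in one go, one write per cluster). It is not kernel or exFAT code, and all function names are purely illustrative.

```python
# Toy model only: round-robin allocation and per-file writeback are
# simplifying assumptions, not how exFAT or the block layer is implemented.

def allocate_interleaved(num_files, clusters_per_file):
    """Round-robin allocation: file f gets clusters f, f + num_files, ..."""
    return {
        f: [f + i * num_files for i in range(clusters_per_file)]
        for f in range(num_files)
    }


def writeback_order(layout):
    """Each file is flushed separately, one file after another."""
    return [cluster for f in sorted(layout) for cluster in layout[f]]


def count_seeks(order):
    """Count writes that do not start right after the previous write."""
    return sum(1 for prev, cur in zip(order, order[1:]) if cur != prev + 1)


def sort_and_merge(clusters):
    """Sort by cluster number, then merge adjacent clusters into
    (start, length) extents."""
    extents = []
    for cluster in sorted(clusters):
        if extents and extents[-1][0] + extents[-1][1] == cluster:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            extents.append((cluster, 1))
    return extents


if __name__ == "__main__":
    layout = allocate_interleaved(num_files=4, clusters_per_file=8)
    order = writeback_order(layout)
    print("non-sequential writes as issued:", count_seeks(order))
    print("non-sequential writes once sorted:", count_seeks(sorted(order)))
    print("extents after sort + merge:", sort_and_merge(order))
```

Sorting alone removes the jumps between writes but still leaves one request per cluster. It is the merge step that turns them into fewer, larger writes, which is the part I believe is missing for my use-case.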