In my environment, large sequential I/O requests (e.g. 4 MB) are split into much smaller requests (e.g. 8 KB) and passed to a kernel module, which in turn submits them to an NVMe device. This splitting seems to hurt performance badly. For architectural reasons I cannot prevent the issuer from splitting the requests, so I have to find a way to reconstruct the original, large request.

The only way I can see to achieve this is to implement the merging directly in my kernel module. Before proceeding with the implementation, however, I'd like some evidence that merging the requests actually solves the performance problem. The trouble is that I can't think of an easy way to test this: the NVMe device doesn't have a block queue interface, and I can't find a virtual block layer that would do the merging for me (I looked at dm-linear, mdadm linear, and loopback).

Is there any way I can effectively merge these requests?

-- Thanos Makatos
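
To make the kind of merging described above concrete, here is a minimal user-space sketch in Python. It is not kernel code and does not touch the NVMe driver; it only illustrates the coalescing logic. The `(offset, length)` request representation, the `coalesce` function, and the `max_len` cap are all hypothetical names introduced for illustration.

```python
from typing import List, Tuple

# Hypothetical representation of a request: (byte offset, length in bytes).
Request = Tuple[int, int]


def coalesce(requests: List[Request], max_len: int = 4 * 1024 * 1024) -> List[Request]:
    """Merge back-to-back requests into larger ones, capped at max_len bytes.

    Assumes requests arrive in submission order; only a request that starts
    exactly where the previous (possibly already merged) one ends is merged.
    """
    merged: List[Request] = []
    for offset, length in requests:
        if merged:
            last_off, last_len = merged[-1]
            contiguous = last_off + last_len == offset
            fits = last_len + length <= max_len
            if contiguous and fits:
                # Extend the previous request instead of emitting a new one.
                merged[-1] = (last_off, last_len + length)
                continue
        merged.append((offset, length))
    return merged


# 4 MB of sequential I/O arriving as 512 requests of 8 KB each.
small = [(i * 8192, 8192) for i in range(512)]
print(coalesce(small))  # prints [(0, 4194304)]
```

A real implementation in the kernel module would also have to decide how long to hold a request while waiting for its neighbours, which this sketch deliberately ignores.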