The following patches were made over Linus's tree but apply over Martin's branches. They allow userspace to configure how fabric drivers submit cmds to backend drivers. Right now loop and vhost use a worker thread, and the other drivers submit from the contexts they receive/process the cmd from. For multiple LUN cases where the target can queue more cmds than the backend can handle then defering to a worker thread is safest because the backend driver can block when doing things like waiting for a free request/tag. For cases where the backend devices can queue everything the target sends, then there is no need to defer to a workqueue and you can see a perf boost of up to 26% for small IO workloads. For a nvme device and vhost-scsi I can see with 4K IOs: fio jobs 1 2 4 8 10 -------------------------------------------------- workqueue submit 94K 190K 394K 770K 890K direct submit 128K 252K 488K 950K -