Ok, so the real problem is per-cpu bounded tasks. I share Thomas's opinion about a NAPI-like approach.
We already have that, it's irq_poll, but it seems that for this use case we get lower performance for some reason. I'm not entirely sure why that is; maybe it's because we need to mask interrupts, since nvme doesn't have an "arm" register like network devices have?
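
To make the masking cost I'm guessing at concrete, here is a rough userspace toy model of a budgeted poll loop. It is not kernel code and does not use the real irq_poll API; every name in it (toy_cq, toy_irq, toy_poll, toy_run) is made up for illustration. The assumption is that, without an "arm" register, each hand-off to the poller costs one mask write and each drain costs one unmask write.

```c
#include <stdbool.h>

/* Hypothetical toy model of a completion queue; NOT kernel code. */
struct toy_cq {
    int pending;      /* completions waiting to be reaped */
    bool irq_masked;  /* true while the poller owns the queue */
    int mask_writes;  /* number of mask/unmask operations performed */
};

/* Models the hard IRQ handler: mask the vector, defer to the poller. */
static void toy_irq(struct toy_cq *cq)
{
    cq->irq_masked = true;
    cq->mask_writes++;
}

/*
 * Reap at most 'budget' completions and return how many were reaped.
 * If the queue drained before the budget was used up, unmask the
 * interrupt again (the second register write of the cycle).
 */
static int toy_poll(struct toy_cq *cq, int budget)
{
    int done = 0;

    while (done < budget && cq->pending > 0) {
        cq->pending--;
        done++;
    }
    if (done < budget) {
        cq->irq_masked = false;
        cq->mask_writes++;
    }
    return done;
}

/* Run one interrupt-to-drain cycle; returns the number of poll rounds. */
static int toy_run(struct toy_cq *cq, int budget)
{
    int rounds = 0;

    toy_irq(cq);
    while (cq->irq_masked) {
        toy_poll(cq, budget);
        rounds++;
    }
    return rounds;
}
```

With 10 pending completions and a budget of 4, toy_run() takes three poll rounds and two mask writes. With a single pending completion it still takes two mask writes, which is the per-interrupt overhead I suspect hurts us at low queue depth, if that guess is right at all.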