On Thu, Sep 22, 2016 at 6:57 AM, Bartlomiej Zolnierkiewicz
<b.zolnierkie@xxxxxxxxxxx> wrote:

> Since Linus Walleij is also working on that and I won't
> probably have time to touch this code till the end of
> upcoming month, here it is (basically a code dump of my
> proof-of-concept work). I hope that it would be useful
> to somebody.
>
> It is extremely ugly & full of bogus debug code but boots
> fine on my Odroid-XU3 and benchmarks can be run.

Haha, it is still good discussion material.

FWIW your patchset is way more advanced than whatever I cooked up, and
the approach taken is much better: first rip out async requests, then
add an MQ callback block, then add async requests back once there is a
function to monitor whether the queue is busy.

I sat down with Ulf Hansson and Arnd Bergmann to discuss the material
and the issues we face if/when migrating the MMC/SD code to blk-mq.

Just for context to everyone: MMC/SD has asynchronous request handling
that calls all the way into the driver to do some DMA mapping (cache
flush) of sglists with dma_map_sg() before the hardware starts
processing the actual request. There is a post_req() callback as well,
performing dma_unmap_sg(). This is mostly a non-issue on cache-coherent
memory architectures like x86, but it gives a nice performance boost on
ARM (etc.) systems. In theory the callbacks could be used for other
stuff, but all current drivers ultimately call
dma_map_sg()/dma_unmap_sg() (see sketch [1] at the end of this mail).

The interesting solution to achieve asynchronous requests, a.k.a.
double-buffering, a.k.a. request pipelining, is basically this from the
last patch:

-       mq->qdepth = 1;
+       mq->qdepth = 2;

So we claim that the hardware queue has a depth of two requests, but
well... that is not really true. If we start confusing concepts like
this to get parallelism, what shall we set this to when we exploit
command queueing and actually have a queue depth of, say, 64? That will
result in a pile of hacks.

The proper solution would be to augment the struct blk_mq_ops vtable
with a .pre_queue_rq() and .post_complete_rq() or something (sketch [2]
below). The way I read the code, the init_request() and exit_request()
callbacks cannot be used, as they only deal with allocating the struct,
and this seems to happen before the request is actually filled in with
the data (correct me if I don't understand this right!); this seems to
be confirmed by the presence of a .reinit_request() callback. So we
can't map/unmap the requests in these callbacks.

We noted that this DMA map/unmap optimization can also be applicable to
USB mass storage, so we would get an optimization from the MQ block
layer that we can reuse in more than MMC/SD.

After this we will still run into the same issue that you found after
this patchset: regressions in performance because of the absence of an
elevator/scheduler algorithm in blk-mq. So we cannot really apply the
patchset before, or at the same time as, we are fixing that.

Apart from that, we saw some really arcane things in the MMC/SD core,
mmc_claim_host() being the most obvious example: as far as we can tell,
it is some kind of reimplementation of mutex_trylock() (sketch [3]
below). Some serious cleanup may be needed here.

It's nice that your first patch rips out the quirky kthread that polls
the block queue for new requests and sends them down to the MMC core,
including picking out a few NULL requests and flushing its async work
queue with that.
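[1] For those not familiar with the mechanism, a minimal sketch of what
the pre_req()/post_req() host hooks boil down to in practice. The
example_* names are mine, and I am hand-waving over the host_cookie
bookkeeping that real drivers use to remember whether a request was
pre-mapped; the real callback signatures are in
include/linux/mmc/host.h:

static void example_pre_req(struct mmc_host *host,
			    struct mmc_request *mrq)
{
	struct mmc_data *data = mrq->data;

	/* Map (and on non-coherent ARM: cache flush) the sglist
	 * while the previous request is still being processed by
	 * the hardware. */
	dma_map_sg(mmc_dev(host), data->sg, data->sg_len,
		   data->flags & MMC_DATA_WRITE ?
		   DMA_TO_DEVICE : DMA_FROM_DEVICE);
}

static void example_post_req(struct mmc_host *host,
			     struct mmc_request *mrq, int err)
{
	struct mmc_data *data = mrq->data;

	/* Unmap once the hardware is done with the request. */
	dma_unmap_sg(mmc_dev(host), data->sg, data->sg_len,
		     data->flags & MMC_DATA_WRITE ?
		     DMA_TO_DEVICE : DMA_FROM_DEVICE);
}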
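[2] A rough sketch of the blk_mq_ops extension we have in mind. To be
clear, these callbacks do not exist in blk-mq today; the names and
signatures are just the proposal from this mail:

struct blk_mq_ops {
	/* ... existing callbacks: queue_rq, init_request, ... */

	/*
	 * Called once the request is fully populated with data,
	 * before it is handed to ->queue_rq(), so the driver can
	 * dma_map_sg() the sglist while an earlier request is
	 * still in flight.
	 */
	void (*pre_queue_rq)(struct blk_mq_hw_ctx *hctx,
			     struct request *rq);

	/*
	 * Called after the request has completed, before it is
	 * recycled, so the driver can dma_unmap_sg().
	 */
	void (*post_complete_rq)(struct request *rq, int error);
};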
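[3] To illustrate "arcane": this is roughly what __mmc_claim_host() in
drivers/mmc/core/core.c looks like, condensed from memory with the
abort and runtime PM handling stripped out. It is a re-entrant lock
hand-rolled from a spinlock and a waitqueue:

int __mmc_claim_host(struct mmc_host *host, ...)
{
	DECLARE_WAITQUEUE(wait, current);
	unsigned long flags;

	add_wait_queue(&host->wq, &wait);
	spin_lock_irqsave(&host->lock, flags);
	while (1) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		/* Re-entrant: the claiming task may claim again */
		if (!host->claimed || host->claimer == current)
			break;
		spin_unlock_irqrestore(&host->lock, flags);
		schedule();
		spin_lock_irqsave(&host->lock, flags);
	}
	set_current_state(TASK_RUNNING);
	host->claimed = 1;
	host->claimer = current;
	host->claim_cnt += 1;
	spin_unlock_irqrestore(&host->lock, flags);
	remove_wait_queue(&host->wq, &wait);
	/* ... */
}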
Yours,
Linus Walleij