Hi, This patch-set is the updated version of request-based dm-multipath including a lot of bug fixes. (Most of them are corner case scenarios found in heavy stress testing and code reviews. Also thanks to everyone testing the previous patches and reporting bugs.) drivers/md/dm-ioctl.c | 13 drivers/md/dm-mpath.c | 194 +++++--- drivers/md/dm-table.c | 122 +++++ drivers/md/dm.c | 973 ++++++++++++++++++++++++++++++++++++++++-- drivers/md/dm.h | 21 fs/bio-integrity.c | 5 fs/bio.c | 2 include/linux/bio.h | 5 include/linux/device-mapper.h | 10 9 files changed, 1238 insertions(+), 107 deletions(-) The patches are created to replace the current version of the Alasdair's editing tree: PATCH 1: replacing dm-add-request-based-facility.patch PATCH 2: replacing dm-permit-request-based-facility.patch PATCH 3: replacing dm-enforce-new-queue-limits.patch PATCH 4: newly added to prevent lockdep warnings PATCH 5: replacing dm-mpath-change-to-be-request-based.patch PATCH 6: replacing block-add-gfp_mask-to-bio_integrity_clone.patch PATCH 7: replacing dm-add-gfp_mask-to-bio_integrity_clone.patch PATCH 8: replacing dm-retain-integrity-request-based-clones.patch The patches have some cosmetic dependencies on the previously posted dynamic load balancer patches: http://marc.info/?l=dm-devel&m=123736539330980&w=2 but no functional dependency on them. (Actually, dynamic load balancers rely on this patch-set for effectiveness.) Changes from the version in the Alasdair's editing tree are: - Fixed oops on table swapping because of broken suspend (PATCH 1) http://marc.info/?l=dm-devel&m=123667308218868&w=2 * dm_suspend() recognized that suspend completed while some clones were still in flight. So swapping/freeing table was possible against the in-use table, and it caused oops * The problem was a race condition on in-flight checking: new I/O could become in-flight while suspend code found no I/O was in-flight and went on * Fixed it by protecting the in-flight counting and queue suspension with queue lock. Also using q->in_flight instead of md->pending since it is straightforward to understand the meaning of the in-flight clones - Fixed NULL pointer reference when allocation for clone bio fails (From Hannes Reinecke) (PATCH 1) http://marc.info/?l=dm-devel&m=123666946315335&w=2 * free_bio_clone() used clone->end_io_data to get md * But clone->end_io_data can be NULL when free_bio_clone() is called from error handling path in clone_request_bios() * Fixed it by passing md directly to free_bio_clone() * Removed the work around code to check bio->bi_private: http://marc.info/?l=dm-devel&m=123666946315335&w=2 - Fixed unnecessary -EIO on table swapping by removing the map-existence check in dm_make_request() (PATCH 1) http://marc.info/?l=dm-devel&m=123322573410050&w=2 * The check was there to reject BIOs until a table is loaded and the device and the queue are initialized correctly * But the check is not needed because BIOs are rejected in __generic_make_request() by device size checking until the device and the queue are initialized - Fixed freeing of in-use md (PATCH 1) * Upper layer can close the device just after all BIOs are completed by blk_update_request() or blk_end_request(), even if the request is still in the process of completion. It creates a small window where dm_put() can free the md which is needed for the process of completion * Fixed it by incrementing md->holders while I/O is in-flight - Fixed memory leak (missing unmap) on failure of dispatching a mapped clone (PATCH 1) * dm_kill_request() was used in dm_dispatch_request() when the dispatching fails * But dm_kill_requst() is for not-mapped clone, so target's rq_end_io() function is not called. This caused memory leak. * Fixed it by creating dm_complete_request() for mapped clone and calling it in dm_dispatch_request() - Changed exported function names to prevent misuse like above (PATCH 1, PATCH 3, PATCH 5) * dm_kill_request() => dm_kill_unmapped_request() * dm_requeue_request() => dm_requeue_unmapped_request() - Removed the REQ_QUIET flag in dm_kill_unmapped_request(), since "I/O error" message is not printed for failed I/O (PATCH 1) - Removed 'inline' for simple static functions, since compiler will do inline anyway (PATCH 1) - Removed if-request-based check in dm_prep_fn(), since the device is always request-based when dm_prep_fn() is called (PATCH 1) * The check was for the case where the table is switched from request-based to bio-based. But currently no plan to support such switching - Fixed the problem that switching device type is unexpectedly available on table swapping (PATCH 2) * table_load() checks md->map for the current table type to prevent the device type switching * But table_load() can run in parallel with dm_swap_table(), which has a small no-map window. So dm_init_md_mempool() in table_load() could switch the mempool type for an in-use device in this window * Fixed it by placing the pre-allocated mempools in table and binding them on resume. The resume is rejected if the device is in-use with different type - Fixed the problem that BIO is submitted to a request-based device in which some settings may not be done yet (PATCH 2) * The QUEUE_FLAG_STACKABLE flag was set in the middle of dm_table_set_restrictions() * But the flag needs to be set after all queue settings are done and they become visible to other CPUs, because: o BIOs can be submitted during dm_table_set_restrictions() o Once the QUEUE_FLAG_STACKABLE flag is set in the queue, BIOs are passed to request-based dm and eventually to __make_request() where the queue settings are used * Fixed it by re-ordering the flag set and putting a barrier before the flag set. - Fixed lockdep warnings (From Christof Schmitt) (PATCH 4) http://marc.info/?l=dm-devel&m=123501258326935&w=2 * lockdep warns some self-deadlock possibilities since: o map_lock is taken without disabling irq o request-based dm takes map_lock after taking queue_lock with disabling irq in dm_request_fn() o so logically a deadlock may happen: write_lock(map_lock) <interrupt> spin_lock_irqsave(queue_lock) dm_request_fn() => dm_get_table() => read_lock(map_lock) * dm_request_fn() is never called in interrupt context, so such self-deadlocks don't happen actually * But fixed the unneeded warnings by disabling irq when taking map_lock Some basic function/performance testings are done with NEC iStorage (active-active multipath), and no problem was found. Also heavy stress testing with a lot of path up/down and table switching is done to check panic, stall and memory leak don't occur. Please review and apply if no problem. Summary of the design and request-based dm-multipath are below. (No changes since the previous post.) BACKGROUND ========== Currently, device-mapper (dm) is implemented as a stacking block device at bio level. This bio-based implementation has an issue below on dm-multipath. Because hook for I/O mapping is above block layer __make_request(), contiguous bios can be mapped to different underlying devices and these bios aren't merged into a request. Dynamic load balancing could happen this situation, though it has not been implemented yet. Therefore, I/O mapping after bio merging is needed for better dynamic load balancing. The basic idea to resolve the issue is to move multipathing layer down below the I/O scheduler, and it was proposed from Mike Christie as the block layer (request-based) multipath: http://marc.info/?l=linux-scsi&m=115520444515914&w=2 Mike's patch added new block layer device for multipath and didn't have dm interface. So I modified his patch to be used from dm. It is request-based dm-multipath. DESIGN OVERVIEW =============== While currently dm and md stacks block devices at bio level, request-based dm stacks at request level and submits/completes struct request instead of struct bio. Overview of the request-based dm patch: - Mapping is done in a unit of struct request, instead of struct bio - Hook for I/O mapping is at q->request_fn() after merging and sorting by I/O scheduler, instead of q->make_request_fn(). - Hook for I/O completion is at bio->bi_end_io() and rq->end_io(), instead of only bio->bi_end_io() bio-based (current) request-based (this patch) ------------------------------------------------------------------ submission q->make_request_fn() q->request_fn() completion bio->bi_end_io() bio->bi_end_io(), rq->end_io() - Whether the dm device is bio-based or request-based is determined at table loading time - Keep user interface same (table/message/status/ioctl) - Any bio-based devices (like dm/md) can be stacked on request-based dm device. Request-based dm device *cannot* be stacked on any bio-based device. Expected benefit: - better load balancing Additional explanations: Why does request-based dm use bio->bi_end_io(), too? Because: - dm needs to keep not only the request but also bios of the request, if dm target drivers want to retry or do something on the request. For example, dm-multipath has to check errors and retry with other paths if necessary before returning the I/O result to the upper layer. - But rq->end_io() is called at the very late stage of completion handling where all bios in the request have been completed and the I/O results are already visible to the upper layer. So request-based dm hooks bio->bi_end_io() and doesn't complete the bio in error cases, and gives over the error handling to rq->end_io() hook. Thanks, Kiyoshi Ueda -- dm-devel mailing list dm-devel@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/dm-devel