The bug is triggered when multipathd receives a path-failed event after
all paths have already been marked as failed. Surprisingly, this seems
to happen quite often; a colleague of mine who tested this hit the bug
every time.

Here is the event sequence that explains the bug. I kept only some of
the messages for clarity; the full log is available on request.

We have completed initialization and set the feature queue_if_no_path
for map CX_201 by virtue of using no_path_retry > 0.

Aug 31 10:49:09 | CX_201: devmap event #18
Aug 31 10:49:09 | CX_201: discover
Aug 31 10:49:09 | CX_201: rr_weight = 1 (internal default)
Aug 31 10:49:09 | CX_201: pgfailback = -2 (controller setting)
Aug 31 10:49:09 | CX_201: no_path_retry = 2 (controller setting)
Aug 31 10:49:09 | pg_timeout = NONE (internal default)
Aug 31 10:49:09 | 65:192: mark as failed
Aug 31 10:49:09 | CX_201: remaining active paths: 3
Aug 31 10:49:09 | 8:192: mark as failed
Aug 31 10:49:09 | CX_201: remaining active paths: 2
Aug 31 10:49:09 | CX_201: devmap event #19
Aug 31 10:49:09 | CX_201: discover
Aug 31 10:49:09 | CX_201: rr_weight = 1 (internal default)
Aug 31 10:49:09 | CX_201: pgfailback = -2 (controller setting)
Aug 31 10:49:09 | CX_201: no_path_retry = 2 (controller setting)
Aug 31 10:49:09 | pg_timeout = NONE (internal default)

Two paths were failed by the driver, and multipathd marked them as
failed.

Aug 31 10:49:09 | checker failed path 66:0 in map CX_201
Aug 31 10:49:09 | CX_201: remaining active paths: 1

The checker failed the third path.

Aug 31 10:49:09 | checker failed path 8:96 in map CX_201
Aug 31 10:49:09 | CX_201: Entering recovery mode: max_retries=2
Aug 31 10:49:09 | CX_201: remaining active paths: 0

The checker failed the last path; multipathd entered the retry loop.

Aug 31 10:49:10 | CX_201: devmap event #20

We got a late event about a failed path.

Aug 31 10:49:10 | CX_201: discover

Discovery starts. The call chain is update_multipath ->
setup_multipath -> update_multipath_strings -> update_multipath_table ->
disassemble_map.

Now disassemble_map tries to set the no_path_retry value from the
kernel. This cannot work, because the kernel remembers only a Boolean
(queue/fail), while no_path_retry is an arbitrary integer. So
no_path_retry is set to NO_PATH_RETRY_QUEUE from the kernel state.

Aug 31 10:49:10 | CX_201: rr_weight = 1 (internal default)
Aug 31 10:49:10 | CX_201: pgfailback = -2 (controller setting)

At this point we call set_no_path_retry:

	set_no_path_retry(struct multipath *mpp)
	{
		mpp->retry_tick = 0;
		mpp->nr_active = pathcount(mpp, PATH_UP) + pathcount(mpp, PATH_GHOST);
		if (mpp->nr_active > 0)
			select_no_path_retry(mpp);

So:

1) retry_tick is reset;
2) nr_active = 0 (no active path);
3) we do not set no_path_retry from the config file because
   nr_active == 0, so we are left with NO_PATH_RETRY_QUEUE.

Aug 31 10:49:10 | pg_timeout = NONE (internal default)

From now on there are no state changes, so the map hangs forever.

Signed-off-by: Martin Wilck <martin.wilck@xxxxxxxxxxxxxx>
Signed-off-by: Hannes Reinecke <hare@xxxxxxx>
---
 libmultipath/structs_vec.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/libmultipath/structs_vec.c b/libmultipath/structs_vec.c
index 384afb7..7073915 100644
--- a/libmultipath/structs_vec.c
+++ b/libmultipath/structs_vec.c
@@ -306,8 +306,7 @@ set_no_path_retry(struct multipath *mpp)
 {
 	mpp->retry_tick = 0;
 	mpp->nr_active = pathcount(mpp, PATH_UP) + pathcount(mpp, PATH_GHOST);
-	if (mpp->nr_active > 0)
-		select_no_path_retry(mpp);
+	select_no_path_retry(mpp);
 
 	switch (mpp->no_path_retry) {
 	case NO_PATH_RETRY_UNDEF:
-- 
1.7.4.2
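
For readers who want to replay the failure without a multipath setup, the sequence can be modeled in a few lines of standalone C. This is only a sketch under stated assumptions: struct toy_map, the toy_* functions and the RETRY_* constants are hypothetical stand-ins and are not the libmultipath definitions. The sketch captures just the two facts from the description: the kernel view collapses the policy to queue/fail, and the old code reloads the configured value only when at least one path is active.

```c
#include <assert.h>

/* Hypothetical stand-ins; names and values are NOT taken from libmultipath. */
enum { RETRY_UNDEF = 0, RETRY_FAIL = -1, RETRY_QUEUE = -2 };

struct toy_map {
	int no_path_retry; /* current policy, may be overwritten from the kernel view */
	int configured;    /* value from the configuration, 2 in the log */
	int nr_active;     /* number of usable paths */
	int retry_tick;
};

/* Stands in for select_no_path_retry(): reload the configured value. */
static void toy_select(struct toy_map *m)
{
	m->no_path_retry = m->configured;
}

/* Stands in for the kernel view: only queue/fail survives, not the count. */
static void toy_kernel_view(struct toy_map *m, int queue_if_no_path)
{
	m->no_path_retry = queue_if_no_path ? RETRY_QUEUE : RETRY_FAIL;
}

/* Behaviour before the patch: reload only if some path is active. */
static void toy_set_old(struct toy_map *m)
{
	m->retry_tick = 0;
	if (m->nr_active > 0)
		toy_select(m);
}

/* Behaviour after the patch: always reload the configured value. */
static void toy_set_new(struct toy_map *m)
{
	m->retry_tick = 0;
	toy_select(m);
}

/* Replay the late event: all paths down, kernel reports "queue". */
static int toy_replay(int use_fix)
{
	struct toy_map m = {
		.no_path_retry = 2, .configured = 2,
		.nr_active = 0, .retry_tick = 3,
	};

	assert(m.nr_active == 0); /* precondition of the bug */
	toy_kernel_view(&m, 1);
	if (use_fix)
		toy_set_new(&m);
	else
		toy_set_old(&m);
	return m.no_path_retry;
}
```

In this model toy_replay(0) returns RETRY_QUEUE, which corresponds to the map queueing forever, while toy_replay(1) returns the configured value 2, so the retry limit applies again.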