Dear Christophe,

Could you please consider applying this patch set, or give us any feedback
about it? We (Huawei and Brocade) are looking forward to your reply. Thanks.
I have also appended a small stand-alone sketch of the probing idea at the
end of this mail, in case it is useful for review.

Regards,
Guan Junxiong

On 2017/10/24 9:57, Guan Junxiong wrote:
> Hi Christophe and All,
>
> This patch set adds a new method of path state checking based on accounting
> IO errors. This is useful in many scenarios such as intermittent IO errors
> on a path due to intermittent frame drops, intermittent corruptions, network
> congestion or a shaky link.
>
> This patch set is significant for the following reasons (quoted from the
> discussion with Muneendra, Brocade):
>
> There are typically two types of SAN network problems that are categorized
> as marginal issues. These issues are by nature not permanent and come and
> go over time.
> 1) Switches in the SAN can have intermittent frame drops or intermittent
> frame corruptions due to a bad optics cable (SFP) or similar wear/tear port
> issues. This causes ITL flows that go through the faulty switch/port to
> intermittently experience frame drops.
> 2) There exist SAN topologies where switch ports in the fabric become the
> only conduit for many different ITL (host--target--LUN) flows across
> multiple hosts. These single network paths are essentially shared across
> multiple ITL flows. Under these conditions, if the port link bandwidth
> cannot handle the net sum of the shared ITL flows' bandwidth going through
> the single path, then we can see intermittent network congestion problems.
> This condition is called network oversubscription. The intermittent
> congestion can delay SCSI exchange completion time (an increase in I/O
> latency is observed).
>
> To overcome the above network issues and many more such target issues,
> there are frame-level retries done in the HBA device firmware and I/O
> retries in the SCSI layer. These retries might succeed for the following
> reasons:
> 1) The intermittent switch/port issue is not observed.
> 2) The retry I/O is a new SCSI exchange. This SCSI exchange can take an
> alternate SAN path for the ITL flow, if such a SAN path exists.
> 3) Network congestion disappears momentarily because the net I/O bandwidth
> coming from multiple ITL flows on the single shared network path is
> something the path can handle.
>
> However, in some cases we have seen that I/O retries do not succeed,
> because the retried I/Os hit a SAN network path that has an intermittent
> switch/port issue and/or network congestion.
>
> On the host we thus see configurations with two or more ITL paths sharing
> the same target/LUN and going through two or more HBA ports. These HBA
> ports are connected through two or more SANs to the same target/LUN.
> If the I/O fails at the multipath layer, the ITL path is moved into the
> Failed state. Because of the marginal nature of the network, the next
> health-check command sent from the multipath layer might succeed, which
> moves the ITL path back into the Active state. You end up seeing the DM
> path state going through Active, Failed, Active transitions. This results
> in an overall reduction in application I/O throughput and sometimes in
> application I/O failures (because of timing constraints). All of this can
> happen because of I/O retries and I/O requests moving across multiple
> paths of the DM device. On the host, note that all I/O retries on a single
> path and all I/O movement across multiple paths slow down the forward
> progress of new application I/O.
> The reason is that the above I/O re-queue actions are given higher
> priority than the newer I/O requests coming from the application.
>
> The above condition of the ITL path is hence called "marginal".
>
> What we desire is for DM to deterministically categorize an ITL path as
> "marginal" and to move all pending I/Os from the marginal path to an
> active path. This will help in meeting application I/O timing constraints.
> We also want the capability to automatically re-instate the marginal path
> into the Active state once the marginal condition in the network is fixed.
>
>
> Here is a description of the implementation:
> 1) PATCH 1/2 implements the algorithm that sends a series of continuous
> IOs to a path which suffers two failure events in less than a given time.
> Those IOs are sent at a fixed rate of 10 Hz.
> 2) PATCH 2/2 discards the original san_path_err_XXX algorithm, because the
> sampling interval of the regular path checker it relies on is so coarse
> that it misses what happens in the middle of the interval. PATCH 1/2 is a
> better method.
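>
> To make the new interface concrete, here is a minimal illustration of how
> the accounting might be enabled in multipath.conf. The values are
> placeholders only, not recommendations; marginal_path_double_failed_time
> and marginal_path_err_recheck_gap_time are the names mentioned above,
> while the sample-window and rate-threshold names below are assumed from
> the marginal_path_err_XXX family added by PATCH 1/2. Times are in seconds
> and the threshold is a permillage; the authoritative names, units and
> defaults are in the updated multipath.conf.5 of this series:
>
>     defaults {
>         # illustrative values only; see multipath.conf.5 from this series
>         marginal_path_double_failed_time    60
>         marginal_path_err_sample_time       120
>         marginal_path_err_rate_threshold    10
>         marginal_path_err_recheck_gap_time  300
>     }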
>
>
> Changes from V6:
> * fix the warning about the unwrapped commit description in patch 1/2
> * add Reviewed-by tag of Muneendra
> * add a detailed scenario description in the cover letter
>
> Changes from V5:
> * rebase on the latest release 0.7.3
>
>
> Changes from V4:
> * path_io_err_XXX -> marginal_path_err_XXX (Muneendra)
> * add one more parameter, marginal_path_double_failed_time, instead of
>   the fixed 60 seconds for the pre-checking of a shaky path (Martin)
> * fix for the "reschedule checking after %d seconds" log
> * path_io_err_recovery_time -> marginal_path_err_recheck_gap_time
> * put the marginal path into PATH_SHAKY instead of PATH_DELAYED
> * modify the commit comments to sync with the changes above
>
>
> Changes from V3:
> * add a patch to discard the san_path_XXX feature
> * fail the path in the kernel before enqueueing the path for checking,
>   rather than after knowing the checking result, to make it more
>   reliable (Martin)
> * use posix_memalign instead of manual alignment for the direct IO
>   buffer (Martin)
> * use PATH_MAX instead of FILE_NAME_SIZE to avoid a compiler warning
>   when opening a file (Martin)
> * discard an unnecessary sanity check when getting the block size (Martin)
> * do not return 0 in send_each_async_io if io_starttime of a path is
>   not set (Martin)
> * wait 10 ms instead of 60 seconds if every path is down (Martin)
> * rename handle_async_io_timeout to poll_async_io_timeout and use a
>   polling method, because io_getevents does not return 0 if there are
>   both timed-out IOs and normal IOs
> * rename hit_io_err_recover_time to hit_io_err_recheck_time
> * modify multipath.conf.5 and the commit comments to keep them in sync
>   with the above changes
>
>
> Changes from V2:
> * fix unconditional rescheduling forever
> * use scripts/checkpatch.pl from Linux to clean up informal coding style
> * fix "continous" and "internel" typos
>
>
> Changes from V1:
> * send continuous IO instead of a single IO in a sample interval (Martin)
> * when the recovery time expires, reschedule the checking process (Hannes)
> * use the error rate threshold as a permillage instead of an IO number (Martin)
> * use a common io_context for libaio for all paths (Martin)
> * other small fixes (Martin)
>
>
> Junxiong Guan (2):
>   multipath-tools: intermittent IO error accounting to improve
>     reliability
>   multipath-tools: discard san_path_err_XXX feature
>
>  libmultipath/Makefile      |   5 +-
>  libmultipath/config.c      |   3 -
>  libmultipath/config.h      |  21 +-
>  libmultipath/configure.c   |   7 +-
>  libmultipath/dict.c        |  88 +++---
>  libmultipath/io_err_stat.c | 744 +++++++++++++++++++++++++++++++++++++++++++++
>  libmultipath/io_err_stat.h |  15 +
>  libmultipath/propsel.c     |  70 +++--
>  libmultipath/propsel.h     |   7 +-
>  libmultipath/structs.h     |  15 +-
>  libmultipath/uevent.c      |  32 ++
>  libmultipath/uevent.h      |   2 +
>  multipath/multipath.conf.5 |  89 ++++--
>  multipathd/main.c          | 140 ++++-----
>  14 files changed, 1043 insertions(+), 195 deletions(-)
>  create mode 100644 libmultipath/io_err_stat.c
>  create mode 100644 libmultipath/io_err_stat.h
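
P.S. Here is the stand-alone sketch of the probing idea from PATCH 1/2
mentioned above, in case it helps review. It only illustrates the principle
on a single path: 4 KiB O_DIRECT reads issued through libaio at a fixed
10 Hz rate for a sample window, with the failure count reported as a
permillage. It is NOT the io_err_stat.c code from the patch, which shares
one io_context across all paths, polls for IO timeouts with
poll_async_io_timeout and fails the path in the kernel before enqueueing it
for checking.

/*
 * Simplified, illustrative probe loop (not the patch code).
 * Build: gcc -o probe_sketch probe_sketch.c -laio
 * Usage: ./probe_sketch /dev/sdX
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLKSIZE   4096   /* one aligned block per probe read            */
#define SAMPLES   100    /* 100 probes at 10 Hz ~= a 10 second window   */
#define THRESHOLD 10     /* illustrative permillage above which the     */
                         /* path would be treated as marginal           */

int main(int argc, char **argv)
{
    io_context_t ctx;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    struct timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
    void *buf;
    int fd, i, errors = 0, rate;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <path device>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* direct IO needs an aligned buffer, hence posix_memalign (cf. V3 changes) */
    if (posix_memalign(&buf, BLKSIZE, BLKSIZE)) {
        perror("posix_memalign");
        return 1;
    }
    memset(&ctx, 0, sizeof(ctx));
    if (io_setup(1, &ctx) < 0) {
        perror("io_setup");
        return 1;
    }

    for (i = 0; i < SAMPLES; i++) {
        io_prep_pread(&cb, fd, buf, BLKSIZE, 0);
        /* a submit failure, a timed-out probe or a short/failed read all
         * count as one error; for simplicity a timed-out IO is not reaped */
        if (io_submit(ctx, 1, cbs) != 1 ||
            io_getevents(ctx, 1, 1, &ev, &ts) != 1 ||
            ev.res != BLKSIZE)
            errors++;
        usleep(100000);              /* fixed 10 Hz probe rate */
    }

    rate = errors * 1000 / SAMPLES;
    printf("error rate: %d/1000 (threshold %d/1000) -> %s\n",
           rate, THRESHOLD, rate > THRESHOLD ? "marginal" : "good");

    io_destroy(ctx);
    free(buf);
    close(fd);
    return 0;
}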