Dear Christophe,

Could you please consider applying this patch set, or give us any feedback
about it? We (Huawei and Brocade) are looking forward to your reply. Thanks.
I have also appended a small stand-alone sketch of the probing idea at the
end of this mail, in case it is useful for review.

Regards,
Guan Junxiong

On 2017/10/24 9:57, Guan Junxiong wrote:
> Hi Christophe and All,
>
> This patch set adds a new method of path state checking based on accounting
> IO errors. This is useful in many scenarios such as intermittent IO errors
> on a path due to intermittent frame drops, intermittent corruptions, network
> congestion or a shaky link.
>
> This patch set is significant for the following reasons (quoted from the
> discussion with Muneendra, Brocade):
>
> There are typically two types of SAN network problems that are categorized
> as marginal issues. These issues are by nature not permanent and come and
> go over time.
> 1) Switches in the SAN can have intermittent frame drops or intermittent
> frame corruptions due to a bad optics cable (SFP) or similar wear/tear port
> issues. This causes ITL flows that go through the faulty switch/port to
> intermittently experience frame drops.
> 2) There exist SAN topologies where switch ports in the fabric become the
> only conduit for many different ITL (host--target--LUN) flows across
> multiple hosts. These single network paths are essentially shared across
> multiple ITL flows. Under these conditions, if the port link bandwidth
> cannot handle the net sum of the shared ITL flows' bandwidth going through
> the single path, then we can see intermittent network congestion problems.
> This condition is called network oversubscription. The intermittent
> congestion can delay SCSI exchange completion time (an increase in I/O
> latency is observed).
>
> To overcome the above network issues and many more such target issues,
> there are frame-level retries done in the HBA device firmware and I/O
> retries in the SCSI layer. These retries might succeed for the following
> reasons:
> 1) The intermittent switch/port issue is not observed.
> 2) The retry I/O is a new SCSI exchange. This SCSI exchange can take an
> alternate SAN path for the ITL flow, if such a SAN path exists.
> 3) Network congestion disappears momentarily because the net I/O bandwidth
> coming from multiple ITL flows on the single shared network path is
> something the path can handle.
>
> However, in some cases we have seen that I/O retries do not succeed,
> because the retried I/Os hit a SAN network path that has an intermittent
> switch/port issue and/or network congestion.
>
> On the host we thus see configurations with two or more ITL paths sharing
> the same target/LUN and going through two or more HBA ports. These HBA
> ports are connected through two or more SANs to the same target/LUN.
> If the I/O fails at the multipath layer, the ITL path is moved into the
> Failed state. Because of the marginal nature of the network, the next
> health-check command sent from the multipath layer might succeed, which
> moves the ITL path back into the Active state. You end up seeing the DM
> path state going through Active, Failed, Active transitions. This results
> in an overall reduction in application I/O throughput and sometimes in
> application I/O failures (because of timing constraints). All of this can
> happen because of I/O retries and I/O requests moving across multiple
> paths of the DM device. On the host, note that all I/O retries on a single
> path and all I/O movement across multiple paths slow down the forward
> progress of new application I/O.
> The reason is that the above I/O re-queue actions are given higher
> priority than the newer I/O requests coming from the application.
>
> The above condition of the ITL path is hence called "marginal".
>
> What we desire is for DM to deterministically categorize an ITL path as
> "marginal" and to move all pending I/Os from the marginal path to an
> active path. This will help in meeting application I/O timing constraints.
> We also want the capability to automatically re-instate the marginal path
> into the Active state once the marginal condition in the network is fixed.
>
>
> Here is a description of the implementation:
> 1) PATCH 1/2 implements the algorithm that sends a series of continuous
> IOs to a path which suffers two failure events in less than a given time.
> Those IOs are sent at a fixed rate of 10 Hz.
> 2) PATCH 2/2 discards the original san_path_err_XXX algorithm, because the
> sampling interval of the regular path checker it relies on is so coarse
> that it misses what happens in the middle of the interval. PATCH 1/2 is a
> better method.
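>
> To make the new interface concrete, here is a minimal illustration of how
> the accounting might be enabled in multipath.conf. The values are
> placeholders only, not recommendations; marginal_path_double_failed_time
> and marginal_path_err_recheck_gap_time are the names mentioned above,
> while the sample-window and rate-threshold names below are assumed from
> the marginal_path_err_XXX family added by PATCH 1/2. Times are in seconds
> and the threshold is a permillage; the authoritative names, units and
> defaults are in the updated multipath.conf.5 of this series:
>
>     defaults {
>         # illustrative values only; see multipath.conf.5 from this series
>         marginal_path_double_failed_time    60
>         marginal_path_err_sample_time       120
>         marginal_path_err_rate_threshold    10
>         marginal_path_err_recheck_gap_time  300
>     }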
>
>
> Changes from V6:
> * fix the warning about the unwrapped commit description in patch 1/2
> * add Reviewed-by tag of Muneendra
> * add a detailed scenario description in the cover letter
>
> Changes from V5:
> * rebase on the latest release 0.7.3
>
>
> Changes from V4:
> * path_io_err_XXX -> marginal_path_err_XXX (Muneendra)
> * add one more parameter, marginal_path_double_failed_time, instead of
>   the fixed 60 seconds for the pre-checking of a shaky path (Martin)
> * fix for the "reschedule checking after %d seconds" log
> * path_io_err_recovery_time -> marginal_path_err_recheck_gap_time
> * put the marginal path into PATH_SHAKY instead of PATH_DELAYED
> * modify the commit comments to sync with the changes above
>
>
> Changes from V3:
> * add a patch to discard the san_path_XXX feature
> * fail the path in the kernel before enqueueing the path for checking,
>   rather than after knowing the checking result, to make it more
>   reliable (Martin)
> * use posix_memalign instead of manual alignment for the direct IO
>   buffer (Martin)
> * use PATH_MAX instead of FILE_NAME_SIZE to avoid a compiler warning
>   when opening a file (Martin)
> * discard an unnecessary sanity check when getting the block size (Martin)
> * do not return 0 in send_each_async_io if io_starttime of a path is
>   not set (Martin)
> * wait 10 ms instead of 60 seconds if every path is down (Martin)
> * rename handle_async_io_timeout to poll_async_io_timeout and use a
>   polling method, because io_getevents does not return 0 if there are
>   both timed-out IOs and normal IOs
> * rename hit_io_err_recover_time to hit_io_err_recheck_time
> * modify multipath.conf.5 and the commit comments to keep them in sync
>   with the above changes
>
>
> Changes from V2:
> * fix unconditional rescheduling forever
> * use scripts/checkpatch.pl from Linux to clean up informal coding style
> * fix "continous" and "internel" typos
>
>
> Changes from V1:
> * send continuous IO instead of a single IO in a sample interval (Martin)
> * when the recovery time expires, reschedule the checking process (Hannes)
> * use the error rate threshold as a permillage instead of an IO number (Martin)
> * use a common io_context for libaio for all paths (Martin)
> * other small fixes (Martin)
>
>
> Junxiong Guan (2):
>   multipath-tools: intermittent IO error accounting to improve
>     reliability
>   multipath-tools: discard san_path_err_XXX feature
>
>  libmultipath/Makefile      |   5 +-
>  libmultipath/config.c      |   3 -
>  libmultipath/config.h      |  21 +-
>  libmultipath/configure.c   |   7 +-
>  libmultipath/dict.c        |  88 +++---
>  libmultipath/io_err_stat.c | 744 +++++++++++++++++++++++++++++++++++++++++++++
>  libmultipath/io_err_stat.h |  15 +
>  libmultipath/propsel.c     |  70 +++--
>  libmultipath/propsel.h     |   7 +-
>  libmultipath/structs.h     |  15 +-
>  libmultipath/uevent.c      |  32 ++
>  libmultipath/uevent.h      |   2 +
>  multipath/multipath.conf.5 |  89 ++++--
>  multipathd/main.c          | 140 ++++-----
>  14 files changed, 1043 insertions(+), 195 deletions(-)
>  create mode 100644 libmultipath/io_err_stat.c
>  create mode 100644 libmultipath/io_err_stat.h
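
P.S. Here is the stand-alone sketch of the probing idea from PATCH 1/2
mentioned above, in case it helps review. It only illustrates the principle
on a single path: 4 KiB O_DIRECT reads issued through libaio at a fixed
10 Hz rate for a sample window, with the failure count reported as a
permillage. It is NOT the io_err_stat.c code from the patch, which shares
one io_context across all paths, polls for IO timeouts with
poll_async_io_timeout and fails the path in the kernel before enqueueing it
for checking.

/*
 * Simplified, illustrative probe loop (not the patch code).
 * Build: gcc -o probe_sketch probe_sketch.c -laio
 * Usage: ./probe_sketch /dev/sdX
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLKSIZE   4096   /* one aligned block per probe read            */
#define SAMPLES   100    /* 100 probes at 10 Hz ~= a 10 second window   */
#define THRESHOLD 10     /* illustrative permillage above which the     */
                         /* path would be treated as marginal           */

int main(int argc, char **argv)
{
    io_context_t ctx;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    struct timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
    void *buf;
    int fd, i, errors = 0, rate;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <path device>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* direct IO needs an aligned buffer, hence posix_memalign (cf. V3 changes) */
    if (posix_memalign(&buf, BLKSIZE, BLKSIZE)) {
        perror("posix_memalign");
        return 1;
    }
    memset(&ctx, 0, sizeof(ctx));
    if (io_setup(1, &ctx) < 0) {
        perror("io_setup");
        return 1;
    }

    for (i = 0; i < SAMPLES; i++) {
        io_prep_pread(&cb, fd, buf, BLKSIZE, 0);
        /* a submit failure, a timed-out probe or a short/failed read all
         * count as one error; for simplicity a timed-out IO is not reaped */
        if (io_submit(ctx, 1, cbs) != 1 ||
            io_getevents(ctx, 1, 1, &ev, &ts) != 1 ||
            ev.res != BLKSIZE)
            errors++;
        usleep(100000);              /* fixed 10 Hz probe rate */
    }

    rate = errors * 1000 / SAMPLES;
    printf("error rate: %d/1000 (threshold %d/1000) -> %s\n",
           rate, THRESHOLD, rate > THRESHOLD ? "marginal" : "good");

    io_destroy(ctx);
    free(buf);
    close(fd);
    return 0;
}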