Hi Muneedra,

> The san_path_err_XX feature was added by me and pushed to the
> upstream. And this feature was driven by Brocade customer feedback.
>
> The link below will give the history of this, where a couple of
> discussions happened before we started this feature:
>
> https://www.redhat.com/archives/dm-devel/2017-January/msg00025.html

I'm aware that you authored the feature. I was not aware of the post
you quoted, thanks for the link. Anyway, you mentioned in that post
that the interested customers were using RHEL. Have you made them
upgrade their multipath-tools to a recent upstream release in order to
use the san_path_err and/or marginal_path features?

> Our requirement was simple. For example, if there are two paths on
> dm-1, say sda and sdb, as below:
>
> # multipath -ll
> mpathd (3600110d001ee7f0102050001cc0b6751) dm-1 SANBlaze,VLUN MyLun
> size=8.0M features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=50 status=active
>   |- 8:0:1:0 sda 8:48 active ready running
>   `- 9:0:1:0 sdb 8:64 active ready running
>
> And on sda I am seeing a lot of errors, due to which the sda path is
> fluctuating from the failed state to the active state and vice versa.
>
> The requirement was something like this: if sda fails (moves from the
> active to the failed state) more than X times within a duration Y,
> then I want to keep sda in the failed state for a duration Z.

Thanks for clarifying what you meant by "is failed". I had been
wondering whether it meant "good"->"failed" transitions, as you just
confirmed, or the overall "failed" state count.

> And the data should travel only through the sdb path for Z hrs.
>
> From the configuration point of view:
>
> san_path_err_threshold: the number of times sda has been moved from
> active to failed (from the above example, this is X)
>
> san_path_err_forget_rate: the watch window (within this time frame,
> if the path failures (sda moving from active to failed) exceed the
> error threshold, then don't reinstate the path) (from the above
> example, this is Y)

The "watch window" analogy fits if you have a stable path (no or only
very rare failures over extended periods of time) which suddenly starts
fluctuating. More precisely, a "background" failure rate clearly below
"san_path_err_forget_rate", alternating with problematic periods in
which the failure rate is significantly higher than
"san_path_err_forget_rate". And that is the situation the algorithm was
made for, right?

In general, the time (in ticks) to reach the threshold is

    t = T / max(1/R - 1/F, 0)

where T is san_path_err_threshold, R is the average time (in ticks)
between "good"->"failed" transitions of the path, and F is
san_path_err_forget_rate (aka the time in ticks after which
"path_failures" is decremented by 1).

If R >= F, t is infinite; the "path_failures" count effectively stays
at 0. If R is much smaller than F, t ~ T * R. If R is only a little bit
smaller than F, t is finite but (possibly much) larger than T * R.
That's why I sloppily called F the "maximum tolerable failure rate" in
my previous post. (A toy simulation illustrating this is appended after
the signature.)

Best regards,
Martin

--
Dr. Martin Wilck <mwilck@xxxxxxxx>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
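P.S.: For illustration, here is a small, self-contained Python sketch of
the counter dynamics described above. It is not the multipath-tools C
implementation; it only models a hypothetical "path_failures" counter
that is incremented on every "good"->"failed" transition (one every R
ticks) and decremented once every F ticks, and compares the tick at
which it reaches the threshold T with the estimate t = T / (1/R - 1/F).
The function name ticks_to_reach_threshold and the sample (R, F) values
are made up for the example.

# Toy model only -- not the multipath-tools implementation.
def ticks_to_reach_threshold(T, R, F, max_ticks=1_000_000):
    """Return the first tick at which the counter reaches T, or None
    if it never does within max_ticks (the R >= F case)."""
    path_failures = 0
    for tick in range(1, max_ticks + 1):
        if tick % R == 0:                   # "good" -> "failed" transition
            path_failures += 1
        if tick % F == 0 and path_failures > 0:
            path_failures -= 1              # forget one failure every F ticks
        if path_failures >= T:
            return tick
    return None

# Compare the simulation against the estimate t = T / (1/R - 1/F):
T = 5
for R, F in [(10, 1000), (10, 12), (20, 10)]:
    estimate = T / (1.0 / R - 1.0 / F) if R < F else float("inf")
    simulated = ticks_to_reach_threshold(T, R, F)
    print(f"R={R:3d} F={F:4d} simulated={simulated} estimate={estimate:.0f}")

The three (R, F) pairs correspond to the three regimes above: R much
smaller than F (the simulated tick count is close to T * R), R only a
little smaller than F (the counter reaches T, but much later than
T * R), and R >= F (the counter never reaches T).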