On Thu, 2017-08-24 at 17:59 +0800, Guan Junxiong wrote:
> Hi, Hannes
> Thanks for your comments. My reply inline.
>
> On 2017/8/22 23:37, Hannes Reinecke wrote:
> > - As we now have advanced path selectors the overall consensus is
> >   that those selectors _should_ be able to handle these situations;
> >   ie for a flaky path the path selector should switch away from it
> >   and move the load to other, unaffected paths.
> >   Have you checked if the existing path selectors are able to cope
> >   with this situation? If not, why not?
>
> The existing path selectors in kernel space are able to fail_path
> the flaky path when certain IO errors occur. However, only the
> user-space multipathd's checkers can detect whether the path is up.
> Therefore, for a path with long-time intermittent IO or a flaky path,
> the path selectors suffer from taking in the path and taking out the
> path _again_ _and_ _again_.
> Even if san_path_err_threshold, san_path_err_forget_rate and
> san_path_err_recovery_time are turned on, the sample interval of the
> path checkers is so big/coarse that they don't see what happens in
> the middle of the sample interval.

I have the concern that we are introducing too many different
regulation algorithms. We have path selectors, path checkers,
san_path_err_XXX, and now path_io_err_XXX as well. We must be certain
that these play together in a well-defined fashion (most importantly,
avoid one mechanism activating a path while another is in the process
of tearing it down, etc.). We must also avoid causing user confusion,
as multipath configuration is already a daunting task for many.

Your new algorithm should be mutually exclusive with san_path_err_XXX.
Perhaps we should even consider dropping the san_path_err_XXX options
entirely if we choose to adopt your new approach.

> > - However flaky path detection is implemented, it will work most
> >   efficiently when moving I/O _away_ from the flaky path. However,
> >   in doing so we don't have a mechanism to figure out if and when
> >   the path is useable again (as we're not sending I/O to it, and the
> >   TUR or any other path checker might not be affected by the flaky
> >   behaviour).
> >   So when should we declare a path as 'good' again?
>
> In this patch, the flaky path will stay only
> path_io_err_recovery_time seconds if there is more than one active
> path. After only path_io_err_recovery_time seconds, the flaky path
> will stay in normal, which means, when the path checker detects it
> is up, it will reinstate into the usable path.
>
> However, how about we schedule the intermittent IO checking process
> again when the path_io_err_recovery_time seconds expire? If the
> number of IO errors is less than path_io_err_num_threshold, we
> declare the path as 'good' again.

That sounds like a reasonable improvement over the original patch.

Regards,
Martin

-- 
Dr. Martin Wilck <mwilck@xxxxxxxx>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel