On 08/22/2017 12:07 PM, Guan Junxiong wrote:
> This patch adds a new method of path state checking based on accounting
> IO errors. This is useful in many scenarios, such as intermittent IO
> errors on a path due to network congestion, or a shaky link.
>
> Three parameters are added for the admin: "path_io_err_sample_time",
> "path_io_err_num_threshold" and "path_io_err_recovery_time".
> If path_io_err_sample_time and path_io_err_recovery_time are set to a
> value greater than 0, then when a path fail event occurs due to an IO
> error, multipathd will enqueue this path into a queue; each member of
> the queue is sent direct-read asynchronous I/O at a fixed sample rate
> of 100 Hz. The IO accounting process for a path lasts for
> path_io_err_sample_time. If the number of IO errors on a particular
> path is greater than path_io_err_num_threshold, the path will not be
> reinstated for path_io_err_recovery_time.
>
> This helps us place the path in a delayed state if we hit a lot of
> intermittent IO errors on a particular path due to network/target
> issues, isolate such a degraded path, and allow the admin to rectify
> the errors on the path.
>
> Signed-off-by: Junxiong Guan <guanjunxiong@xxxxxxxxxx>
> ---

There have been several attempts at this over the years; if you check
the mail archive for 'flaky patch' you are bound to hit several threads
discussing it. However, each has foundered on several problems:

- As we now have advanced path selectors, the overall consensus is that
  those selectors _should_ be able to handle these situations; ie for a
  flaky path the path selector should switch away from it and move the
  load to other, unaffected paths.
  Have you checked whether the existing path selectors are able to cope
  with this situation? If not, why not?

- But even if something like this is implemented, the real problem here
  is reliability. Multipath internally only considers two real path
  states: usable and unusable.
  Consequently the flaky path needs to be placed in one of these; so
  with your patch, after enough errors accumulate, the flaky path will
  eventually be placed in an unusable state. If a failover event
  occurs, the daemon cannot switch to the flaky paths, and the system
  becomes unusable even though I/O could still be sent via the flaky
  paths.

- However flaky path detection is implemented, it will work most
  efficiently when moving I/O _away_ from the flaky path. But in doing
  so we don't have a mechanism to figure out if and when the path is
  usable again (as we're not sending I/O to it, and the TUR or any
  other path checker might not be affected by the flaky behaviour). So
  when should we declare a path 'good' again?

Cheers,

Hannes
--
Dr. Hannes Reinecke                Teamlead Storage & Networking
hare@xxxxxxx                       +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel
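
[Editor's note: for readers unfamiliar with the patch under discussion,
the three knobs it proposes would presumably be set in multipath.conf
roughly as below. This is a sketch only; the section placement, the
example values, and the assumption that both time parameters are in
seconds are mine, not something the quoted description spells out.]

    defaults {
        path_io_err_sample_time    60    # how long to sample I/O errors on a failed path (assumed: seconds)
        path_io_err_num_threshold  10    # error count above which the path is not reinstated
        path_io_err_recovery_time  120   # how long to keep the path from being reinstated (assumed: seconds)
    }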