Hi Benjamin, Thank you very much for your comments. Please find my replys and the up-to-date patch. Best regards! > > First, one overall question. We have dynamic path selectors available to > deal with paths that are just simply slower that other paths, but can > still be used together. Is there specific hardware or a specific setup > where this isn't good enough and we really need to seperate these paths > into different pathgroups, but we can't find out deterministically how > the groups should be set up? It just seems like there could be a less > hacky solution to this problem, but perhaps there are some situations > where this is truly the best option. I'm just wondering what those are.1. In the Storage-Backup environment of HyperCluster,includs one storage array near to the host and one remote storage array, and the two storage arrays have the same hardware. The same LUN is writed or readed by the two storage arrays. However, usually, the average latency of the paths of the remote storage array is much higher than the near storage array's. apparently, the prioritizer can be a good automatic solution. And the current selectors don't solve it, IOs will send to the paths of the remote storage array, IOPS will be influenced unavoidably. 2. In the environment of single storage array, the prioritizer can automatically separate the paths who's latency is much higher, IOs will not send to this paths. But the current selectors don't solve this problem, IOPS will be influenced unavoidably. >> + >> +/* interval_unit_str and interval_unit_type keep the same assignment sequence */ >> +static const char interval_unit_str[][MAX_CHAR_SIZE] = { >> + CHAR_USEC, CHAR_MSEC, CHAR_SEC > > This is a nit, but for constant strings, could you please use "char > *var" instead of "char var[]", to be consistent with the rest of the > multipath code. Thanks, as the following patch. >> + if ((args == NULL) || (interval == NULL) >> + || (consnum == NULL) || (type == NULL)) >> + return 0; >> + >> + /* int type */ >> + if ((size < 1) || (size > MAX_CHAR_SIZE-1)) >> + return 0; > > You should probably have log messages for these error returns. Thanks, as the following patch. >> + >> + memcpy(source, args, size+1); >> + if (strstr(source, vertica) == NULL) >> + return 0; >> + >> + *type = get_interval_type(source, typestr); >> + if (*type == INTERVAL_INVALID) >> + { >> + condlog(0, "delay_interval type is invalid"); >> + return 0; >> + } > > I'm confused here. How do you get to use the default interval. Shouldn't > you accept "20s|" and "|30" and as valid inputs that use the defaults > for the part they don't specify. OK,the default arguments value is removed. If get inputs failed, return default priority "0". As the following patch. > >> + tokenbefore = strtok(source, vertica); >> + tokenafter = strtok(NULL, vertica); >> + typestr[1] = '\0'; >> + tokenbefore = strtok(tokenbefore, typestr); >> + if ((tokenbefore == NULL) || (tokenafter == NULL)) >> + return 0; >> + >> + tmp = tokenbefore; >> + while (*tmp != '\0') >> + if (!isdigit(*tmp++)) >> + { >> + condlog(0, "delay_interval string include invalid char"); >> + return 0; >> + } >> + >> + tmp = tokenafter; >> + while (*tmp != '\0') >> + if (!isdigit(*tmp++)) >> + { >> + condlog(0, "cons_num string include invalid char"); >> + return 0; >> + } >> + >> + *interval = atoi(tokenbefore); > > Why do you keep track of the type and the interval seperately? Can't you > just find out the type, and use that to multiply the interval once you > read it, and then just use that value, instead of keeping track of two > values across multiple functions? Thanks, as the following patch. >> + >> + if (pp->fd < 0) >> + return -PRIO_NO_INFORMATION; >> + >> + if (get_delay_pref_arg(args, &delay_interval, &cons_num, &type) == 0) >> + { >> + condlog(3, "%s: get delay arg fail", pp->dev); > > Why use the word "fail" in this message? Not setting prio_args to get > the defaults seems like a perfectly valid choice. The defaults are not used. Insteadly, return default priority "0". See below. >> diff --git a/libmultipath/prioritizers/delayedpath.h b/libmultipath/prioritizers/delayedpath.h >> new file mode 100644 >> index 0000000..d8213e9 >> --- /dev/null >> +++ b/libmultipath/prioritizers/delayedpath.h >> @@ -0,0 +1,17 @@ >> +#ifndef _DELAYEDPATH_H >> +#define _DELAYEDPATH_H >> + >> +#define PRIO_DELAYED_PATH "delayedpath" > > In order for the rest of the code to refer to this prioritizer, this > define should be in prio.h with the other prioritizer names, and as long > as delayedpath.c includes prio.h, there's no need to put it in > delayedpath.h. OK, as the following patch. > >> + >> +#define PRIO_NO_INFORMATION 5 > > The rest of the multipath code only cares if getprio returns a negative > number of not. It doesn't check what the specific negative number is. I > realize the the alua prioritizer returns a set of error codes, but they > aren't used, or even usable in their present form. If we wanted to have > better error reporting, we should set up a universal set of error codes > in prio.h, and have all prioritizers use them, instead of having each > prioritizer define its own error codes. There's no reason why your > prioritizer needs to return this error code instead of -1. OK, as the following patch. > >> + >> +#define USEC_PER_SEC 1000000LL >> +#define USEC_PER_MSEC 1000LL >> +#define USEC_PER_USEC 1LL >> + >> +static inline long long timeval_to_us(const struct timespec *tv) >> +{ >> + return ((long long) tv->tv_sec * USEC_PER_SEC) + (tv->tv_nsec >> 10); >> +} > > No other file besides delayedpath.c will likely be including this .h > file, so I don't see any purpose for these being defined here. In fact, > I don't see why you can't just have a .c file without a .h file like the > majority of prioritizers. I'm pretty sure that none of the prioritizers > really need their own .h file. OK, as the following patch. > >> +#endif >> diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5 >> index 5939688..f1e126e 100644 >> --- a/multipath/multipath.conf.5 >> +++ b/multipath/multipath.conf.5 >> @@ -293,6 +293,10 @@ Generate a random priority between 1 and 10. >> Generate the path priority based on the regular expression and the >> priority provided as argument. Requires prio_args keyword. >> .TP >> +.I delayedpath >> +Generate the path priority based on a time-delay algorithm. >> +Requires prio_args keyword. > > Really it doesn't require prio_args if you want to use the default > values, and should probably say so. The default args is discarded, as the following patch. >> +.I delayed >> +Needs a value of the form >> +\fI"<delay_interval|cons_num>"\fR >> +.RS >> +.TP 8 >> +.I delay_interval >> +The interval values of average IO-time-delay between two different neighbour ranks of path priority, used to partition different priority ranks. >> +Form: XXs, or XXXus, or XXXms. Unit: Second, or Microsecond, or Millisecond. Valid Values: Integer, s [1, 60], ms [1, 1000), us [1, 1000), >> +For example: 10s, or 100us, or 100ms. The default is: 10ms. >> +.TP >> +.I cons_num >> +The number of read IOs sent to the current path continuously, used to calculate the average IO-time-delay. Valid Values: Integer, [3, 1000]. >> +For example: 30. The default is: 20. >> +.RE >> +.TP 12 > > Looking at the "weighted" prio_args definition just above your "delayed" > definition, the pipe character "|" is being used to say that any of a > set of options is allowed. Your definition has it being a literal > character, but it's still inside the angle brackets that usually > delineate a variable. perhaps "<delay_interval>|<io_num>" would be > easier to understand, or even "[delayed_interval]|[io_num]" if you can > omit these to use the defaults. OK, as the following patch. The up-to-date patch as follows: --- >From 58d718fdd34550bd9c4a32c6e9a87099c1e45a9f Mon Sep 17 00:00:00 2001 From: Yang Feng <philip.yang@xxxxxxxxxx> Date: Fri, 19 May 2017 16:09:07 +0800 Subject: [PATCH] libmultipath/prioritizers: Prioritizer for device mapper multipath, where the corresponding priority values of specific paths are provided by a latency algorithm. And the latency algorithm is dependent on the following arguments(latency_interval and io_num). The principle of the algorithm is illustrated as follows: 1. By sending a certain number "cons_num" of read IOs to the current path continuously, the IOs' average latency can be calculated. 2. According to the average latency of each path and the weight value "latency_interval", the priority "rc" of each path can be provided. latency_interval latency_interval latency_interval latency_interval |------------------|------------------|------------------|...|------------------| | priority rank 1 | priority rank 2 | priority rank 3 |...| priority rank x | |------------------|------------------|------------------|...|------------------| Priority Rank Partitioning --- libmultipath/prioritizers/Makefile | 6 +- libmultipath/prioritizers/path_latency.c | 271 +++++++++++++++++++++++++++++++ multipath/multipath.conf.5 | 18 ++ libmultipath/prio.h | 1 + 4 files changed, 295 insertions(+), 1 deletion(-) create mode 100644 libmultipath/prioritizers/path_latency.c diff --git a/libmultipath/prioritizers/Makefile b/libmultipath/prioritizers/Makefile index 36b42e4..d2f20f6 100644 --- a/libmultipath/prioritizers/Makefile +++ b/libmultipath/prioritizers/Makefile @@ -18,13 +18,17 @@ LIBS = \ libpriorandom.so \ libpriordac.so \ libprioweightedpath.so \ - libpriosysfs.so + libpriopath_latency.so \ + libpriosysfs.so all: $(LIBS) libprioalua.so: alua.o alua_rtpg.o $(CC) $(LDFLAGS) $(SHARED_FLAGS) -o $@ $^ +libpriopath_latency.so: path_latency.o ../checkers/libsg.o + $(CC) $(LDFLAGS) $(SHARED_FLAGS) -o $@ $^ -lm + libprio%.so: %.o $(CC) $(LDFLAGS) $(SHARED_FLAGS) -o $@ $^ diff --git a/libmultipath/prioritizers/path_latency.c b/libmultipath/prioritizers/path_latency.c new file mode 100644 index 0000000..a666b6c --- /dev/null +++ b/libmultipath/prioritizers/path_latency.c @@ -0,0 +1,271 @@ +/* + * (C) Copyright HUAWEI Technology Corp. 2017, 2021 All Rights Reserved. + * + * main.c + * + * Prioritizer for device mapper multipath, where the corresponding priority + * values of specific paths are provided by a latency algorithm. And the + * latency algorithm is dependent on arguments. + * + * The principle of the algorithm as follows: + * 1. By sending a certain number "io_num" of read IOs to the current path + * continuously, the IOs' average latency can be calculated. + * 2. According to the average latency of each path and the weight value + * "latency_interval", the priority "rc" of each path can be provided. + * + * Author(s): Yang Feng <philip.yang@xxxxxxxxxx> + * Zou Ming <zouming.zouming@xxxxxxxxxx> + * + * This file is released under the GPL. + */ +#include <stdio.h> +#include <math.h> +#include <ctype.h> +#include <time.h> + +#include "debug.h" +#include "prio.h" +#include "structs.h" +#include "../checkers/libsg.h" + +#define THRES_USEC_VALUE 120000000LL /*unit: us, =120s*/ + +#define MAX_IO_NUM 200 +#define MIN_IO_NUM 10 + +#define MAX_LATENCY_INTERVAL 60 /*unit: s*/ +#define MIN_LATENCY_INTERVAL 1 /*unit: us, or ms, or s*/ + +#define DEFAULT_PRIORITY 0 + +#define MAX_CHAR_SIZE 30 + +#define CHAR_USEC "us" +#define CHAR_MSEC "ms" +#define CHAR_SEC "s" + +enum interval_type { + INTERVAL_USEC, + INTERVAL_MSEC, + INTERVAL_SEC, + INTERVAL_INVALID +}; + +/* interval_unit_str and interval_unit_type keep the same assignment sequence */ +static const char *interval_unit_str[MAX_CHAR_SIZE] = { + CHAR_USEC, CHAR_MSEC, CHAR_SEC +}; +static const int interval_unit_type[] = { + INTERVAL_USEC, INTERVAL_MSEC, INTERVAL_SEC +}; + +#define USEC_PER_SEC 1000000LL +#define USEC_PER_MSEC 1000LL +#define USEC_PER_USEC 1LL + +static const int conversion_ratio[] = { + [INTERVAL_USEC] = USEC_PER_USEC, + [INTERVAL_MSEC] = USEC_PER_MSEC, + [INTERVAL_SEC] = USEC_PER_SEC, + [INTERVAL_INVALID] = 0 +}; + +static long long path_latency[MAX_IO_NUM]; + +static inline long long timeval_to_us(const struct timespec *tv) +{ + return ((long long) tv->tv_sec * USEC_PER_SEC) + (tv->tv_nsec >> 10); +} + +static int do_readsector0(int fd, unsigned int timeout) +{ + unsigned char buf[4096]; + unsigned char sbuf[SENSE_BUFF_LEN]; + int ret; + + ret = sg_read(fd, &buf[0], 4096, &sbuf[0], + SENSE_BUFF_LEN, timeout); + + return ret; +} + +int check_args_valid(int io_num, long long latency_interval, int type) +{ + if ((io_num < MIN_IO_NUM) || (io_num > MAX_IO_NUM)) + { + condlog(0, "args io_num is more than the valid values range"); + return 0; + } + + /* s:[1, 60], ms:[1, 60000], us:[1, 60000000] */ + if ((latency_interval < MIN_LATENCY_INTERVAL) || (latency_interval > (MAX_LATENCY_INTERVAL * USEC_PER_SEC / conversion_ratio[type]))) + { + condlog(0, "args latency_interval is more than the valid values range"); + return 0; + } + + return 1; +} + +static int get_interval_type(char *type) +{ + int index; + + for (index = 0; index < sizeof(interval_unit_str)/MAX_CHAR_SIZE; index++) + { + if (strcmp(type, interval_unit_str[index]) == 0) + { + return interval_unit_type[index]; + } + } + + return INTERVAL_INVALID; +} + +long long get_conversion_ratio(int type) +{ + return conversion_ratio[type]; +} + +/* In multipath.conf, args form: io_num|latency_interval. For example, +* args is "20|10ms", this function can get 20, 10. +*/ +static int get_interval_and_ionum(char *args, + int *ionum, + long long *interval) +{ + char source[MAX_CHAR_SIZE]; + char vertica = '|'; + char *endstrbefore = NULL; + char *endstrafter = NULL; + int type; + unsigned int size = strlen(args); + long long ratio; + + if ((args == NULL) || (ionum == NULL) || (interval == NULL)) + { + condlog(0, "args string is NULL"); + return 0; + } + + if ((size < 1) || (size > MAX_CHAR_SIZE-1)) + { + condlog(0, "args string's size is too long"); + return 0; + } + + memcpy(source, args, size+1); + + if (!isdigit(source[0])) + { + condlog(0, "args io_num string's first char is not digit"); + return 0; + } + + *ionum = (int)strtoul(source, &endstrbefore, 10); + if (endstrbefore[0] != vertica) + { + condlog(0, "segmentation char is invalid"); + return 0; + } + + if (!isdigit(endstrbefore[1])) + { + condlog(0, "args latency_interval string's first char is not digit"); + return 0; + } + + *interval = (long long)strtol(&endstrbefore[1], &endstrafter, 10); + type = get_interval_type(endstrafter); + if (type == INTERVAL_INVALID) + { + condlog(0, "args latency_interval type is invalid"); + return 0; + } + + if (check_args_valid(*ionum, *interval, type) == 0) + { + return 0; + } + + ratio = get_conversion_ratio(type); + *interval *= (long long)ratio; + + return 1; +} + +long long calc_standard_deviation(long long *path_latency, int size, long long avglatency) +{ + int index; + long long total = 0; + + for (index = 0; index < size; index++) + { + total += (path_latency[index] - avglatency) * (path_latency[index] - avglatency); + } + + total /= (size-1); + + return (long long)sqrt((double)total); +} + +int getprio (struct path *pp, char *args, unsigned int timeout) +{ + int rc, temp; + int index = 0; + int io_num; + long long latency_interval; + long long avglatency; + long long standard_deviation; + long long toldelay = 0; + long long before, after; + struct timespec tv; + + if (pp->fd < 0) + return -1; + + if (get_interval_and_ionum(args, &io_num, &latency_interval) == 0) + { + condlog(0, "%s: get path_latency args fail", pp->dev); + return DEFAULT_PRIORITY; + } + + memset(path_latency, 0, sizeof(path_latency)); + + temp = io_num; + while (temp-- > 0) + { + (void)clock_gettime(CLOCK_MONOTONIC, &tv); + before = timeval_to_us(&tv); + + if (do_readsector0(pp->fd, timeout) == 2) + { + condlog(0, "%s: path down", pp->dev); + return -1; + } + + (void)clock_gettime(CLOCK_MONOTONIC, &tv); + after = timeval_to_us(&tv); + + path_latency[index] = after - before; + toldelay += path_latency[index++]; + } + + avglatency = toldelay/(long long)io_num; + condlog(4, "%s: average latency is (%lld)", pp->dev, avglatency); + + if (avglatency > THRES_USEC_VALUE) + { + condlog(0, "%s: average latency (%lld) is more than thresold", pp->dev, avglatency); + return DEFAULT_PRIORITY; + } + + /* warn the user if the latency_interval set is smaller than (2 * standard deviation), or equal */ + standard_deviation = calc_standard_deviation(path_latency, index, avglatency); + if (latency_interval <= (2 * standard_deviation)) + condlog(3, "%s: args latency_interval set is smaller than 2 * standard deviation (%lld us), or equal", + pp->dev, standard_deviation); + + rc = (int)(THRES_USEC_VALUE - (avglatency/(long long)latency_interval)); + return rc; +} diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5 index 5939688..3dd0d77 100644 --- a/multipath/multipath.conf.5 +++ b/multipath/multipath.conf.5 @@ -293,6 +293,10 @@ Generate a random priority between 1 and 10. Generate the path priority based on the regular expression and the priority provided as argument. Requires prio_args keyword. .TP +.I path_latency +Generate the path priority based on a latency algorithm. +Requires prio_args keyword. +.TP .I datacore .\" XXX ???. Requires prio_args keyword. @@ -333,6 +337,20 @@ these values can be looked up through sysfs or by running \fImultipathd show pat "%N:%R:%n:%r"\fR. For example: 0x200100e08ba0aea0:0x210100e08ba0aea0:.*:.* , .*:.*:iqn.2009-10.com.redhat.msp.lab.ask-06:.* .RE .TP 12 +.I path_latency +Needs a value of the form +\fI"<latency_interval>|<io_num>"\fR +.RS +.TP 8 +.I latency_interval +The interval values of average latency between two different neighbour ranks of path priority, used to partition different priority ranks. +Form: XXs, or XXXus, or XXXms. Unit: Second, or Microsecond, or Millisecond. Valid Values: Integer, s [1, 60], ms [1, 60000], us [1, 60000000], +For example: If latency_interval=10ms, the paths will be grouped in priority groups with path latency 0-10ms, 10-20ms, 20-30ms, etc.. +.TP +.I io_num +The number of read IOs sent to the current path continuously, used to calculate the average path latency. Valid Values: Integer, [10, 200]. +.RE +.TP 12 .I alua If \fIexclusive_pref_bit\fR is set, paths with the \fIpreferred path\fR bit set will always be in their own path group. diff --git a/libmultipath/prio.h b/libmultipath/prio.h index 0193c52..c97fe39 100644 --- a/libmultipath/prio.h +++ b/libmultipath/prio.h @@ -29,6 +29,7 @@ struct path; #define PRIO_RDAC "rdac" #define PRIO_WEIGHTED_PATH "weightedpath" #define PRIO_SYSFS "sysfs" +#define PRIO_PATH_LATENCY "path_latency" /* * Value used to mark the fact prio was not defined -- -- dm-devel mailing list dm-devel@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/dm-devel