MDS requests are buffered or dropped when the client's network is
saturated. Alleviate this by giving priority to MDS sockets.

Signed-off-by: Minjong Kim <make.dirty.code@xxxxxxxxx>
---
Hello.

I am very new to kernel development, so I would appreciate your
patience with my clumsy process. I'm not sure whether I should have
posted this as a general question or as a patch, or whether a comment
like this belongs here, but I'd like to add a few words.

I've found that once the client's network saturates, MDS requests drop
significantly. To address this, I added code to tag MDS sockets with
IP_TOS. However, there are some problems caused by my inexperience.

First, is it okay to use a higher-level function like ip_setsockopt?
It works fine, but I haven't seen any other kernel code use it. Do I
have to set something like skb->priority manually instead? I mainly
work on high-level code, so I'm cautious about whether I can access
these attributes directly.

Second, IP_TOS seems to be a deprecated option. These days it seems to
be handled through diffserv (which is compatible with IP_TOS), but I
couldn't find a function to set the DSCP directly. Using a function
like ip_setsockopt(..IP_TOS) therefore seems problematic, but I
couldn't find a way to solve this on my own.

Third, the benchmarks I ran seem to depend on many variables in the
computing environment. I repeated them several times as best I could,
but the results may vary because of my local environment.

Finally, this doesn't seem to be a perfect way to solve the problem.
MDS packets still seem to be buffered during bursts. Also, many
distributions these days use fq_codel by default, which doesn't
support diffserv. Still, tagging IP_TOS doesn't seem to make anything
worse (since the filesystem's workload is very small). cake, the
successor to fq_codel, supports it, so there is a possibility that
this will improve.

Thanks for reading this long post.
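For reference, the relationship between a DSCP code point and the byte
passed to IP_TOS can be sketched in userspace C. This is only an
illustration and not part of the patch: the helper names (dscp_to_tos,
tos_to_dscp, set_socket_dscp) are made up here, and the sketch uses the
ordinary setsockopt() call rather than any in-kernel interface. The
DSCP occupies the upper six bits of the TOS byte, so AF32 (DSCP 28)
corresponds to 0x70, while 0xb0 decodes to DSCP 44.

```c
#include <assert.h>
#include <stdint.h>
#include <netinet/in.h>   /* IPPROTO_IP, IP_TOS */
#include <sys/socket.h>   /* setsockopt */

/* A few standard DSCP code points. */
enum {
	DSCP_AF32 = 28,        /* class 3, drop precedence 2: 8*3 + 2*2 */
	DSCP_VOICE_ADMIT = 44,
	DSCP_EF = 46,
};

/* DSCP is the upper six bits of the TOS byte; the lower two bits are ECN. */
static inline uint8_t dscp_to_tos(uint8_t dscp)
{
	assert(dscp < 64);
	return (uint8_t)(dscp << 2);
}

/* Inverse mapping: drop the two ECN bits. */
static inline uint8_t tos_to_dscp(uint8_t tos)
{
	return tos >> 2;
}

/* Hypothetical userspace helper: tag an IPv4 socket with a DSCP value. */
static inline int set_socket_dscp(int fd, uint8_t dscp)
{
	int tos = dscp_to_tos(dscp);

	return setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
}
```

With these helpers, dscp_to_tos(DSCP_AF32) yields 0x70 and
tos_to_dscp(0xb0) yields 44, which may be worth checking against the
value used in the diff below.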
Apart from the shortcomings in my code, please also forgive my
shortcomings in following the kernel contribution process.

 net/ceph/messenger_v1.c | 14 ++++++++++++++
 net/ceph/messenger_v2.c | 13 +++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
index 3ddbde87e4d6..bab6ec4af82c 100644
--- a/net/ceph/messenger_v1.c
+++ b/net/ceph/messenger_v1.c
@@ -6,6 +6,7 @@
 #include <linux/net.h>
 #include <linux/socket.h>
 #include <net/sock.h>
+#include <net/ip.h>
 
 #include <linux/ceph/ceph_features.h>
 #include <linux/ceph/decode.h>
@@ -1423,6 +1424,19 @@ int ceph_con_v1_try_write(struct ceph_connection *con)
 			con->error_msg = "connect error";
 			goto out;
 		}
+
+		if (con->peer_name.type == CEPH_ENTITY_TYPE_MDS) {
+			__u8 tos_mds = 0xb0; /* DSCP 44; AF32 would be 0x70 */
+
+			ret = ip_setsockopt(con->sock->sk, SOL_IP, IP_TOS,
+					    KERNEL_SOCKPTR(&tos_mds), 1);
+
+			if (ret) {
+				pr_err("ip_setsockopt failed: %d\n", ret);
+				con->error_msg = "connect error";
+				return ret;
+			}
+		}
 	}
 
 more:
diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c
index cc8ff81a50b7..d87430f333c9 100644
--- a/net/ceph/messenger_v2.c
+++ b/net/ceph/messenger_v2.c
@@ -3180,6 +3180,19 @@ int ceph_con_v2_try_write(struct ceph_connection *con)
 			con->error_msg = "connect error";
 			return ret;
 		}
+
+		if (con->peer_name.type == CEPH_ENTITY_TYPE_MDS) {
+			__u8 tos_mds = 0xb0; /* DSCP 44; AF32 would be 0x70 */
+
+			ret = ip_setsockopt(con->sock->sk, SOL_IP, IP_TOS,
+					    KERNEL_SOCKPTR(&tos_mds), 1);
+
+			if (ret) {
+				pr_err("ip_setsockopt failed: %d\n", ret);
+				con->error_msg = "connect error";
+				return ret;
+			}
+		}
 	}
 
 	if (!iov_iter_count(&con->v2.out_iter)) {
-- 
2.25.1