I hope this message finds you well. I have a CephFS cluster with 3 active MDS daemons, exported through the kernel client by a 3-node Samba setup. Currently, two of the MDS daemons are reporting slow requests. We have tried restarting them; after a few hours of journal replay they returned to the active state, but the slow requests reappear. The slow requests do not appear to come from clients, but from requests issued between the MDS daemons themselves. Looking forward to your prompt response.

HEALTH_WARN 2 MDSs report slow requests; 2 MDSs behind on trimming
[WRN] MDS_SLOW_REQUEST: 2 MDSs report slow requests
    mds.osd44(mds.0): 2 slow requests are blocked > 30 secs
    mds.osd43(mds.1): 2 slow requests are blocked > 30 secs
[WRN] MDS_TRIM: 2 MDSs behind on trimming
    mds.osd44(mds.0): Behind on trimming (18642/1024) max_segments: 1024, num_segments: 18642
    mds.osd43(mds.1): Behind on trimming (976612/1024) max_segments: 1024, num_segments: 976612

mds.0
{
    "ops": [
        {
            "description": "peer_request:mds.1:1",
            "initiated_at": "2023-12-31T11:19:38.679925+0800",
            "age": 4358.8009461359998,
            "duration": 4358.8009636369998,
            "type_data": {
                "flag_point": "dispatched",
                "reqid": "mds.1:1",
                "op_type": "peer_request",
                "leader_info": {
                    "leader": "1"
                },
                "events": [
                    { "time": "2023-12-31T11:19:38.679925+0800", "event": "initiated" },
                    { "time": "2023-12-31T11:19:38.679925+0800", "event": "throttled" },
                    { "time": "2023-12-31T11:19:38.679925+0800", "event": "header_read" },
                    { "time": "2023-12-31T11:19:38.679936+0800", "event": "all_read" },
                    { "time": "2023-12-31T11:19:38.679940+0800", "event": "dispatched" }
                ]
            }
        },
        {
            "description": "peer_request:mds.1:2",
            "initiated_at": "2023-12-31T11:19:38.679938+0800",
            "age": 4358.8009326969996,
            "duration": 4358.8009763549999,
            "type_data": {
                "flag_point": "dispatched",
                "reqid": "mds.1:2",
                "op_type": "peer_request",
                "leader_info": {
                    "leader": "1"
                },
                "events": [
                    { "time": "2023-12-31T11:19:38.679938+0800", "event": "initiated" },
                    { "time": "2023-12-31T11:19:38.679938+0800", "event": "throttled" },
                    { "time": "2023-12-31T11:19:38.679938+0800", "event": "header_read" },
                    { "time": "2023-12-31T11:19:38.679941+0800", "event": "all_read" },
                    { "time": "2023-12-31T11:19:38.679991+0800", "event": "dispatched" }
                ]
            }
        }
    ],
    "complaint_time": 30,
    "num_blocked_ops": 2
}

mds.1
{
    "ops": [
        {
            "description": "internal op exportdir:mds.1:1",
            "initiated_at": "2023-12-31T11:19:34.416451+0800",
            "age": 4384.38814198,
            "duration": 4384.3881617610004,
            "type_data": {
                "flag_point": "failed to wrlock, waiting",
                "reqid": "mds.1:1",
                "op_type": "internal_op",
                "internal_op": 5377,
                "op_name": "exportdir",
                "events": [
                    { "time": "2023-12-31T11:19:34.416451+0800", "event": "initiated" },
                    { "time": "2023-12-31T11:19:34.416451+0800", "event": "throttled" },
                    { "time": "2023-12-31T11:19:34.416451+0800", "event": "header_read" },
                    { "time": "2023-12-31T11:19:34.416451+0800", "event": "all_read" },
                    { "time": "2023-12-31T11:19:34.416451+0800", "event": "dispatched" },
                    { "time": "2023-12-31T11:19:38.679923+0800", "event": "requesting remote authpins" },
                    { "time": "2023-12-31T11:19:38.693981+0800", "event": "failed to wrlock, waiting" }
                ]
            }
        },
        {
            "description": "internal op exportdir:mds.1:2",
            "initiated_at": "2023-12-31T11:19:34.416482+0800",
            "age": 4384.3881117999999,
            "duration": 4384.3881714600002,
            "type_data": {
                "flag_point": "failed to wrlock, waiting",
                "reqid": "mds.1:2",
                "op_type": "internal_op",
                "internal_op": 5377,
                "op_name": "exportdir",
                "events": [
                    { "time": "2023-12-31T11:19:34.416482+0800", "event": "initiated" },
                    { "time": "2023-12-31T11:19:34.416482+0800", "event": "throttled" },
                    { "time": "2023-12-31T11:19:34.416482+0800", "event": "header_read" },
                    { "time": "2023-12-31T11:19:34.416482+0800", "event": "all_read" },
                    { "time": "2023-12-31T11:19:34.416482+0800", "event": "dispatched" },
                    { "time": "2023-12-31T11:19:38.679929+0800", "event": "requesting remote authpins" },
                    { "time": "2023-12-31T11:19:38.693995+0800", "event": "failed to wrlock, waiting" }
                ]
            }
        }
    ],
    "complaint_time": 30,
    "num_blocked_ops": 2
}
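For reference, the health summary and the blocked-op dumps above were collected with commands roughly like the following, run against the admin sockets of the two MDS daemons named in the warning (mds.osd44 and mds.osd43); adjust the daemon names for your own deployment:

    # overall cluster health, including the MDS_SLOW_REQUEST / MDS_TRIM warnings
    ceph health detail

    # blocked operations on each active MDS, via the local admin socket
    ceph daemon mds.osd44 dump_blocked_ops
    ceph daemon mds.osd43 dump_blocked_ops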
"throttled" }, { "time": "2023-12-31T11:19:34.416482+0800", "event": "header_read" }, { "time": "2023-12-31T11:19:34.416482+0800", "event": "all_read" }, { "time": "2023-12-31T11:19:34.416482+0800", "event": "dispatched" }, { "time": "2023-12-31T11:19:38.679929+0800", "event": "requesting remote authpins" }, { "time": "2023-12-31T11:19:38.693995+0800", "event": "failed to wrlock, waiting" } ] } } ], "complaint_time": 30, "num_blocked_ops": 2 } I can't find any other solution other than restarting the mds service with slow requests. Currently, the backlog of mds logs in the metadata pool exceeds 4TB. Best regards, _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx