I hope this message finds you well.
I have a CephFS cluster with 3 active MDS daemons, and I export it
through a 3-node Samba gateway on top of the kernel client.
Currently, two of the MDS daemons are reporting slow requests. We have
tried restarting the affected MDS daemons; after a few hours of journal
replay, the ranks came back to up:active (the restart procedure is
sketched below).
But the slow requests then reappear. They do not seem to come from the
clients; rather, they are internal MDS-to-MDS requests (exportdir ops on
mds.1 and the matching peer_request ops on mds.0, as the dumps below show).
Looking forward to your prompt response.
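
For reference, a "restart" here means bouncing the ceph-mds daemon on
the affected host and waiting for the rank to come back, roughly like
this (the exact unit/daemon name depends on the deployment; osd44 is
just the daemon name taken from the health output below):

systemctl restart ceph-mds@osd44.service
# then watch the rank go through up:replay back to up:active
ceph fs status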
HEALTH_WARN 2 MDSs report slow requests; 2 MDSs behind on trimming
[WRN] MDS_SLOW_REQUEST: 2 MDSs report slow requests
mds.osd44(mds.0): 2 slow requests are blocked > 30 secs
mds.osd43(mds.1): 2 slow requests are blocked > 30 secs
[WRN] MDS_TRIM: 2 MDSs behind on trimming
mds.osd44(mds.0): Behind on trimming (18642/1024) max_segments: 1024, num_segments: 18642
mds.osd43(mds.1): Behind on trimming (976612/1024) max_segments: 1024, num_segments: 976612
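
The per-rank dumps below were taken on the MDS hosts via the admin
socket, along these lines:

ceph daemon mds.osd44 dump_blocked_ops   # rank 0 ("mds.0" below)
ceph daemon mds.osd43 dump_blocked_ops   # rank 1 ("mds.1" below)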
mds.0
{
"ops": [
{
"description": "peer_request:mds.1:1",
"initiated_at": "2023-12-31T11:19:38.679925+0800",
"age": 4358.8009461359998,
"duration": 4358.8009636369998,
"type_data": {
"flag_point": "dispatched",
"reqid": "mds.1:1",
"op_type": "peer_request",
"leader_info": {
"leader": "1"
},
"events": [
{
"time": "2023-12-31T11:19:38.679925+0800",
"event": "initiated"
},
{
"time": "2023-12-31T11:19:38.679925+0800",
"event": "throttled"
},
{
"time": "2023-12-31T11:19:38.679925+0800",
"event": "header_read"
},
{
"time": "2023-12-31T11:19:38.679936+0800",
"event": "all_read"
},
{
"time": "2023-12-31T11:19:38.679940+0800",
"event": "dispatched"
}
]
}
},
{
"description": "peer_request:mds.1:2",
"initiated_at": "2023-12-31T11:19:38.679938+0800",
"age": 4358.8009326969996,
"duration": 4358.8009763549999,
"type_data": {
"flag_point": "dispatched",
"reqid": "mds.1:2",
"op_type": "peer_request",
"leader_info": {
"leader": "1"
},
"events": [
{
"time": "2023-12-31T11:19:38.679938+0800",
"event": "initiated"
},
{
"time": "2023-12-31T11:19:38.679938+0800",
"event": "throttled"
},
{
"time": "2023-12-31T11:19:38.679938+0800",
"event": "header_read"
},
{
"time": "2023-12-31T11:19:38.679941+0800",
"event": "all_read"
},
{
"time": "2023-12-31T11:19:38.679991+0800",
"event": "dispatched"
}
]
}
}
],
"complaint_time": 30,
"num_blocked_ops": 2
}
mds.1
{
"ops": [
{
"description": "internal op exportdir:mds.1:1",
"initiated_at": "2023-12-31T11:19:34.416451+0800",
"age": 4384.38814198,
"duration": 4384.3881617610004,
"type_data": {
"flag_point": "failed to wrlock, waiting",
"reqid": "mds.1:1",
"op_type": "internal_op",
"internal_op": 5377,
"op_name": "exportdir",
"events": [
{
"time": "2023-12-31T11:19:34.416451+0800",
"event": "initiated"
},
{
"time": "2023-12-31T11:19:34.416451+0800",
"event": "throttled"
},
{
"time": "2023-12-31T11:19:34.416451+0800",
"event": "header_read"
},
{
"time": "2023-12-31T11:19:34.416451+0800",
"event": "all_read"
},
{
"time": "2023-12-31T11:19:34.416451+0800",
"event": "dispatched"
},
{
"time": "2023-12-31T11:19:38.679923+0800",
"event": "requesting remote authpins"
},
{
"time": "2023-12-31T11:19:38.693981+0800",
"event": "failed to wrlock, waiting"
}
]
}
},
{
"description": "internal op exportdir:mds.1:2",
"initiated_at": "2023-12-31T11:19:34.416482+0800",
"age": 4384.3881117999999,
"duration": 4384.3881714600002,
"type_data": {
"flag_point": "failed to wrlock, waiting",
"reqid": "mds.1:2",
"op_type": "internal_op",
"internal_op": 5377,
"op_name": "exportdir",
"events": [
{
"time": "2023-12-31T11:19:34.416482+0800",
"event": "initiated"
},
{
"time": "2023-12-31T11:19:34.416482+0800",
"event": "throttled"
},
{
"time": "2023-12-31T11:19:34.416482+0800",
"event": "header_read"
},
{
"time": "2023-12-31T11:19:34.416482+0800",
"event": "all_read"
},
{
"time": "2023-12-31T11:19:34.416482+0800",
"event": "dispatched"
},
{
"time": "2023-12-31T11:19:38.679929+0800",
"event": "requesting remote authpins"
},
{
"time": "2023-12-31T11:19:38.693995+0800",
"event": "failed to wrlock, waiting"
}
]
}
}
],
"complaint_time": 30,
"num_blocked_ops": 2
}
So far I have not found any solution other than restarting the MDS
daemons that report slow requests.
Currently, the backlog of MDS journal (log) segments in the metadata
pool exceeds 4 TB.
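That figure is the usage of the CephFS metadata pool, as reported by,
for example:

ceph df detail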
Best regards,
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx