I've gone through the clients mentioned by the ops in flight and none of them are connected any more. The number of segments that the MDS is behind on is rising steadily and the ops_in_flight remain, this feels a lot like a catastrophe brewing. The documentation suggests trying to restart the MDS server, but the last time I did replay took two hours before any cephfs worked again, so I'd rather not risk that, if I can help it. Any hints are appreciated. # ceph health detail HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 1 MDSs behind on trimming MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs mds.dalmore(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 6046 secs MDS_SLOW_REQUEST 1 MDSs report slow requests mds.dalmore(mds.0): 14 slow requests are blocked > 30 secs MDS_TRIM 1 MDSs behind on trimming mds.dalmore(mds.0): Behind on trimming (4515/128) max_segments: 128, num_segments: 4515 On Wed, 21 Apr 2021 at 19:09, Flemming Frandsen <dren.dk@xxxxxxxxx> wrote: > I've just spent a couple of hours waiting for an MDS server to replay a > journal that it was behind on and it seems to be getting worse. > > The system is not terribly busy, but there are 14 ops in flight that are > very old and do not seem to go away on their own. > > Is there anything I can do to unwedge the mds server? > > > root@dalmore:~# ceph health detail > HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; > 1 MDSs behind on trimming > MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs > mds.dalmore(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest > blocked for 4107 secs > MDS_SLOW_REQUEST 1 MDSs report slow requests > mds.dalmore(mds.0): 14 slow requests are blocked > 30 secs > MDS_TRIM 1 MDSs behind on trimming > mds.dalmore(mds.0): Behind on trimming (3443/128) max_segments: 128, > num_segments: 3443 > > > root@dalmore:~# ceph daemon mds.dalmore dump_ops_in_flight > { > "ops": [ > { > "description": "client_request(client.9801215:348401522 readdir > #0x2001fba1c83 2021-04-21 15:46:30.820531 caller_uid=1000, > caller_gid=1000{})", > "initiated_at": "2021-04-21 15:46:30.822084", > "age": 4020.3439449050002, > "duration": 4020.343998116, > "type_data": { > "flag_point": "cleaned up request", > "reqid": "client.9801215:348401522", > "op_type": "client_request", > "client_info": { > "client": "client.9801215", > "tid": 348401522 > }, > "events": [ > { > "time": "2021-04-21 15:46:30.822084", > "event": "initiated" > }, > { > "time": "2021-04-21 15:46:30.822084", > "event": "header_read" > }, > { > "time": "2021-04-21 15:46:30.822090", > "event": "throttled" > }, > { > "time": "2021-04-21 15:46:30.822108", > "event": "all_read" > }, > { > "time": "2021-04-21 15:46:30.966336", > "event": "dispatched" > }, > { > "time": "2021-04-21 15:46:30.966419", > "event": "acquired locks" > }, > { > "time": "2021-04-21 16:01:40.084378", > "event": "killing request" > }, > { > "time": "2021-04-21 16:01:40.084438", > "event": "cleaned up request" > } > ] > } > }, > { > "description": "client_request(client.9801215:348454506 > setfilelock rule 1, type 4, owner 14336643358275911908, pid 32085, start 0, > length 0, wait 0 #0x1001e694623 20 > 21-04-21 15:55:50.829175 caller_uid=1000, caller_gid=1000{})", > "initiated_at": "2021-04-21 15:55:50.832432", > "age": 3460.3335972099999, > "duration": 3460.333751787, > "type_data": { > "flag_point": "cleaned up request", > "reqid": "client.9801215:348454506", > "op_type": "client_request", > "client_info": { > "client": "client.9801215", > "tid": 348454506 > }, > "events": [ > { > "time": "2021-04-21 15:55:50.832432", > "event": "initiated" > }, > { > "time": "2021-04-21 15:55:50.832432", > "event": "header_read" > }, > { > "time": "2021-04-21 15:55:50.832439", > "event": "throttled" > }, > { > "time": "2021-04-21 15:55:50.832465", > "event": "all_read" > }, > { > "time": "2021-04-21 15:55:50.832608", > "event": "dispatched" > }, > { > "time": "2021-04-21 16:01:40.084448", > "event": "killing request" > }, > { > "time": "2021-04-21 16:01:40.084451", > "event": "cleaned up request" > } > ] > } > }, > { > "description": "client_request(client.9143865:2383945773 > getattr pAsLsXsFs #0x10004f6a16b 2021-04-21 16:35:30.075110 caller_uid=0, > caller_gid=0{})", > "initiated_at": "2021-04-21 16:35:30.077995", > "age": 1081.0880337159999, > "duration": 1081.0882681850001, > "type_data": { > "flag_point": "failed to authpin, subtree is being > exported", > "reqid": "client.9143865:2383945773", > "op_type": "client_request", > "client_info": { > "client": "client.9143865", > "tid": 2383945773 > }, > "events": [ > { > "time": "2021-04-21 16:35:30.077995", > "event": "initiated" > }, > { > "time": "2021-04-21 16:35:30.077995", > "event": "header_read" > }, > { > "time": "2021-04-21 16:35:30.077999", > "event": "throttled" > }, > { > "time": "2021-04-21 16:35:30.078018", > "event": "all_read" > }, > { > "time": "2021-04-21 16:35:30.078091", > "event": "dispatched" > }, > { > "time": "2021-04-21 16:35:30.078156", > "event": "failed to authpin, subtree is being > exported" > } > ] > } > }, > { > "description": "client_request(client.9801215:348417261 lookup > #0x10004cd0675/tasks 2021-04-21 15:49:09.248696 caller_uid=1000, > caller_gid=1000{})", > "initiated_at": "2021-04-21 15:49:09.252834", > "age": 3861.9131950870001, > "duration": 3861.9135065690002, > "type_data": { > "flag_point": "cleaned up request", > "reqid": "client.9801215:348417261", > "op_type": "client_request", > "client_info": { > "client": "client.9801215", > "tid": 348417261 > }, > "events": [ > { > "time": "2021-04-21 15:49:09.252834", > "event": "initiated" > }, > { > "time": "2021-04-21 15:49:09.252834", > "event": "header_read" > }, > { > "time": "2021-04-21 15:49:09.252839", > "event": "throttled" > }, > { > "time": "2021-04-21 15:49:09.252856", > "event": "all_read" > }, > { > "time": "2021-04-21 15:49:09.252982", > "event": "dispatched" > }, > { > "time": "2021-04-21 15:49:09.253070", > "event": "failed to authpin, subtree is being > exported" > }, > { > "time": "2021-04-21 16:01:40.084439", > "event": "killing request" > }, > { > "time": "2021-04-21 16:01:40.084444", > "event": "cleaned up request" > } > ] > } > }, > { > "description": "client_request(client.9143865:2383945793 > getattr pAsLsXsFs #0x10004f6a16b 2021-04-21 16:40:33.868658 caller_uid=0, > caller_gid=0{})", > "initiated_at": "2021-04-21 16:40:33.871565", > "age": 777.29446343799998, > "duration": 777.29486169500001, > "type_data": { > "flag_point": "failed to authpin, subtree is being > exported", > "reqid": "client.9143865:2383945793", > "op_type": "client_request", > "client_info": { > "client": "client.9143865", > "tid": 2383945793 > }, > "events": [ > { > "time": "2021-04-21 16:40:33.871565", > "event": "initiated" > }, > { > "time": "2021-04-21 16:40:33.871565", > "event": "header_read" > }, > { > "time": "2021-04-21 16:40:33.871572", > "event": "throttled" > }, > { > "time": "2021-04-21 16:40:33.871617", > "event": "all_read" > }, > { > "time": "2021-04-21 16:40:33.871699", > "event": "dispatched" > }, > { > "time": "2021-04-21 16:40:33.871909", > "event": "failed to authpin, subtree is being > exported" > } > ] > } > }, > { > "description": "client_request(client.9801215:348424659 lookup > #0x10004cd0675/slaves 2021-04-21 15:50:01.592755 caller_uid=1000, > caller_gid=1000{})", > "initiated_at": "2021-04-21 15:50:01.595195", > "age": 3809.5708340870001, > "duration": 3809.5713271260001, > "type_data": { > "flag_point": "cleaned up request", > "reqid": "client.9801215:348424659", > "op_type": "client_request", > "client_info": { > "client": "client.9801215", > "tid": 348424659 > }, > "events": [ > { > "time": "2021-04-21 15:50:01.595195", > "event": "initiated" > }, > { > "time": "2021-04-21 15:50:01.595195", > "event": "header_read" > }, > { > "time": "2021-04-21 15:50:01.595198", > "event": "throttled" > }, > { > "time": "2021-04-21 15:50:01.595210", > "event": "all_read" > }, > { > "time": "2021-04-21 15:50:01.595307", > "event": "dispatched" > }, > { > "time": "2021-04-21 15:50:01.595389", > "event": "failed to authpin, subtree is being > exported" > }, > { > "time": "2021-04-21 16:01:40.084444", > "event": "killing request" > }, > { > "time": "2021-04-21 16:01:40.084448", > "event": "cleaned up request" > } > ] > } > }, > { > "description": "client_request(client.9801215:348456034 > setfilelock rule 1, type 4, owner 14336643483181153700, pid 32085, start 0, > length 0, wait 0 #0x1001e694623 20 > 21-04-21 15:56:20.849213 caller_uid=1000, caller_gid=1000{})", > "initiated_at": "2021-04-21 15:56:20.851803", > "age": 3430.3142257680001, > "duration": 3430.3148090620002, > "type_data": { > "flag_point": "cleaned up request", > "reqid": "client.9801215:348456034", > "op_type": "client_request", > "client_info": { > "client": "client.9801215", > "tid": 348456034 > }, > "events": [ > { > "time": "2021-04-21 15:56:20.851803", > "event": "initiated" > }, > { > "time": "2021-04-21 15:56:20.851803", > "event": "header_read" > }, > { > "time": "2021-04-21 15:56:20.851815", > "event": "throttled" > }, > { > "time": "2021-04-21 15:56:20.851853", > "event": "all_read" > }, > { > "time": "2021-04-21 15:56:20.851983", > "event": "dispatched" > }, > { > "time": "2021-04-21 16:01:40.084452", > "event": "killing request" > }, > { > "time": "2021-04-21 16:01:40.084454", > "event": "cleaned up request" > } > ] > } > }, > { > "description": "client_request(client.9801215:348456035 lookup > #0x10004cd0675/tasks 2021-04-21 15:56:20.849213 caller_uid=1000, > caller_gid=1000{})", > "initiated_at": "2021-04-21 15:56:20.851899", > "age": 3430.314129722, > "duration": 3430.3147965909998, > "type_data": { > "flag_point": "cleaned up request", > "reqid": "client.9801215:348456035", > "op_type": "client_request", > "client_info": { > "client": "client.9801215", > "tid": 348456035 > }, > "events": [ > { > "time": "2021-04-21 15:56:20.851899", > "event": "initiated" > }, > { > "time": "2021-04-21 15:56:20.851899", > "event": "header_read" > }, > { > "time": "2021-04-21 15:56:20.851900", > "event": "throttled" > }, > { > "time": "2021-04-21 15:56:20.851908", > "event": "all_read" > }, > { > "time": "2021-04-21 15:56:20.852165", > "event": "dispatched" > }, > { > "time": "2021-04-21 15:56:20.852254", > "event": "failed to authpin, subtree is being > exported" > }, > { > "time": "2021-04-21 16:01:40.084455", > "event": "killing request" > }, > { > "time": "2021-04-21 16:01:40.084457", > "event": "cleaned up request" > } > ] > } > }, > { > "description": "client_request(client.9801215:348456036 lookup > #0x10004cd0675/slaves 2021-04-21 15:56:20.849213 caller_uid=1000, > caller_gid=1000{})", > "initiated_at": "2021-04-21 15:56:20.879287", > "age": 3430.2867415639998, > "duration": 3430.287496596, > "type_data": { > "flag_point": "cleaned up request", > "reqid": "client.9801215:348456036", > "op_type": "client_request", > "client_info": { > "client": "client.9801215", > "tid": 348456036 > }, > "events": [ > { > "time": "2021-04-21 15:56:20.879287", > "event": "initiated" > }, > { > "time": "2021-04-21 15:56:20.879287", > "event": "header_read" > }, > { > "time": "2021-04-21 15:56:20.879292", > "event": "throttled" > }, > { > "time": "2021-04-21 15:56:20.879306", > "event": "all_read" > }, > { > "time": "2021-04-21 15:56:20.879408", > "event": "dispatched" > }, > { > "time": "2021-04-21 15:56:20.879502", > "event": "failed to authpin, subtree is being > exported" > }, > { > "time": "2021-04-21 16:01:40.084458", > "event": "killing request" > }, > { > "time": "2021-04-21 16:01:40.084460", > "event": "cleaned up request" > } > ] > } > }, > { > "description": "client_request(client.9801215:348456042 lookup > #0x10004cd0675/jul 2021-04-21 15:58:53.729411 caller_uid=1000, > caller_gid=1000{})", > "initiated_at": "2021-04-21 15:58:53.740318", > "age": 3277.4257106780001, > "duration": 3277.4265688219998, > "type_data": { > "flag_point": "cleaned up request", > "reqid": "client.9801215:348456042", > "op_type": "client_request", > "client_info": { > "client": "client.9801215", > "tid": 348456042 > }, > "events": [ > { > "time": "2021-04-21 15:58:53.740318", > "event": "initiated" > }, > { > "time": "2021-04-21 15:58:53.740318", > "event": "header_read" > }, > { > "time": "2021-04-21 15:58:53.740324", > "event": "throttled" > }, > { > "time": "2021-04-21 15:58:53.740345", > "event": "all_read" > }, > { > "time": "2021-04-21 15:58:53.743424", > "event": "dispatched" > }, > { > "time": "2021-04-21 15:58:53.750527", > "event": "failed to authpin, subtree is being > exported" > }, > { > "time": "2021-04-21 16:01:40.084465", > "event": "killing request" > }, > { > "time": "2021-04-21 16:01:40.084468", > "event": "cleaned up request" > } > ] > } > }, > { > "description": "client_request(client.9143865:2383914038 > getattr pAsLsXsFs #0x10004f6a16b 2021-04-21 16:29:35.332632 caller_uid=903, > caller_gid=1003{})", > "initiated_at": "2021-04-21 16:29:35.333838", > "age": 1435.8321910940001, > "duration": 1435.833136707, > "type_data": { > "flag_point": "failed to authpin, subtree is being > exported", > "reqid": "client.9143865:2383914038", > "op_type": "client_request", > "client_info": { > "client": "client.9143865", > "tid": 2383914038 > }, > "events": [ > { > "time": "2021-04-21 16:29:35.333838", > "event": "initiated" > }, > { > "time": "2021-04-21 16:29:35.333838", > "event": "header_read" > }, > { > "time": "2021-04-21 16:29:35.333841", > "event": "throttled" > }, > { > "time": "2021-04-21 16:29:35.333854", > "event": "all_read" > }, > { > "time": "2021-04-21 16:29:35.335013", > "event": "dispatched" > }, > { > "time": "2021-04-21 16:29:35.335091", > "event": "failed to authpin, subtree is being > exported" > } > ] > } > }, > { > "description": "client_request(client.10439445:7 lookup > #0x10004cd0675/jul 2021-04-21 16:26:17.334272 caller_uid=1000, > caller_gid=1000{})", > "initiated_at": "2021-04-21 16:26:17.352448", > "age": 1633.813581226, > "duration": 1633.8146036109999, > "type_data": { > "flag_point": "cleaned up request", > "reqid": "client.10439445:7", > "op_type": "client_request", > "client_info": { > "client": "client.10439445", > "tid": 7 > }, > "events": [ > { > "time": "2021-04-21 16:26:17.352448", > "event": "initiated" > }, > { > "time": "2021-04-21 16:26:17.352448", > "event": "header_read" > }, > { > "time": "2021-04-21 16:26:17.352450", > "event": "throttled" > }, > { > "time": "2021-04-21 16:26:17.352460", > "event": "all_read" > }, > { > "time": "2021-04-21 16:26:17.356901", > "event": "dispatched" > }, > { > "time": "2021-04-21 16:26:17.357337", > "event": "failed to authpin, subtree is being > exported" > }, > { > "time": "2021-04-21 16:38:08.054109", > "event": "killing request" > }, > { > "time": "2021-04-21 16:38:08.054120", > "event": "cleaned up request" > } > ] > } > }, > { > "description": "client_request(client.9801215:348456041 lookup > #0x10004cd0675/jul 2021-04-21 15:56:23.089216 caller_uid=1000, > caller_gid=1000{})", > "initiated_at": "2021-04-21 15:56:23.091531", > "age": 3428.0744974170002, > "duration": 3428.0756079930002, > "type_data": { > "flag_point": "cleaned up request", > "reqid": "client.9801215:348456041", > "op_type": "client_request", > "client_info": { > "client": "client.9801215", > "tid": 348456041 > }, > "events": [ > { > "time": "2021-04-21 15:56:23.091531", > "event": "initiated" > }, > { > "time": "2021-04-21 15:56:23.091531", > "event": "header_read" > }, > { > "time": "2021-04-21 15:56:23.091536", > "event": "throttled" > }, > { > "time": "2021-04-21 15:56:23.091555", > "event": "all_read" > }, > { > "time": "2021-04-21 15:56:23.091660", > "event": "dispatched" > }, > { > "time": "2021-04-21 15:56:23.091763", > "event": "failed to authpin, subtree is being > exported" > }, > { > "time": "2021-04-21 16:01:40.084461", > "event": "killing request" > }, > { > "time": "2021-04-21 16:01:40.084465", > "event": "cleaned up request" > } > ] > } > }, > { > "description": "client_request(client.9143865:2383885743 > getattr pAsLsXsFs #0x10004f6a16b 2021-04-21 16:26:36.713371 caller_uid=903, > caller_gid=1003{})", > "initiated_at": "2021-04-21 16:26:36.716418", > "age": 1614.4496113350001, > "duration": 1614.4508114350001, > "type_data": { > "flag_point": "failed to authpin, subtree is being > exported", > "reqid": "client.9143865:2383885743", > "op_type": "client_request", > "client_info": { > "client": "client.9143865", > "tid": 2383885743 > }, > "events": [ > { > "time": "2021-04-21 16:26:36.716418", > "event": "initiated" > }, > { > "time": "2021-04-21 16:26:36.716418", > "event": "header_read" > }, > { > "time": "2021-04-21 16:26:36.716421", > "event": "throttled" > }, > { > "time": "2021-04-21 16:26:36.716435", > "event": "all_read" > }, > { > "time": "2021-04-21 16:26:36.716526", > "event": "dispatched" > }, > { > "time": "2021-04-21 16:26:36.716632", > "event": "failed to authpin, subtree is being > exported" > } > ] > } > } > ], > "num_ops": 14 > } > > > > > > > -- > Flemming Frandsen - YAPH - http://osaa.dk - http://dren.dk/ > -- Flemming Frandsen - YAPH - http://osaa.dk - http://dren.dk/ _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx