A rogue process wrote 38M files into a single CephFS directory, and it took about a month to delete them all. We had to increase the MDS cache sizes to handle the increased file volume, but we have since been able to reduce all of our settings back to default. The Ceph cluster is on 15.2.11. The CephFS clients are ceph-fuse, either 14.2.16 or 15.2.11 depending on whether they've been upgraded yet. Nothing has changed in the last ~6 months with regard to client versions or the cluster version. Now that things seem to be cleaned up, we are dealing with 2 issues.

1. MDSs report slow requests. [1] Dumping the blocked requests gives the same output for all of them: they seemingly get stuck AFTER the op successfully acquires its locks. I can't find any information about what happens after this point or why requests are getting stuck there.

2. Clients failing to advance their oldest client/flush tid. There are 2 clients that are the worst offenders, but a few other clients are having the same issue. All of the affected clients are on 14.2.16, but we also have a hundred clients on that same version that don't have this issue at all. [2] The logs make it look like the clients just have a bad integer/pointer somehow. We can clear the error by remounting the filesystem or rebooting the server, but these 2 clients in particular keep ending up in this state. No other repeat offenders yet, although we've had 4 other servers in this state over the last couple of weeks.

Does anyone have ideas about next steps for diagnosing either of these issues? Thank you.

-David Turner

[1]
$ sudo ceph daemon mds.mon1 dump_blocked_ops
{
    "ops": [
        {
            "description": "client_request(client.17709580:39254 open #0x10001c99cd4 2022-02-22T16:25:40.231547+0000 caller_uid=0, caller_gid=0{})",
            "initiated_at": "2022-04-19T19:07:10.663552+0000",
            "age": 90.920778446,
            "duration": 90.920806244000005,
            "type_data": {
                "flag_point": "acquired locks",
                "reqid": "client.17709580:39254",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.17709580",
                    "tid": 39254
                },
                "events": [
                    {
                        "time": "2022-04-19T19:07:10.663552+0000",
                        "event": "initiated"
                    },
                    {
                        "time": "2022-04-19T19:07:10.663549+0000",
                        "event": "throttled"
                    },
                    {
                        "time": "2022-04-19T19:07:10.663552+0000",
                        "event": "header_read"
                    },
                    {
                        "time": "2022-04-19T19:07:10.663555+0000",
                        "event": "all_read"
                    },
                    {
                        "time": "2022-04-19T19:07:10.665744+0000",
                        "event": "dispatched"
                    },
                    {
                        "time": "2022-04-19T19:07:10.773894+0000",
                        "event": "failed to xlock, waiting"
                    },
                    {
                        "time": "2022-04-19T19:07:10.807249+0000",
                        "event": "acquired locks"
                    }
                ]
            }
        },

[2]
2022-04-19 06:15:36.108 7fb28b7fe700 0 client.30095002 handle_cap_flush_ack mds.1 got unexpected flush ack tid 338611 expected is 0
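
P.S. In case it's useful, a rough one-liner (assuming jq is available on the MDS host, and using the same daemon name as in [1]) to confirm that every blocked op is parked at the same flag_point:

$ # print reqid, flag_point, and age (seconds) for each blocked op
$ sudo ceph daemon mds.mon1 dump_blocked_ops | \
    jq -r '.ops[] | [.type_data.reqid, .type_data.flag_point, .age] | @tsv'

For us this prints one line per blocked op, and the flag_point is "acquired locks" on every one of them.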