This looks awkward. Just from the ops, it seems mds.1 is trying to move some stray items (presumably snapshots of since-deleted files, from what you said?) into mds.0's stray directory, and then mds.0 tries to get auth pins from mds.1, but that fails for some reason which isn't apparent from the dump.

Somebody might be able to get farther along by tracing logs of mds.1 rebooting, but my guess is that rebooting both servers will clear it up. You might also try increasing max_mds to 2 and seeing if that jogs things loose; I'm not sure what would be less disruptive for you.
-Greg

On Mon, Dec 13, 2021 at 5:37 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi all,
>
> I needed to reduce the number of active MDS daemons from 4 to 1. Unfortunately, the last MDS to stop is stuck in stopping state. Ceph version is mimic 13.2.10. Each MDS has 3 blocked ops that seem to be related to deleted snapshots; more info below. I have already failed the MDS in stopping state several times in the hope that the operations would get flushed out. Before failing rank 0, I would appreciate it if someone could look at this issue and advise on how to proceed safely.
>
> Some diagnostic info:
>
> # ceph fs status
> con-fs2 - 1659 clients
> =======
> +------+----------+---------+---------------+-------+-------+
> | Rank |  State   |   MDS   |    Activity   |  dns  |  inos |
> +------+----------+---------+---------------+-------+-------+
> |  0   |  active  | ceph-08 | Reqs: 176 /s  | 2844k | 2775k |
> |  1   | stopping | ceph-17 |               | 27.7k |   59  |
> +------+----------+---------+---------------+-------+-------+
> +---------------------+----------+-------+-------+
> |         Pool        |   type   |  used | avail |
> +---------------------+----------+-------+-------+
> |    con-fs2-meta1    | metadata |  555M | 1261G |
> |    con-fs2-meta2    |   data   |    0  | 1261G |
> |     con-fs2-data    |   data   | 1321T | 5756T |
> | con-fs2-data-ec-ssd |   data   |  252G | 4035G |
> |    con-fs2-data2    |   data   |  389T | 5233T |
> +---------------------+----------+-------+-------+
> +-------------+
> | Standby MDS |
> +-------------+
> |   ceph-09   |
> |   ceph-24   |
> |   ceph-14   |
> |   ceph-16   |
> |   ceph-12   |
> |   ceph-23   |
> |   ceph-10   |
> |   ceph-15   |
> |   ceph-13   |
> |   ceph-11   |
> +-------------+
> MDS version: ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)
>
> # ceph status
>   cluster:
>     id:
>     health: HEALTH_WARN
>             2 MDSs report slow requests
>
>   services:
>     mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26
>     mgr: ceph-01(active), standbys: ceph-02, ceph-03, ceph-25, ceph-26
>     mds: con-fs2-2/2/1 up {0=ceph-08=up:active,1=ceph-17=up:stopping}, 10 up:standby
>     osd: 1051 osds: 1050 up, 1050 in
>
>   data:
>     pools: 13 pools, 17374 pgs
>     objects: 1.01 G objects, 1.9 PiB
>     usage: 2.3 PiB used, 9.2 PiB / 11 PiB avail
>     pgs: 17352 active+clean
>          20 active+clean+scrubbing+deep
>          2 active+clean+scrubbing
>
>   io:
>     client: 129 MiB/s rd, 175 MiB/s wr, 2.57 kop/s rd, 2.77 kop/s wr
>
> # ceph health detail
> HEALTH_WARN 2 MDSs report slow requests
> MDS_SLOW_REQUEST 2 MDSs report slow requests
>     mdsceph-08(mds.0): 3 slow requests are blocked > 30 secs
>     mdsceph-17(mds.1): 3 slow requests are blocked > 30 secs
>
> # ssh ceph-08 ceph daemon mds.ceph-08 dump_blocked_ops
> {
>     "ops": [
>         {
>             "description": "client_request(mds.1:126521 rename #0x100/stray5/1000eec35f7 #0x101/stray5/1000eec35f7 caller_uid=0, caller_gid=0{})",
>             "initiated_at": "2021-12-13 13:08:59.430597",
>             "age": 5034.983083,
>             "duration": 5034.983109,
>             "type_data": {
>                 "flag_point": "acquired locks",
>                 "reqid": "mds.1:126521",
>                 "op_type": "client_request",
>                 "client_info": {
>                     "client": "mds.1",
>                     "tid": 126521
>                 },
>                 "events": [
>                     {
>                         "time": "2021-12-13 13:08:59.430597",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2021-12-13 13:08:59.430597",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2021-12-13 13:08:59.430597",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2021-12-13 13:08:59.430601",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2021-12-13 13:09:00.730197",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2021-12-13 13:09:01.517306",
>                         "event": "requesting remote authpins"
>                     },
>                     {
>                         "time": "2021-12-13 13:09:01.557219",
>                         "event": "failed to xlock, waiting"
>                     },
>                     {
>                         "time": "2021-12-13 13:09:01.647692",
>                         "event": "failed to wrlock, waiting"
>                     },
>                     {
>                         "time": "2021-12-13 13:09:01.663629",
>                         "event": "waiting for remote wrlocks"
>                     },
>                     {
>                         "time": "2021-12-13 13:09:01.673789",
>                         "event": "waiting for remote wrlocks"
>                     },
>                     {
>                         "time": "2021-12-13 13:09:01.676523",
>                         "event": "failed to xlock, waiting"
>                     },
>                     {
>                         "time": "2021-12-13 13:09:01.691962",
>                         "event": "failed to xlock, waiting"
>                     },
>                     {
>                         "time": "2021-12-13 13:09:01.704202",
>                         "event": "acquired locks"
>                     }
>                 ]
>             }
>         },
>         {
>             "description": "client_request(mds.1:1 rename #0x100/stray5/1000eec35f7 #0x101/stray5/1000eec35f7 caller_uid=0, caller_gid=0{})",
>             "initiated_at": "2021-12-13 13:31:56.260453",
>             "age": 3658.153227,
>             "duration": 3658.153337,
>             "type_data": {
>                 "flag_point": "requesting remote authpins",
>                 "reqid": "mds.1:1",
>                 "op_type": "client_request",
>                 "client_info": {
>                     "client": "mds.1",
>                     "tid": 1
>                 },
>                 "events": [
>                     {
>                         "time": "2021-12-13 13:31:56.260453",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2021-12-13 13:31:56.260453",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2021-12-13 13:31:56.260454",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2021-12-13 13:31:56.260461",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2021-12-13 13:31:56.260511",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2021-12-13 13:31:56.260604",
>                         "event": "requesting remote authpins"
>                     }
>                 ]
>             }
>         },
>         {
>             "description": "client_request(mds.1:993 rename #0x100/stray5/1000eec35f7 #0x101/stray5/1000eec35f7 caller_uid=0, caller_gid=0{})",
>             "initiated_at": "2021-12-13 13:15:31.979997",
>             "age": 4642.433683,
>             "duration": 4642.433850,
>             "type_data": {
>                 "flag_point": "requesting remote authpins",
>                 "reqid": "mds.1:993",
>                 "op_type": "client_request",
>                 "client_info": {
>                     "client": "mds.1",
>                     "tid": 993
>                 },
>                 "events": [
>                     {
>                         "time": "2021-12-13 13:15:31.979997",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2021-12-13 13:15:31.979997",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2021-12-13 13:15:31.979998",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2021-12-13 13:15:31.980003",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2021-12-13 13:15:31.980079",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2021-12-13 13:15:31.980174",
>                         "event": "requesting remote authpins"
>                     },
>                     {
>                         "time": "2021-12-13 13:31:50.634734",
>                         "event": "requesting remote authpins"
>                     }
>                 ]
>             }
>         }
>     ],
>     "complaint_time": 30.000000,
>     "num_blocked_ops": 3
> }
>
> # ssh ceph-17 ceph daemon mds.ceph-17 dump_blocked_ops
> {
>     "ops": [
>         {
>             "description": "rejoin:mds.1:126521",
>             "initiated_at": "2021-12-13 13:30:34.602931",
>             "age": 3791.164314,
>             "duration": 3791.164335,
>             "type_data": {
>                 "flag_point": "dispatched",
>                 "reqid": "mds.1:126521",
>                 "op_type": "no_available_op_found",
>                 "events": [
>                     {
>                         "time": "2021-12-13 13:30:34.602931",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2021-12-13 13:30:34.602931",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2021-12-13 13:30:34.602932",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2021-12-13 13:30:34.602978",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2021-12-13 13:30:34.605856",
>                         "event": "dispatched"
>                     }
>                 ]
>             }
>         },
>         {
>             "description": "slave_request(mds.1:993.0 authpin)",
>             "initiated_at": "2021-12-13 13:31:50.634857",
>             "age": 3715.132388,
>             "duration": 3715.132451,
>             "type_data": {
>                 "flag_point": "dispatched",
>                 "reqid": "mds.1:993",
>                 "op_type": "slave_request",
>                 "master_info": {
>                     "master": "mds.0"
>                 },
>                 "request_info": {
>                     "attempt": 0,
>                     "op_type": "authpin",
>                     "lock_type": 0,
>                     "object_info": "0x1000eec35f7.head",
>                     "srcdnpath": "",
>                     "destdnpath": "",
>                     "witnesses": "",
>                     "has_inode_export": false,
>                     "inode_export_v": 0,
>                     "op_stamp": "0.000000"
>                 },
>                 "events": [
>                     {
>                         "time": "2021-12-13 13:31:50.634857",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2021-12-13 13:31:50.634857",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2021-12-13 13:31:50.634858",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2021-12-13 13:31:50.634867",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2021-12-13 13:31:50.634893",
>                         "event": "dispatched"
>                     }
>                 ]
>             }
>         },
>         {
>             "description": "slave_request(mds.1:1.0 authpin)",
>             "initiated_at": "2021-12-13 13:31:56.260729",
>             "age": 3709.506516,
>             "duration": 3709.506631,
>             "type_data": {
>                 "flag_point": "dispatched",
>                 "reqid": "mds.1:1",
>                 "op_type": "slave_request",
>                 "master_info": {
>                     "master": "mds.0"
>                 },
>                 "request_info": {
>                     "attempt": 0,
>                     "op_type": "authpin",
>                     "lock_type": 0,
>                     "object_info": "0x1000eec35f7.head",
>                     "srcdnpath": "",
>                     "destdnpath": "",
>                     "witnesses": "",
>                     "has_inode_export": false,
>                     "inode_export_v": 0,
>                     "op_stamp": "0.000000"
>                 },
>                 "events": [
>                     {
>                         "time": "2021-12-13 13:31:56.260729",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2021-12-13 13:31:56.260729",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2021-12-13 13:31:56.260731",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2021-12-13 13:31:56.260743",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2021-12-13 13:31:56.264063",
>                         "event": "dispatched"
>                     }
>                 ]
>             }
>         }
>     ],
>     "complaint_time": 30.000000,
>     "num_blocked_ops": 3
> }
>
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
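The dump_blocked_ops output quoted above is long, and the useful part is mostly which flag_point each op is stuck at and what its last event was. A minimal sketch of condensing such a dump follows, assuming the JSON has the same shape as the quoted output. The helper name `summarize_blocked_ops` is hypothetical and not part of any Ceph tooling; the sample data is copied from the mds.ceph-08 dump with each op's events trimmed to the first and last entry.

```python
import json
from collections import defaultdict

# Sample in the shape of `ceph daemon mds.<name> dump_blocked_ops` output.
# Values are copied from the mds.ceph-08 dump quoted above; the "events"
# lists are trimmed to the first and last entry to keep the sample short.
SAMPLE = """
{
  "ops": [
    {"description": "client_request(mds.1:126521 rename #0x100/stray5/1000eec35f7 #0x101/stray5/1000eec35f7 caller_uid=0, caller_gid=0{})",
     "age": 5034.983083,
     "type_data": {"flag_point": "acquired locks", "reqid": "mds.1:126521",
       "events": [{"time": "2021-12-13 13:08:59.430597", "event": "initiated"},
                  {"time": "2021-12-13 13:09:01.704202", "event": "acquired locks"}]}},
    {"description": "client_request(mds.1:1 rename #0x100/stray5/1000eec35f7 #0x101/stray5/1000eec35f7 caller_uid=0, caller_gid=0{})",
     "age": 3658.153227,
     "type_data": {"flag_point": "requesting remote authpins", "reqid": "mds.1:1",
       "events": [{"time": "2021-12-13 13:31:56.260453", "event": "initiated"},
                  {"time": "2021-12-13 13:31:56.260604", "event": "requesting remote authpins"}]}},
    {"description": "client_request(mds.1:993 rename #0x100/stray5/1000eec35f7 #0x101/stray5/1000eec35f7 caller_uid=0, caller_gid=0{})",
     "age": 4642.433683,
     "type_data": {"flag_point": "requesting remote authpins", "reqid": "mds.1:993",
       "events": [{"time": "2021-12-13 13:15:31.979997", "event": "initiated"},
                  {"time": "2021-12-13 13:31:50.634734", "event": "requesting remote authpins"}]}}
  ],
  "complaint_time": 30.0,
  "num_blocked_ops": 3
}
"""


def summarize_blocked_ops(dump):
    """Group blocked ops by flag_point (hypothetical helper, not a Ceph API)."""
    groups = defaultdict(list)
    for op in dump.get("ops", []):
        type_data = op.get("type_data", {})
        events = type_data.get("events", [])
        groups[type_data.get("flag_point", "unknown")].append({
            "reqid": type_data.get("reqid"),
            "age_s": op.get("age", 0.0),
            # The last recorded event is where the op is currently waiting.
            "last_event": events[-1]["event"] if events else None,
        })
    return dict(groups)


if __name__ == "__main__":
    summary = summarize_blocked_ops(json.loads(SAMPLE))
    for flag_point, ops in summary.items():
        print(f"{flag_point}: {len(ops)} op(s)")
        # Oldest ops first, since those have been stuck the longest.
        for op in sorted(ops, key=lambda o: o["age_s"], reverse=True):
            print(f"  {op['reqid']}  age={op['age_s']:.0f}s  last event: {op['last_event']}")
```

To use it on a live daemon, one option is to save the output of the `ceph daemon mds.ceph-08 dump_blocked_ops` command shown above to a file and pass the parsed JSON to the helper instead of `SAMPLE`.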