Short story:
------------

We have a new HPC installation with file systems provided by cephfs (home, apps, ...). We have one cephfs and all client file systems are sub-directory mounts. On this ceph file system we have a bit more than 500 nodes with currently 2 ceph fs mounts each, resulting in a bit over 1000 client connections.

Today we ran a benchmark on the HPC nodes, which somehow crashed our MDS setup as a by-product. We are currently on "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)". We have the fs meta-data on SSD and the data on an EC data pool on HDD. The ceph servers are running kernel v5.0 while the clients are on v3.10. It looks like all important ceph fs kernel client fixes are back-ported (not all, though). Everything is on mimic. If necessary, we will provide more details, which we omitted here in favour of focusing on the quite long event log below.

What we observed after starting the benchmark was between 10000-20000 read OPs, probably caused by the application (running on all 500+ nodes) repeatedly accessing the same file or the same few files. For a while this more or less worked, until operations started piling up and never got flushed.

After quite a bit of work it looks like we got our cluster back, and in the process we discovered a (serious?) problem with an MDS daemon pair set up as active/standby-replay. The main questions in this e-mail are basically (please see the long story below for the clue we found):

1) Is there a bug with having MDS daemons acting as standby-replay?
2) Is there a manual way to cancel an "internal op fragmentdir"?

Our MDS configuration section looked like this before the incident:

    [mds.ceph-08]
    mds standby replay = true
    mds standby for rank = 0
    mds standby for fscid = 1

    [mds.ceph-12]
    mds standby replay = true
    mds standby for rank = 0
    mds standby for fscid = 1

We changed it to this one until further information on the issue:

    [mds.ceph-08]
    mds standby replay = false
    #mds standby for rank = 0
    #mds standby for fscid = 1

    [mds.ceph-12]
    mds standby replay = false
    #mds standby for rank = 0
    #mds standby for fscid = 1

Long Story:
-----------

Disclaimer: I was a bit under stress getting our cluster back. Unfortunately, I did not take copies of all potentially relevant information and have to recall some from memory. We are still in a phase where we can execute tests to collect more data - if necessary. We will certainly stress test the active/standby configuration with the same benchmark.

It looks like there might be a bug where the standby-replay daemon is regarded as active and a fragmentdir operation gets scheduled but never executed. After this internal request is in the pipeline, operations keep piling up without ever being executed. The IO to the files accessed by the benchmark stopped.

Here is a protocol of what we observed and did:

- We started the benchmark and observed large IOPs (10-20K ops read) with low MB/s (300-500MB/s read). Very little write going on.

- At some point we started seeing the health warning "1 MDSs report slow requests" and a large number of OPS blocked (NNNN - some 4-digit number):

    MDS_SLOW_REQUEST 1 MDSs report slow requests
        mdsceph-12(mds.0): NNNN slow requests are blocked > 30 secs

  with the number NNNN increasing fast.
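  For reference, this is roughly how we watched the situation while it developed (daemon name as in our setup, recalled from memory, so please treat it as a sketch rather than an exact transcript):

    # cluster-wide view, including the MDS_SLOW_REQUEST / MDS_TRIM details
    ceph status
    ceph health detail

    # number of ops currently in flight on the active MDS (via its admin socket)
    ceph daemon mds.ceph-12 ops | grep description | wc -l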
- Some time later, we saw trimming messages in addition:

    MDS_TRIM 2 MDSs behind on trimming
        mdsceph-12(mds.0): Behind on trimming (292/128) max_segments: 128, num_segments: 292
        mdsceph-08(mds.0): Behind on trimming (292/128) max_segments: 128, num_segments: 292

  with num_segments increasing slowly.

- That's when we started taking action, since we noted that the old ops were never executed and ops would probably just keep piling up.

- We cancelled the benchmark job that was causing the IO load and waited to see if ceph would start recovering. I should note that this problem had already been triggered by earlier runs of the same benchmark, and back then the health issues did disappear after some time. However, in those cases the user terminated the job as soon as he noticed strange behaviour - much, much earlier than in the supervised test described here, where we wanted to see how ceph acts under a sustained load of this type (users do have job accidents :).

- We observed that this time ceph was not getting better. The number of stuck operations remained at this level:

    MDS_SLOW_REQUEST 1 MDSs report slow requests
        mdsceph-12(mds.0): 12051 slow requests are blocked > 30 secs
    MDS_TRIM 2 MDSs behind on trimming
        mdsceph-12(mds.0): Behind on trimming (292/128) max_segments: 128, num_segments: 292
        mdsceph-08(mds.0): Behind on trimming (292/128) max_segments: 128, num_segments: 292

- So, we had to do something. After trying a number of less invasive things, like remounting the ceph fs on a client node, to no avail, we decided to take more drastic measures.

- We evicted all clients and observed that more and more OPS got marked as "cancelled" and "cleaned up". During the eviction process, these warnings came up temporarily:

    mdsceph-12(mds.0): Client sn371.hpc.ait.dtu.dk:con-fs-hpc failing to respond to capability release client_id: 226263
    mdsceph-12(mds.0): Client sn322.hpc.ait.dtu.dk:con-fs-hpc failing to respond to capability release client_id: 242672
    mdsceph-12(mds.0): Client sn339.hpc.ait.dtu.dk:con-fs-hpc failing to respond to capability release client_id: 242708
    mdsceph-12(mds.0): Client sn335.hpc.ait.dtu.dk:con-fs-hpc failing to respond to capability release client_id: 272728

  I'm aware that there is/was a kernel (client) bug that might still be present in 3.10. However, I have not seen this warning in any other context so far. I did test pushing clients to release caps by changing the MDS cache memory limit at run time, and the clients responded by dropping caps as needed. I have never seen the MDS exceed the cache memory limit.

- After kicking out all clients, we ended up with this result:

    HEALTH_WARN 1 MDSs report slow requests; 2 MDSs behind on trimming
    MDS_SLOW_REQUEST 1 MDSs report slow requests
        mdsceph-12(mds.0): 12051 slow requests are blocked > 30 secs
    MDS_TRIM 2 MDSs behind on trimming
        mdsceph-12(mds.0): Behind on trimming (294/128) max_segments: 128, num_segments: 294
        mdsceph-08(mds.0): Behind on trimming (294/128) max_segments: 128, num_segments: 294

    [root@ceph-mds:ceph-12 /]# ceph daemon mds.ceph-12 ops | grep description | wc -l
    12051
    [root@ceph-mds:ceph-12 /]# ceph daemon mds.ceph-12 ops | grep killing | wc -l
    12050

  So, there was 1 stuck operation that was not client related.
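In case anyone wants to reproduce the filtering described next: a minimal sketch, assuming that ordinary client ops show up in the ops dump with a "client_request(...)" description while internal ops do not (at the time I simply chained a couple of "grep -v" patterns instead):

    # show op descriptions that are not ordinary client requests
    ceph daemon mds.ceph-12 ops | grep description | grep -v client_request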
I started filtering the descriptions with "grep -v" and got a hit:

    [root@ceph-mds:ceph-12 /]# ceph daemon mds.ceph-12 ops | grep "internal op fragmentdir" -A 100
            "description": "internal op fragmentdir:mds.0:169119",
            "initiated_at": "2019-05-13 11:17:52.002507",
            "age": 12796.457152,
            "duration": 12796.666273,
            "type_data": {
                "flag_point": "dispatched",
                "reqid": "mds.0:169119",
                "op_type": "internal_op",
                "internal_op": 5376,
                "op_name": "fragmentdir",
                "events": [
                    {
                        "time": "2019-05-13 11:17:52.002507",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-05-13 11:17:52.002507",
                        "event": "header_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled"
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched"
                    }
                ]
            }
        },

  The information I found searching the web indicates that this is something that happens in an active/active set-up, where one active MDS requests another active MDS to take over part of the meta-data operations. In our case, there was no MDS to accept that request. Our hypothesis is that the standby-replay daemon was erroneously seen as active.

  I should add here that (all?) the blocked operations had as their last entry something like "waiting for directory fragmentation" or "pending ...". I forgot to copy the message and have to recall that from memory. In any case, I had already suspected a bit earlier that something like the above operation was stuck in the queue and holding everything up.

- Since all other OPS were cancelled, we assumed it would be safe at this point to restart MDS daemons without losing relevant data. The idea for getting rid of the pending dirfrag was to restart the standby-replay daemon as standby (only) and see what happens. If things looked good, we would fail over the active daemon and restart this one also with a standby (only) configuration, as indicated above by the modified MDS section in ceph.conf. Indeed, this procedure flushed the dirfrag out of the queue, and all other operations with it. The cluster came back healthy and I do not think we lost any relevant IO.

- I should mention that ceph fs actually kept working fine for anyone except the user running the benchmark. Only IO to particular files/a particular directory stopped, so this problem seems to remain isolated. Also, the load on the servers was not high during the test, the fs remained responsive to other users, and the MDS daemons never crashed. There was no fail-over except the ones we triggered manually.

As mentioned above, we can do some more testing within reason. We already have pilot users on the system and we need to keep it sort of working.

Best regards and thanks in advance for any help on getting a stable active/standby-replay config working. And for reading all that.

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com