John Spray <jspray@...> writes:

> Anyway -- you'll need to do some local poking of the MDS to work out
> what the hold up is. Turn up MDS debug logging[1] and see what
> it's saying during the replay. Also, you can use performance counters
> "ceph daemon mds.<id> perf dump" and see which are incrementing to get
> an idea of what it's doing. The "rd_pos" value from "perf dump
> mds_log" should increment during replay. If you haven't already, also
> check the overall health of the MDS host, e.g. is it low on
> memory/swapping?
>
> John
>
> 1. http://docs.ceph.com/docs/master/rados/configuration/ceph-conf/#runtime-changes
>

Hi John,

Thanks for your help. There doesn't seem to be anything obviously wrong
with any of the three MDSes. They're idle and have plenty of free memory
and disk space.

I briefly stopped the MDS that was active, and one of the others took
over and resumed replaying.

Looking at the "rd_pos" on the active MDS, it's sitting at a large
number (226119121516) and hasn't changed in ten minutes or so.
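For anyone wanting to repeat the "is rd_pos moving?" check without watching it by hand, here is a rough sketch of how it could be scripted. It assumes that "ceph daemon mds.<id> perf dump" prints JSON with a top-level "mds_log" section, and that the counter key is spelled "rd_pos" as in this thread; the exact key name may differ between Ceph versions, so both are parameters. The function names are made up for illustration and are not part of Ceph.

```python
import json
import subprocess
import time


def read_counter(perf_dump_text, section="mds_log", name="rd_pos"):
    """Pull one counter out of the JSON text printed by `perf dump`.

    The section and key names are assumptions taken from the thread;
    adjust them to whatever your Ceph version actually reports.
    """
    data = json.loads(perf_dump_text)
    return data[section][name]


def is_advancing(samples):
    """Return True if any two consecutive readings differ."""
    return any(b != a for a, b in zip(samples, samples[1:]))


def sample_counter(mds_id, count=5, interval=2.0,
                   section="mds_log", name="rd_pos"):
    """Run `ceph daemon mds.<id> perf dump` `count` times and collect readings.

    Must be run on the host where that MDS's admin socket lives.
    """
    samples = []
    for i in range(count):
        out = subprocess.check_output(
            ["ceph", "daemon", "mds.%s" % mds_id, "perf", "dump"]
        )
        samples.append(read_counter(out.decode("utf-8"), section, name))
        if i < count - 1:
            time.sleep(interval)
    return samples
```

Usage would be something like is_advancing(sample_counter("<id>")), with <id> replaced by the MDS name; a False result over a reasonably long window matches the stuck behaviour described above.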
I've turned on debugging with "ceph tell mds.0 injectargs --debug-mds 20
--debug-ms 1 --debug-journaler=10", and here's a sample of what I see in
the MDS log for the active MDS:

2015-12-15 13:21:55.507212 7fbe08aa2700 10 mds.beacon.2 _send up:replay seq 1525
2015-12-15 13:21:55.507238 7fbe08aa2700 1 -- 192.168.1.33:6800/13115 --> 192.168.1.33:6789/0 -- mdsbeacon(46356419/2 up:replay seq 1525 v78698) v3 -- ?+0 0x4e0f800 con 0x4118dc0
2015-12-15 13:21:55.508577 7fbe0c3aa700 1 -- 192.168.1.33:6800/13115 <== mon.2 192.168.1.33:6789/0 1592 ==== mdsbeacon(46356419/2 up:replay seq 1525 v78698) v3 ==== 113+0+0 (2137559923 0 0) 0x7e06300 con 0x4118dc0
2015-12-15 13:21:55.508608 7fbe0c3aa700 10 mds.beacon.2 handle_mds_beacon up:replay seq 1525 rtt 0.001382
2015-12-15 13:21:59.484680 7fbe092a3700 10 MDSInternalContextBase::complete: N3MDS10C_MDS_TickE
2015-12-15 13:21:59.484740 7fbe092a3700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.13>
2015-12-15 13:21:59.507303 7fbe08aa2700 10 mds.beacon.2 _send up:replay seq 1526
2015-12-15 13:21:59.507329 7fbe08aa2700 1 -- 192.168.1.33:6800/13115 --> 192.168.1.33:6789/0 -- mdsbeacon(46356419/2 up:replay seq 1526 v78698) v3 -- ?+0 0x4e0fb00 con 0x4118dc0
2015-12-15 13:21:59.508536 7fbe0c3aa700 1 -- 192.168.1.33:6800/13115 <== mon.2 192.168.1.33:6789/0 1593 ==== mdsbeacon(46356419/2 up:replay seq 1526 v78698) v3 ==== 113+0+0 (4265335703 0 0) 0x7e06000 con 0x4118dc0
2015-12-15 13:21:59.508581 7fbe0c3aa700 10 mds.beacon.2 handle_mds_beacon up:replay seq 1526 rtt 0.001264
2015-12-15 13:21:59.510012 7fbe0cbab700 1 -- 192.168.1.33:6800/13115 --> 192.168.1.22:6812/5403 -- ping magic: 0 v1 -- ?+0 0x417e900 con 0x4c60580
2015-12-15 13:22:03.507394 7fbe08aa2700 10 mds.beacon.2 _send up:replay seq 1527
2015-12-15 13:22:03.507420 7fbe08aa2700 1 -- 192.168.1.33:6800/13115 --> 192.168.1.33:6789/0 -- mdsbeacon(46356419/2 up:replay seq 1527 v78698) v3 -- ?+0 0x4de8000 con 0x4118dc0
2015-12-15 13:22:03.508767 7fbe0c3aa700 1 -- 192.168.1.33:6800/13115 <== mon.2 192.168.1.33:6789/0 1594 ==== mdsbeacon(46356419/2 up:replay seq 1527 v78698) v3 ==== 113+0+0 (2164974539 0 0) 0x7e05d00 con 0x4118dc0
2015-12-15 13:22:03.508799 7fbe0c3aa700 10 mds.beacon.2 handle_mds_beacon up:replay seq 1527 rtt 0.001390
2015-12-15 13:22:04.484781 7fbe092a3700 10 MDSInternalContextBase::complete: N3MDS10C_MDS_TickE
2015-12-15 13:22:04.484841 7fbe092a3700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.12>
2015-12-15 13:22:04.510142 7fbe0cbab700 1 -- 192.168.1.33:6800/13115 --> 192.168.1.22:6812/5403 -- ping magic: 0 v1 -- ?+0 0x417eac0 con 0x4c60580
2015-12-15 13:22:07.507485 7fbe08aa2700 10 mds.beacon.2 _send up:replay seq 1528
2015-12-15 13:22:07.507511 7fbe08aa2700 1 -- 192.168.1.33:6800/13115 --> 192.168.1.33:6789/0 -- mdsbeacon(46356419/2 up:replay seq 1528 v78698) v3 -- ?+0 0x4de8300 con 0x4118dc0
2015-12-15 13:22:07.508757 7fbe0c3aa700 1 -- 192.168.1.33:6800/13115 <== mon.2 192.168.1.33:6789/0 1595 ==== mdsbeacon(46356419/2 up:replay seq 1528 v78698) v3 ==== 113+0+0 (248276317 0 0) 0x7e05a00 con 0x4118dc0
2015-12-15 13:22:07.508788 7fbe0c3aa700 10 mds.beacon.2 handle_mds_beacon up:replay seq 1528 rtt 0.001288
2015-12-15 13:22:09.484881 7fbe092a3700 10 MDSInternalContextBase::complete: N3MDS10C_MDS_TickE
2015-12-15 13:22:09.484957 7fbe092a3700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.11>
2015-12-15 13:22:09.510286 7fbe0cbab700 1 -- 192.168.1.33:6800/13115 --> 192.168.1.22:6812/5403 -- ping magic: 0 v1 -- ?+0 0x417ec80 con 0x4c60580
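In case it helps when skimming a longer log, here is a small sketch that pulls the beacon "_send" entries out of lines like the ones in the sample, so the reported state and sequence numbers can be listed at a glance. The regular expression is inferred purely from the lines pasted in this message, not from any documented log format, and the function names are hypothetical.

```python
import re

# Matches lines such as:
#   "... 10 mds.beacon.2 _send up:replay seq 1525"
# The pattern is an assumption based only on the sample in this message.
BEACON_SEND_RE = re.compile(r"mds\.beacon\.\S+ _send (\S+) seq (\d+)")


def beacon_sends(log_lines):
    """Return a list of (state, seq) for every beacon _send line found."""
    found = []
    for line in log_lines:
        match = BEACON_SEND_RE.search(line)
        if match:
            found.append((match.group(1), int(match.group(2))))
    return found


def states_seen(log_lines):
    """Return the sorted set of states the MDS reported in its beacons."""
    return sorted({state for state, _seq in beacon_sends(log_lines)})
```

Feeding it the lines of the MDS log file (for example via open(path) in a with block) lists each beacon the daemon sent; for the sample here every entry is up:replay with the sequence number going up by one each time.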