Re: MDS stuck replaying

John Spray <jspray@...> writes:

> Anyway -- you'll need to do some local poking of the MDS to work out
> what the hold up is.  Turn up MDS debug logging[1] and see what
> it's saying during the replay.  Also, you can use performance counters
> "ceph daemon mds.<id> perf dump" and see which are incrementing to get
> an idea of what it's doing.  The "rd_pos" value from "perf dump
> mds_log" should increment during replay.  If you haven't already, also
> check the overall health of the MDS host, e.g. is it low on
> memory/swapping?
> 
> John
> 
> 1. http://docs.ceph.com/docs/master/rados/configuration/ceph-conf/#runtime-changes
> 


Hi John,

Thanks for your help.  There doesn't seem to be anything obviously wrong with
any of the three MDSes.  They're idle and have plenty of free memory and
disk space.  I briefly stopped the MDS that was active, and one of the
others took over and resumed replaying.  Looking at the "rd_pos" on the
active MDS, it's sitting at a large number (226119121516) and hasn't changed
in ten minutes or so.
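
For anyone wanting to repeat the "rd_pos" check without watching it by hand,
here is a minimal Python sketch. It assumes it is run on the MDS host (the
"ceph daemon" command talks to the local admin socket) and that the JSON
output nests the counters under an "mds_log" key. The function names, the
sample count and the interval are made up for illustration; "mds.<id>" is a
placeholder for the real daemon name.

```python
import json
import subprocess
import time


def extract_rd_pos(perf_dump_text):
    """Return the journal read position from 'perf dump mds_log' JSON output."""
    data = json.loads(perf_dump_text)
    # Assumption: counters are nested under an "mds_log" section.
    return int(data["mds_log"]["rd_pos"])


def is_stalled(samples):
    """True if at least two samples were taken and rd_pos never changed."""
    return len(samples) >= 2 and len(set(samples)) == 1


def poll_rd_pos(daemon_name, count=5, interval=10.0):
    """Sample rd_pos 'count' times, sleeping 'interval' seconds in between."""
    samples = []
    for i in range(count):
        out = subprocess.check_output(
            ["ceph", "daemon", daemon_name, "perf", "dump", "mds_log"]
        )
        samples.append(extract_rd_pos(out.decode("utf-8")))
        if i < count - 1:
            time.sleep(interval)
    return samples
```

Calling is_stalled(poll_rd_pos("mds.<id>")) would then report whether the
read position moved at all during the sampling window.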

I've turned on debugging with "ceph tell mds.0 injectargs --debug-mds 20
--debug-ms 1 --debug-journaler=10", and here's a sample of what I see in the
MDS log for the active MDS:


2015-12-15 13:21:55.507212 7fbe08aa2700 10 mds.beacon.2 _send up:replay seq 1525
2015-12-15 13:21:55.507238 7fbe08aa2700  1 -- 192.168.1.33:6800/13115 --> 192.168.1.33:6789/0 -- mdsbeacon(46356419/2 up:replay seq 1525 v78698) v3 -- ?+0 0x4e0f800 con 0x4118dc0
2015-12-15 13:21:55.508577 7fbe0c3aa700  1 -- 192.168.1.33:6800/13115 <== mon.2 192.168.1.33:6789/0 1592 ==== mdsbeacon(46356419/2 up:replay seq 1525 v78698) v3 ==== 113+0+0 (2137559923 0 0) 0x7e06300 con 0x4118dc0
2015-12-15 13:21:55.508608 7fbe0c3aa700 10 mds.beacon.2 handle_mds_beacon up:replay seq 1525 rtt 0.001382
2015-12-15 13:21:59.484680 7fbe092a3700 10 MDSInternalContextBase::complete: N3MDS10C_MDS_TickE
2015-12-15 13:21:59.484740 7fbe092a3700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.13>
2015-12-15 13:21:59.507303 7fbe08aa2700 10 mds.beacon.2 _send up:replay seq 1526
2015-12-15 13:21:59.507329 7fbe08aa2700  1 -- 192.168.1.33:6800/13115 --> 192.168.1.33:6789/0 -- mdsbeacon(46356419/2 up:replay seq 1526 v78698) v3 -- ?+0 0x4e0fb00 con 0x4118dc0
2015-12-15 13:21:59.508536 7fbe0c3aa700  1 -- 192.168.1.33:6800/13115 <== mon.2 192.168.1.33:6789/0 1593 ==== mdsbeacon(46356419/2 up:replay seq 1526 v78698) v3 ==== 113+0+0 (4265335703 0 0) 0x7e06000 con 0x4118dc0
2015-12-15 13:21:59.508581 7fbe0c3aa700 10 mds.beacon.2 handle_mds_beacon up:replay seq 1526 rtt 0.001264
2015-12-15 13:21:59.510012 7fbe0cbab700  1 -- 192.168.1.33:6800/13115 --> 192.168.1.22:6812/5403 -- ping magic: 0 v1 -- ?+0 0x417e900 con 0x4c60580
2015-12-15 13:22:03.507394 7fbe08aa2700 10 mds.beacon.2 _send up:replay seq 1527
2015-12-15 13:22:03.507420 7fbe08aa2700  1 -- 192.168.1.33:6800/13115 --> 192.168.1.33:6789/0 -- mdsbeacon(46356419/2 up:replay seq 1527 v78698) v3 -- ?+0 0x4de8000 con 0x4118dc0
2015-12-15 13:22:03.508767 7fbe0c3aa700  1 -- 192.168.1.33:6800/13115 <== mon.2 192.168.1.33:6789/0 1594 ==== mdsbeacon(46356419/2 up:replay seq 1527 v78698) v3 ==== 113+0+0 (2164974539 0 0) 0x7e05d00 con 0x4118dc0
2015-12-15 13:22:03.508799 7fbe0c3aa700 10 mds.beacon.2 handle_mds_beacon up:replay seq 1527 rtt 0.001390
2015-12-15 13:22:04.484781 7fbe092a3700 10 MDSInternalContextBase::complete: N3MDS10C_MDS_TickE
2015-12-15 13:22:04.484841 7fbe092a3700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.12>
2015-12-15 13:22:04.510142 7fbe0cbab700  1 -- 192.168.1.33:6800/13115 --> 192.168.1.22:6812/5403 -- ping magic: 0 v1 -- ?+0 0x417eac0 con 0x4c60580
2015-12-15 13:22:07.507485 7fbe08aa2700 10 mds.beacon.2 _send up:replay seq 1528
2015-12-15 13:22:07.507511 7fbe08aa2700  1 -- 192.168.1.33:6800/13115 --> 192.168.1.33:6789/0 -- mdsbeacon(46356419/2 up:replay seq 1528 v78698) v3 -- ?+0 0x4de8300 con 0x4118dc0
2015-12-15 13:22:07.508757 7fbe0c3aa700  1 -- 192.168.1.33:6800/13115 <== mon.2 192.168.1.33:6789/0 1595 ==== mdsbeacon(46356419/2 up:replay seq 1528 v78698) v3 ==== 113+0+0 (248276317 0 0) 0x7e05a00 con 0x4118dc0
2015-12-15 13:22:07.508788 7fbe0c3aa700 10 mds.beacon.2 handle_mds_beacon up:replay seq 1528 rtt 0.001288
2015-12-15 13:22:09.484881 7fbe092a3700 10 MDSInternalContextBase::complete: N3MDS10C_MDS_TickE
2015-12-15 13:22:09.484957 7fbe092a3700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.11>
2015-12-15 13:22:09.510286 7fbe0cbab700  1 -- 192.168.1.33:6800/13115 --> 192.168.1.22:6812/5403 -- ping magic: 0 v1 -- ?+0 0x417ec80 con 0x4c60580
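
A rough way to see which components are producing output in a sample like
the one above is to count the first token of each message. The Python sketch
below assumes the layout seen in this log (date, time, thread id, debug
level, then the message). Lines that do not start with a timestamp are
skipped, so wrapped continuation lines do no harm. The function name is
made up for illustration.

```python
import re
from collections import Counter

# Matches: date, time, hex thread id, debug level, first token of the message.
LINE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ "  # date and time
    r"[0-9a-f]+ +"                                  # thread id
    r"-?\d+ "                                       # debug level
    r"(\S+)"                                        # first token of the message
)


def summarize_log(text):
    """Count log entries by the first token of their message."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding the MDS log file contents to summarize_log() and printing
most_common() gives a quick picture of what the daemon is spending its log
lines on. In this layout the messenger lines show up under the "--" key.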
