Re: mimic: MDS standby-replay causing blocked ops (MDS bug?)

Hi Stefan, cc Yan,

thanks for your quick reply.

> I am pretty sure you hit bug #26982: https://tracker.ceph.com/issues/26982 "mds: crash when dumping ops in flight".

Everything is fine; the daemon did not crash. The dump cache operation seems to be a blocking operation: it simply blocked the MDS on ceph-08 for too long, so the mons decided to fail over to the MDS on ceph-12. The MDS on ceph-08 has been up for almost 5 days:

[root@ceph-mds:ceph-08 /]# ps -e -o pid,etime,cmd
    PID     ELAPSED CMD
      1  4-21:03:44 /bin/bash /entrypoint.sh mds
    190  4-21:03:43 /usr/bin/ceph-mds --cluster ceph --setuser ceph --setgroup ceph -d -i ceph-08
  31344       02:42 /bin/bash
  31364       00:00 ps -e -o pid,etime,cmd
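As a quick cross-check of the "almost 5 days" figure, the ELAPSED column can be converted to seconds. This is only a minimal sketch: `etime_to_seconds` is a hypothetical helper name, and it assumes the usual `[[dd-]hh:]mm:ss` layout that `ps -o etime` prints.

```python
def etime_to_seconds(etime: str) -> int:
    """Convert a ps 'etime' value ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    # Pad missing leading fields (hours) with zeros.
    fields = [int(f) for f in etime.split(":")]
    while len(fields) < 3:
        fields.insert(0, 0)
    hours, minutes, seconds = fields
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# ELAPSED value of the ceph-mds process (PID 190) from the listing above.
uptime = etime_to_seconds("4-21:03:44")
print(uptime, "seconds, i.e. about", round(uptime / 86400, 2), "days")
```

An uptime of this length means the process was started well before the 'dump cache' command was issued, which is consistent with the daemon not having crashed or restarted.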

The relevant section from the syslog is (filtered by 'grep -i mds'):

May 18 10:20:45 ceph-08 journal: 2019-05-18 08:20:45.400 7f1c99552700  1 mds.ceph-08 asok_command: dump cache (starting...)
May 18 10:20:45 ceph-08 journal: 2019-05-18 08:20:45.400 7f1c99552700  1 mds.0.cache dump_cache to /var/log/ceph/mds-case/cache
May 18 10:20:51 ceph-01 journal: cluster 2019-05-18 08:20:44.135690 mds.ceph-08 mds.0 192.168.32.72:6800/314672380 2554 : cluster [WRN] 7 slow requests, 0 included below; oldest blocked for > 1931.724397 secs
May 18 10:20:51 ceph-03 journal: cluster 2019-05-18 08:20:44.135690 mds.ceph-08 mds.0 192.168.32.72:6800/314672380 2554 : cluster [WRN] 7 slow requests, 0 included below; oldest blocked for > 1931.724397 secs
May 18 10:20:51 ceph-02 journal: cluster 2019-05-18 08:20:44.135690 mds.ceph-08 mds.0 192.168.32.72:6800/314672380 2554 : cluster [WRN] 7 slow requests, 0 included below; oldest blocked for > 1931.724397 secs
May 18 10:21:01 ceph-08 journal: 2019-05-18 08:21:01.414 7f1c952c1700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 18 10:21:01 ceph-08 journal: 2019-05-18 08:21:01.414 7f1c952c1700  0 mds.beacon.ceph-08 _send skipping beacon, heartbeat map not healthy
May 18 10:21:03 ceph-08 journal: 2019-05-18 08:21:03.549 7f1c99d53700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 18 10:21:05 ceph-08 journal: 2019-05-18 08:21:05.414 7f1c952c1700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 18 10:21:05 ceph-08 journal: 2019-05-18 08:21:05.414 7f1c952c1700  0 mds.beacon.ceph-08 _send skipping beacon, heartbeat map not healthy
May 18 10:21:08 ceph-08 journal: 2019-05-18 08:21:08.549 7f1c99d53700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 18 10:21:09 ceph-08 journal: 2019-05-18 08:21:09.415 7f1c952c1700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 18 10:21:09 ceph-08 journal: 2019-05-18 08:21:09.415 7f1c952c1700  0 mds.beacon.ceph-08 _send skipping beacon, heartbeat map not healthy
May 18 10:21:13 ceph-01 journal: debug 2019-05-18 08:21:13.021 7f38552b8700  1 mon.ceph-01@0(leader).mds e16312 no beacon from mds.0.15942 (gid: 327273 addr: 192.168.32.72:6800/314672380 state: up:active) since 15.6064s
May 18 10:21:13 ceph-01 journal: debug 2019-05-18 08:21:13.021 7f38552b8700  1 mon.ceph-01@0(leader).mds e16312  replacing 327273 192.168.32.72:6800/314672380mds.0.15942 up:active with 457451/ceph-12 192.168.32.76:6800/3202682100
May 18 10:21:13 ceph-01 journal: debug 2019-05-18 08:21:13.021 7f38552b8700  0 log_channel(cluster) log [WRN] : daemon mds.ceph-08 is not responding, replacing it as rank 0 with standby daemon mds.ceph-12
May 18 10:21:13 ceph-01 journal: debug 2019-05-18 08:21:13.021 7f38552b8700  1 mon.ceph-01@0(leader).mds e16312 fail_mds_gid 327273 mds.ceph-08 role 0
May 18 10:21:13 ceph-01 journal: debug 2019-05-18 08:21:13.038 7f38552b8700  0 log_channel(cluster) log [WRN] : Health check failed: insufficient standby MDS daemons available (MDS_INSUFFICIENT_STANDBY)
May 18 10:21:13 ceph-01 journal: debug 2019-05-18 08:21:13.038 7f38552b8700  0 log_channel(cluster) log [INF] : Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
May 18 10:21:13 ceph-01 journal: debug 2019-05-18 08:21:13.105 7f384eaab700  0 mon.ceph-01@0(leader).mds e16313 new map
May 18 10:21:13 ceph-01 journal: debug 2019-05-18 08:21:13.105 7f384eaab700  0 mon.ceph-01@0(leader).mds e16313 print_map
May 18 10:21:13 ceph-01 journal: compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
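
The sequence in the log above (dump cache starts, heartbeat map times out, mon sees no beacon) can be put on a timeline by diffing the embedded Ceph timestamps. This is only a rough sketch: `first_stamp` is a hypothetical helper, the three sample lines are copied from the excerpt, and comparing MDS and mon timestamps assumes the clocks on ceph-08 and ceph-01 are in sync.

```python
import re
from datetime import datetime

# Three lines copied verbatim from the syslog excerpt above.
LOG = """\
May 18 10:20:45 ceph-08 journal: 2019-05-18 08:20:45.400 7f1c99552700  1 mds.ceph-08 asok_command: dump cache (starting...)
May 18 10:21:01 ceph-08 journal: 2019-05-18 08:21:01.414 7f1c952c1700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 18 10:21:13 ceph-01 journal: debug 2019-05-18 08:21:13.021 7f38552b8700  1 mon.ceph-01@0(leader).mds e16312 no beacon from mds.0.15942 (gid: 327273 addr: 192.168.32.72:6800/314672380 state: up:active) since 15.6064s
"""

# Matches the daemon's own timestamp, not the syslog prefix.
STAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+")


def first_stamp(log: str, marker: str) -> datetime:
    """Return the Ceph timestamp of the first line containing marker."""
    for line in log.splitlines():
        if marker in line:
            match = STAMP.search(line)
            if match:
                return datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S.%f")
    raise ValueError(f"no timestamped line contains {marker!r}")


t_dump = first_stamp(LOG, "dump cache (starting...)")
t_timeout = first_stamp(LOG, "had timed out after 15")
t_no_beacon = first_stamp(LOG, "no beacon from")

print("dump start -> first heartbeat timeout:", (t_timeout - t_dump).total_seconds(), "s")
print("dump start -> mon reports no beacon:  ", (t_no_beacon - t_dump).total_seconds(), "s")
```

Both intervals start at the 'dump cache' command, which fits the reading that the dump blocked the MDS until the mons replaced it.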

Sorry, I should have checked this first.

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
