Ok. I just ran into this issue again. The mds rolled after many clients were failing to relieve cache pressure. Now here is the result of `ceph -s`:

# ceph -s
    cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0
     health HEALTH_OK
     monmap e1: 5 mons at {a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0}
            election epoch 130, quorum 0,1,2,3,4 a154,a155,a189,a190,a191
     mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 up:standby-replay, 3 up:standby
     osdmap e118886: 192 osds: 192 up, 192 in
      pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 63571 kobjects
            69601 GB used, 37656 GB / 104 TB avail
               11309 active+clean
                  13 active+clean+scrubbing
                   6 active+clean+scrubbing+deep

And here are the ops in flight:

# ceph daemon mds.a190 dump_ops_in_flight
{
    "ops": [],
    "num_ops": 0
}

And a tail of the active mds log at debug_mds 5/5:

2016-09-21 20:15:53.354226 7fce3b626700 4 mds.0.server handle_client_request client_request(client.585124080:17863 lookup #1/stream2store 2016-09-21 20:15:53.352390) v2
2016-09-21 20:15:53.354234 7fce3b626700 5 mds.0.server session closed|closing|killing, dropping
2016-09-21 20:15:54.867108 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 235) v1 from client.507429717
2016-09-21 20:15:54.980907 7fce3851f700 2 mds.0.cache check_memory_usage total 1475784, rss 666432, heap 79712, malloc 584052 mmap 0, baseline 79712, buffers 0, max 1048576, 0 / 93392 inodes have caps, 0 caps, 0 caps per inode
2016-09-21 20:15:54.980960 7fce3851f700 5 mds.0.bal mds.0 epoch 38 load mdsload<[0,0 0]/[0,0 0], req 1987, hr 0, qlen 0, cpu 0.34>
2016-09-21 20:15:55.247885 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 233) v1 from client.538555196
2016-09-21 20:15:55.455566 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 365) v1 from client.507390467
2016-09-21 20:15:55.807704 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 367) v1 from client.538485341
2016-09-21 20:15:56.243462 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 189) v1 from client.538577596
2016-09-21 20:15:56.986901 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 232) v1 from client.507430372
2016-09-21 20:15:57.026206 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.491885158
2016-09-21 20:15:57.369281 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390682
2016-09-21 20:15:57.445687 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.538485996
2016-09-21 20:15:57.579268 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.538486021
2016-09-21 20:15:57.595568 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390702
2016-09-21 20:15:57.604356 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390712
2016-09-21 20:15:57.693546 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390717
2016-09-21 20:15:57.819536 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.491885168
2016-09-21 20:15:57.894058 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390732
2016-09-21 20:15:57.983329 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390742
2016-09-21 20:15:58.077915 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.538486031
2016-09-21 20:15:58.141710 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.491885178
2016-09-21 20:15:58.159134 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.491885188

-Chris
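(For reference, a minimal sketch of the admin-socket calls behind the output above, assuming the active daemon is named mds.a190 as it is here; exact asok command availability can vary between releases, so treat these as approximate for 0.94.x:)

    # raise MDS log verbosity at runtime (what produced the log tail above)
    ceph daemon mds.a190 config set debug_mds 5/5
    # show the operations the MDS is currently processing (empty in this case)
    ceph daemon mds.a190 dump_ops_in_flight
    # list client sessions, including their state and cap counts
    ceph daemon mds.a190 session ls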
On 9/21/16, 11:23 AM, "Heller, Chris" <cheller@xxxxxxxxxx> wrote:

Perhaps related, I was watching the active mds with debug_mds set to 5/5, when I saw this in the log:

2016-09-21 15:13:26.067698 7fbaec248700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.238:0/3488321578 pipe(0x55db000 sd=49 :6802 s=2 pgs=2 cs=1 l=0 c=0x5631ce0).fault with nothing to send, going to standby
2016-09-21 15:13:26.067717 7fbaf64ea700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.214:0/3252234463 pipe(0x54d1000 sd=76 :6802 s=2 pgs=2 cs=1 l=0 c=0x237e8420).fault with nothing to send, going to standby
2016-09-21 15:13:26.067725 7fbb0098e700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.204:0/2963585795 pipe(0x3bf1000 sd=55 :6802 s=2 pgs=2 cs=1 l=0 c=0x15c29020).fault with nothing to send, going to standby
2016-09-21 15:13:26.067743 7fbb026ab700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.192:0/4235516229 pipe(0x562b000 sd=83 :6802 s=2 pgs=2 cs=1 l=0 c=0x237e91e0).fault, server, going to standby
2016-09-21 15:13:26.067749 7fbae840a700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.214:0/3290400005 pipe(0x2a38a000 sd=74 :6802 s=2 pgs=2 cs=1 l=0 c=0x13b6c160).fault with nothing to send, going to standby
2016-09-21 15:13:26.067783 7fbadb239700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.141:0/229472938 pipe(0x268d2000 sd=87 :6802 s=2 pgs=2 cs=1 l=0 c=0x28e24f20).fault with nothing to send, going to standby
2016-09-21 15:13:26.067803 7fbafe66b700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.193:0/2637929639 pipe(0x29582000 sd=80 :6802 s=2 pgs=2 cs=1 l=0 c=0x237e9760).fault with nothing to send, going to standby
2016-09-21 15:13:26.067876 7fbb01a9f700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.228:0/581679898 pipe(0x2384f000 sd=103 :6802 s=2 pgs=2 cs=1 l=0 c=0x2f92f5a0).fault with nothing to send, going to standby
2016-09-21 15:13:26.067886 7fbb01ca1700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.145:0/586636299 pipe(0x25806000 sd=101 :6802 s=2 pgs=2 cs=1 l=0 c=0x2f92cc60).fault with nothing to send, going to standby
2016-09-21 15:13:26.067865 7fbaf43c9700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.234:0/3131612847 pipe(0x2fbe5000 sd=120 :6802 s=2 pgs=2 cs=1 l=0 c=0x37c902c0).fault with nothing to send, going to standby
2016-09-21 15:13:26.067910 7fbaf4ed4700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.236:0/650394434 pipe(0x2fbe0000 sd=116 :6802 s=2 pgs=2 cs=1 l=0 c=0x56a5440).fault with nothing to send, going to standby
2016-09-21 15:13:26.067911 7fbb01196700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.149:0/821983967 pipe(0x1420b000 sd=104 :6802 s=2 pgs=2 cs=1 l=0 c=0x2f92cf20).fault with nothing to send, going to standby
2016-09-21 15:13:26.068076 7fbafc64b700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.190:0/1817596579 pipe(0x36829000 sd=124 :6802 s=2 pgs=2 cs=1 l=0 c=0x31f7a100).fault with nothing to send, going to standby
2016-09-21 15:13:26.068095 7fbafff84700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.140:0/1112150414 pipe(0x5679000 sd=125 :6802 s=2 pgs=2 cs=1 l=0 c=0x41bc7e0).fault with nothing to send, going to standby
2016-09-21 15:13:26.068108 7fbb0de0e700 5 mds.0.953 handle_mds_map epoch 8471 from mon.3
2016-09-21 15:13:26.068114 7fbaf890e700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.238:0/1422203298 pipe(0x29630000 sd=44 :6802 s=2 pgs=2 cs=1 l=0 c=0x3a740dc0).fault with nothing to send, going to standby
2016-09-21 15:13:26.068143 7fbae860c700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.217:0/1120082018 pipe(0x2a724000 sd=121 :6802 s=2 pgs=2 cs=1 l=0 c=0x31f79e40).fault with nothing to send, going to standby
2016-09-21 15:13:26.068190 7fbb040c5700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.218:0/3945638891 pipe(0x50c0000 sd=53 :6802 s=2 pgs=2 cs=1 l=0 c=0x56f4420).fault with nothing to send, going to standby
2016-09-21 15:13:26.068200 7fbaf961b700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.144:0/2952053583 pipe(0x318dc000 sd=81 :6802 s=2 pgs=2 cs=1 l=0 c=0x286fa840).fault with nothing to send, going to standby
2016-09-21 15:13:26.068232 7fbaf981d700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.159:0/1872775873 pipe(0x268d7000 sd=38 :6802 s=2 pgs=2 cs=1 l=0 c=0x56f6940).fault with nothing to send, going to standby
2016-09-21 15:13:26.068253 7fbaeac32700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.186:0/4141441999 pipe(0x54e7000 sd=86 :6802 s=2 pgs=2 cs=1 l=0 c=0x286fb760).fault with nothing to send, going to standby
2016-09-21 15:13:26.068275 7fbb0de0e700 1 mds.-1.-1 handle_mds_map i (192.168.1.196:6802/13581) dne in the mdsmap, respawning myself
2016-09-21 15:13:26.068289 7fbb0de0e700 1 mds.-1.-1 respawn
2016-09-21 15:13:26.068294 7fbb0de0e700 1 mds.-1.-1 e: 'ceph-mds'
2016-09-21 15:13:26.173095 7f689baa8780 0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mds, pid 13581
2016-09-21 15:13:26.175664 7f689baa8780 -1 mds.-1.0 log_to_monitors {default=true}
2016-09-21 15:13:27.329181 7f68969e9700 1 mds.-1.0 handle_mds_map standby
2016-09-21 15:13:28.484148 7f68969e9700 1 mds.-1.0 handle_mds_map standby
2016-09-21 15:13:33.280376 7f68969e9700 1 mds.-1.0 handle_mds_map standby

On 9/21/16, 10:48 AM, "Heller, Chris" <cheller@xxxxxxxxxx> wrote:

I'll see if I can capture the output the next time this issue arises, but in general the output looks as if nothing is wrong. No OSDs are down, 'ceph health detail' reports HEALTH_OK, and the mds server is in the up:active state; at least from the summary, it's as if nothing is wrong server side.

-Chris

On 9/21/16, 10:46 AM, "Gregory Farnum" <gfarnum@xxxxxxxxxx> wrote:

On Wed, Sep 21, 2016 at 6:30 AM, Heller, Chris <cheller@xxxxxxxxxx> wrote:
> I'm running a production 0.94.7 Ceph cluster, and have been seeing a
> periodic issue arise wherein all my MDS clients will become stuck, and the
> fix so far has been to restart the active MDS (sometimes I need to restart
> the subsequent active MDS as well).
>
> These clients are using the cephfs-hadoop API, so there is no kernel client
> or FUSE API involved. When I see clients get stuck, there are messages
> printed to stderr like the following:
>
> 2016-09-21 10:31:12.285030 7fea4c7fb700 0 -- 192.168.1.241:0/1606648601 >>
> 192.168.1.195:6801/1674 pipe(0x7feaa0a1e0f0 sd=206 :0 s=1 pgs=0 cs=0 l=0
> c=0x7feaa0a0c500).fault
>
> I'm at somewhat of a loss on where to begin debugging this issue, and wanted
> to ping the list for ideas.

What's the full output of "ceph -s" when this happens? Have you looked at the MDS' admin socket's ops-in-flight, and that of the clients?
http://docs.ceph.com/docs/master/cephfs/troubleshooting/ may help some as well.

> I managed to dump the mds cache during one of the stalled moments, which
> hopefully is a useful starting point:
>
> e51bed37327a676e9974d740a13e173f11d1a11fdba5fbcf963b62023b06d7e8
> mdscachedump.txt.gz (https://filetea.me/t1sz3XPHxEVThOk8tvVTK5Bsg)
>
> -Chris

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
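(As a rough sketch of how the client side of Greg's question could be inspected for libcephfs-based consumers such as cephfs-hadoop: libcephfs can expose its own admin socket if one is enabled in ceph.conf on the client host. The socket path and client name below are illustrative only, and the available asok commands vary by release, so verify against 0.94.x before relying on them.)

    # ceph.conf on the client host, so the libcephfs client registers an admin socket
    [client]
        admin socket = /var/run/ceph/$cluster-$name.$pid.asok

    # then query whichever socket the stuck process created, for example:
    ceph daemon /var/run/ceph/ceph-client.hadoop.12345.asok mds_sessions
    ceph daemon /var/run/ceph/ceph-client.hadoop.12345.asok mds_requests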