Faulting MDS clients, HEALTH_OK

I’m running a production 0.94.7 (hammer) Ceph cluster and have been seeing a periodic issue in which all of my MDS clients become stuck. So far the only fix has been to restart the active MDS (and sometimes the MDS that subsequently becomes active as well).
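
For what it’s worth, the workaround looks roughly like this; the MDS id and the init-system commands are specific to my setup, so treat it as a sketch rather than the exact procedure:

  # check which MDS is currently active
  ceph mds stat
  ceph -s

  # restart the active MDS (sysvinit here; on systemd it would be
  # "systemctl restart ceph-mds@<id>")
  sudo service ceph restart mds.<id>

  # watch for the standby to take over; if clients are still stuck once
  # it goes active, restart that MDS too
  ceph -w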

 

These clients use the cephfs-hadoop API, so neither the kernel client nor the FUSE client is involved. When the clients get stuck, messages like the following are printed to stderr:

 

2016-09-21 10:31:12.285030 7fea4c7fb700  0 -- 192.168.1.241:0/1606648601 >> 192.168.1.195:6801/1674 pipe(0x7feaa0a1e0f0 sd=206 :0 s=1 pgs=0 cs=0 l=0 c=0x7feaa0a0c500).fault

 

I’m somewhat at a loss as to where to begin debugging this, so I wanted to ping the list for ideas.
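
If more verbose logs would help, I can bump the debug levels on the active MDS the next time this happens. I assume something along these lines is the right way to do it at runtime (happy to be corrected):

  # raise MDS and messenger debugging on the active MDS without a restart
  ceph tell mds.<id> injectargs '--debug_mds 20 --debug_ms 1'

  # drop them back to the defaults once the stall has been captured
  ceph tell mds.<id> injectargs '--debug_mds 1 --debug_ms 0'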

 

I managed to dump the MDS cache during one of the stalls, which will hopefully be a useful starting point:

 

e51bed37327a676e9974d740a13e173f11d1a11fdba5fbcf963b62023b06d7e8  mdscachedump.txt.gz (https://filetea.me/t1sz3XPHxEVThOk8tvVTK5Bsg)
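
For reference, the dump was taken on the active MDS with the cache dump command; I believe the invocation was roughly the following (the rank and output path are from my setup, and I may be misremembering the exact form on hammer, where the newer "ceph daemon mds.<name> dump cache <file>" admin socket command may not be available):

  # ask the active MDS (rank 0 here) to dump its cache to a file
  ceph mds tell 0 dumpcache /tmp/mdscachedump.txt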

 

 

-Chris

 

