Re: Faulting MDS clients, HEALTH_OK


 



OK, I just ran into this issue again. The MDS rolled over after many clients were failing to relieve cache pressure.

Now here is the result of `ceph -s`:

# ceph -s
    cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0
     health HEALTH_OK
     monmap e1: 5 mons at {a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0}
            election epoch 130, quorum 0,1,2,3,4 a154,a155,a189,a190,a191
     mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 up:standby-replay, 3 up:standby
     osdmap e118886: 192 osds: 192 up, 192 in
      pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 63571 kobjects
            69601 GB used, 37656 GB / 104 TB avail
               11309 active+clean
                  13 active+clean+scrubbing
                   6 active+clean+scrubbing+deep

And here are the ops in flight:

# ceph daemon mds.a190 dump_ops_in_flight
{
    "ops": [],
    "num_ops": 0
}
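
Also, since the log below shows client requests being dropped because their sessions are closed, it may be worth listing the sessions the MDS currently holds. A minimal sketch, assuming the admin socket for mds.a190 is reachable on its host as above and that this release supports the session ls admin-socket command:

# ceph daemon mds.a190 session ls

That should report each client session along with its state and cap count, which would show whether the stuck clients' sessions were closed or went stale.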

And a tail of the active MDS log at debug_mds 5/5:

2016-09-21 20:15:53.354226 7fce3b626700  4 mds.0.server handle_client_request client_request(client.585124080:17863 lookup #1/stream2store 2016-09-21 20:15:53.352390) v2
2016-09-21 20:15:53.354234 7fce3b626700  5 mds.0.server session closed|closing|killing, dropping
2016-09-21 20:15:54.867108 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 235) v1 from client.507429717
2016-09-21 20:15:54.980907 7fce3851f700  2 mds.0.cache check_memory_usage total 1475784, rss 666432, heap 79712, malloc 584052 mmap 0, baseline 79712, buffers 0, max 1048576, 0 / 93392 inodes have caps, 0 caps, 0 caps per inode
2016-09-21 20:15:54.980960 7fce3851f700  5 mds.0.bal mds.0 epoch 38 load mdsload<[0,0 0]/[0,0 0], req 1987, hr 0, qlen 0, cpu 0.34>
2016-09-21 20:15:55.247885 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 233) v1 from client.538555196
2016-09-21 20:15:55.455566 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 365) v1 from client.507390467
2016-09-21 20:15:55.807704 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 367) v1 from client.538485341
2016-09-21 20:15:56.243462 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 189) v1 from client.538577596
2016-09-21 20:15:56.986901 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 232) v1 from client.507430372
2016-09-21 20:15:57.026206 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.491885158
2016-09-21 20:15:57.369281 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390682
2016-09-21 20:15:57.445687 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.538485996
2016-09-21 20:15:57.579268 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.538486021
2016-09-21 20:15:57.595568 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390702
2016-09-21 20:15:57.604356 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390712
2016-09-21 20:15:57.693546 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390717
2016-09-21 20:15:57.819536 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.491885168
2016-09-21 20:15:57.894058 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390732
2016-09-21 20:15:57.983329 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.507390742
2016-09-21 20:15:58.077915 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.538486031
2016-09-21 20:15:58.141710 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.491885178
2016-09-21 20:15:58.159134 7fce3b626700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.491885188
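
For reference, the debug_mds 5/5 level used for the tail above can be adjusted at runtime without restarting the daemon. A minimal sketch, assuming the active MDS is mds.a190 as in the status output above; the first form goes through the local admin socket, the second through the monitors:

# ceph daemon mds.a190 config set debug_mds 5/5
# ceph tell mds.a190 injectargs '--debug_mds 5/5'

Dropping it back to the previous level afterwards keeps the log volume manageable.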

-Chris

On 9/21/16, 11:23 AM, "Heller, Chris" <cheller@xxxxxxxxxx> wrote:

    Perhaps related, I was watching the active mds with debug_mds set to 5/5, when I saw this in the log:
    
    2016-09-21 15:13:26.067698 7fbaec248700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.238:0/3488321578 pipe(0x55db000 sd=49 :6802 s=2 pgs=2 cs=1 l=0 c=0x5631ce0).fault with nothing to send, going to standby
    2016-09-21 15:13:26.067717 7fbaf64ea700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.214:0/3252234463 pipe(0x54d1000 sd=76 :6802 s=2 pgs=2 cs=1 l=0 c=0x237e8420).fault with nothing to send, going to standby
    2016-09-21 15:13:26.067725 7fbb0098e700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.204:0/2963585795 pipe(0x3bf1000 sd=55 :6802 s=2 pgs=2 cs=1 l=0 c=0x15c29020).fault with nothing to send, going to standby
    2016-09-21 15:13:26.067743 7fbb026ab700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.192:0/4235516229 pipe(0x562b000 sd=83 :6802 s=2 pgs=2 cs=1 l=0 c=0x237e91e0).fault, server, going to standby
    2016-09-21 15:13:26.067749 7fbae840a700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.214:0/3290400005 pipe(0x2a38a000 sd=74 :6802 s=2 pgs=2 cs=1 l=0 c=0x13b6c160).fault with nothing to send, going to standby
    2016-09-21 15:13:26.067783 7fbadb239700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.141:0/229472938 pipe(0x268d2000 sd=87 :6802 s=2 pgs=2 cs=1 l=0 c=0x28e24f20).fault with nothing to send, going to standby
    2016-09-21 15:13:26.067803 7fbafe66b700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.193:0/2637929639 pipe(0x29582000 sd=80 :6802 s=2 pgs=2 cs=1 l=0 c=0x237e9760).fault with nothing to send, going to standby
    2016-09-21 15:13:26.067876 7fbb01a9f700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.228:0/581679898 pipe(0x2384f000 sd=103 :6802 s=2 pgs=2 cs=1 l=0 c=0x2f92f5a0).fault with nothing to send, going to standby
    2016-09-21 15:13:26.067886 7fbb01ca1700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.145:0/586636299 pipe(0x25806000 sd=101 :6802 s=2 pgs=2 cs=1 l=0 c=0x2f92cc60).fault with nothing to send, going to standby
    2016-09-21 15:13:26.067865 7fbaf43c9700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.234:0/3131612847 pipe(0x2fbe5000 sd=120 :6802 s=2 pgs=2 cs=1 l=0 c=0x37c902c0).fault with nothing to send, going to standby
    2016-09-21 15:13:26.067910 7fbaf4ed4700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.236:0/650394434 pipe(0x2fbe0000 sd=116 :6802 s=2 pgs=2 cs=1 l=0 c=0x56a5440).fault with nothing to send, going to standby
    2016-09-21 15:13:26.067911 7fbb01196700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.149:0/821983967 pipe(0x1420b000 sd=104 :6802 s=2 pgs=2 cs=1 l=0 c=0x2f92cf20).fault with nothing to send, going to standby
    2016-09-21 15:13:26.068076 7fbafc64b700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.190:0/1817596579 pipe(0x36829000 sd=124 :6802 s=2 pgs=2 cs=1 l=0 c=0x31f7a100).fault with nothing to send, going to standby
    2016-09-21 15:13:26.068095 7fbafff84700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.140:0/1112150414 pipe(0x5679000 sd=125 :6802 s=2 pgs=2 cs=1 l=0 c=0x41bc7e0).fault with nothing to send, going to standby
    2016-09-21 15:13:26.068108 7fbb0de0e700  5 mds.0.953 handle_mds_map epoch 8471 from mon.3
    2016-09-21 15:13:26.068114 7fbaf890e700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.238:0/1422203298 pipe(0x29630000 sd=44 :6802 s=2 pgs=2 cs=1 l=0 c=0x3a740dc0).fault with nothing to send, going to standby
    2016-09-21 15:13:26.068143 7fbae860c700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.217:0/1120082018 pipe(0x2a724000 sd=121 :6802 s=2 pgs=2 cs=1 l=0 c=0x31f79e40).fault with nothing to send, going to standby
    2016-09-21 15:13:26.068190 7fbb040c5700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.218:0/3945638891 pipe(0x50c0000 sd=53 :6802 s=2 pgs=2 cs=1 l=0 c=0x56f4420).fault with nothing to send, going to standby
    2016-09-21 15:13:26.068200 7fbaf961b700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.144:0/2952053583 pipe(0x318dc000 sd=81 :6802 s=2 pgs=2 cs=1 l=0 c=0x286fa840).fault with nothing to send, going to standby
    2016-09-21 15:13:26.068232 7fbaf981d700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.159:0/1872775873 pipe(0x268d7000 sd=38 :6802 s=2 pgs=2 cs=1 l=0 c=0x56f6940).fault with nothing to send, going to standby
    2016-09-21 15:13:26.068253 7fbaeac32700  0 -- 192.168.1.196:6802/13581 >> 192.168.1.186:0/4141441999 pipe(0x54e7000 sd=86 :6802 s=2 pgs=2 cs=1 l=0 c=0x286fb760).fault with nothing to send, going to standby
    2016-09-21 15:13:26.068275 7fbb0de0e700  1 mds.-1.-1 handle_mds_map i (192.168.1.196:6802/13581) dne in the mdsmap, respawning myself
    2016-09-21 15:13:26.068289 7fbb0de0e700  1 mds.-1.-1 respawn
    2016-09-21 15:13:26.068294 7fbb0de0e700  1 mds.-1.-1  e: 'ceph-mds'
    2016-09-21 15:13:26.173095 7f689baa8780  0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mds, pid 13581
    2016-09-21 15:13:26.175664 7f689baa8780 -1 mds.-1.0 log_to_monitors {default=true}
    2016-09-21 15:13:27.329181 7f68969e9700  1 mds.-1.0 handle_mds_map standby
    2016-09-21 15:13:28.484148 7f68969e9700  1 mds.-1.0 handle_mds_map standby
    2016-09-21 15:13:33.280376 7f68969e9700  1 mds.-1.0 handle_mds_map standby
    
    On 9/21/16, 10:48 AM, "Heller, Chris" <cheller@xxxxxxxxxx> wrote:
    
        I’ll see if I can capture the output the next time this issue arises, but in general the output looks as if nothing is wrong: no OSDs are down, `ceph health detail` reports HEALTH_OK, and the MDS is in the up:active state. From the summary, at least, nothing appears to be wrong server side.
        
        -Chris
        
        On 9/21/16, 10:46 AM, "Gregory Farnum" <gfarnum@xxxxxxxxxx> wrote:
        
            On Wed, Sep 21, 2016 at 6:30 AM, Heller, Chris <cheller@xxxxxxxxxx> wrote:
            > I’m running a production 0.94.7 Ceph cluster, and have been seeing a
            > periodic issue arise wherein all my MDS clients will become stuck, and the
            > fix so far has been to restart the active MDS (sometimes I need to restart
            > the subsequent active MDS as well).
            >
            >
            >
            > These clients are using the cephfs-hadoop API, so there is no kernel client
            > or FUSE API involved. When I see clients get stuck, there are messages
            > printed to stderr like the following:
            >
            >
            >
            > 2016-09-21 10:31:12.285030 7fea4c7fb700  0 -- 192.168.1.241:0/1606648601 >>
            > 192.168.1.195:6801/1674 pipe(0x7feaa0a1e0f0 sd=206 :0 s=1 pgs=0 cs=0 l=0
            > c=0x7feaa0a0c500).fault
            >
            >
            >
            > I’m at somewhat of a loss on where to begin debugging this issue, and wanted
            > to ping the list for ideas.
            
            What's the full output of "ceph -s" when this happens? Have you looked
            at the MDS' admin socket's ops-in-flight, and that of the clients?
            
            http://docs.ceph.com/docs/master/cephfs/troubleshooting/ may help some as well.
            
            >
            >
            >
            > I managed to dump the mds cache during one of the stalled moments, which
            > hopefully is a useful starting point:
            >
            >
            >
            > e51bed37327a676e9974d740a13e173f11d1a11fdba5fbcf963b62023b06d7e8
            > mdscachedump.txt.gz (https://filetea.me/t1sz3XPHxEVThOk8tvVTK5Bsg)
            >
            >
            >
            >
            >
            > -Chris
            >
            >
            >
            >
            > _______________________________________________
            > ceph-users mailing list
            > ceph-users@xxxxxxxxxxxxxx
            > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
            >
            
        
        _______________________________________________
        ceph-users mailing list
        ceph-users@xxxxxxxxxxxxxx
        http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
        
    
    _______________________________________________
    ceph-users mailing list
    ceph-users@xxxxxxxxxxxxxx
    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
    

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



