Hi John...
Regarding logs, we still do not have them available. We just
realized that ceph-fuse tries to log to /var/log/ceph, which in
our case did not exist on the clients. So, we had to create that
directory everywhere, and we are in the process of remounting
every client so that they start logging. Since unmounting forces
the client to free its inodes, we now have to wait for the
situation to reappear.
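For reference, this is roughly the sequence we are running on each
client (the /cephfs mount point and the monitor address are just
placeholders for our local setup, not anything mandated by
ceph-fuse):
$ mkdir -p /var/log/ceph              # where ceph-fuse tries to log
$ fusermount -u /cephfs               # assumes nothing is still using the mount
$ ceph-fuse -m mon-host:6789 /cephfs  # remount so logging starts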
However, I do have a bit more information. Maybe it can shed some
further light on the topic.
- If I loop through all my clients right now, I get a total of
29604 inodes.
$ cat clients_inodes_20161216-0938.txt | grep inode_count | awk '{print $2}' | sed 's/,//g' | awk '{s+=$1} END {print s}'
29604
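(Roughly speaking, that file is just the 'status' output of each
client's admin socket collected in a loop; a sketch of what I mean,
with a made-up host list file:)
$ for h in $(cat client_hosts.txt); do ssh "$h" ceph --admin-daemon /var/run/ceph/ceph-client.mount_user.asok status; done > clients_inodes_20161216-0938.txt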
- However, the mds reports '"inodes": 1779521' and
'"inodes_with_caps": 32823'. Is there a need for the MDS to keep
in memory such a large number of inodes without associated caps? I
would also expect these to be the first ones to be trimmed once
inodes > inode_max.
"mds": {
(...)
"inode_max": 2000000,
"inodes": 1779521,
"inodes_top": 18119,
"inodes_bottom": 1594129,
"inodes_pin_tail": 167273,
"inodes_pinned": 182643,
"inodes_expired": 53391877,
"inodes_with_caps": 32823,
"caps": 35133,
(...)
},
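(The extract above comes from the mds admin socket; something like
the following pulls out just the counters I keep looking at, with
mds.<id> being the daemon name in your setup:)
$ ceph daemon mds.<id> perf dump | grep -E '"(inode_max|inodes|inodes_with_caps|caps)"'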
- I am also seeing some false positives (I think). As I explained
before, we have currently unmounted all clients except 2 (they are
interactive machines where our users run tmux sessions and so
on... so it is hard to kick them out :-) ). One of those two is
still reported as problematic by the MDS although inodes <
inode_max. Looking at the number of inodes of that machine, I get
"inode_count": 13862. So, it seems that the client is still tagged
as problematic although it has an inode_count below 16384 and
inodes < inode_max. Maybe a consequence of
https://github.com/ceph/ceph/pull/11373 ? And that fix seems to be
going only into Kraken?
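In case it helps anyone reproduce the check: this is roughly how I
am cross-checking which clients the MDS is still complaining about
against what each client reports locally (the exact health text may
differ between releases, so treat the grep pattern as a guess):
$ ceph health detail | grep 'failing to respond to cache pressure'
$ ceph daemon mds.<id> session ls    # per-session details (client id, num_caps, client metadata)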
Cheers
Goncalo
On 12/14/2016 10:16 AM, Goncalo Borges wrote:
Hi John.
Comments inline.
Hi Ceph(FS)ers...
I am currently running in production the following
environment:
- ceph/cephfs in 10.2.2.
- All infrastructure is in the same version (rados cluster, mons,
mds and cephfs clients).
- We mount cephfs using ceph-fuse.
Since yesterday our cluster has been in a warning state with the
message "mds0: Many clients (X) failing to respond to cache
pressure". X has been changing over time, from ~130 to ~70. I am
able to correlate the appearance of this message with bursts of
jobs in our cluster.
This subject has been discussed on the mailing list many times,
and normally the recipe is to look for something wrong in the
clients. So, I have tried to look at the clients first:
1) I've started by looping through all my clients and running
'ceph --admin-daemon /var/run/ceph/ceph-client.mount_user.asok
status' to get the inode_count reported by each client.
$ cat all.txt | grep inode_count | awk '{print $2}' | sed 's/,//g' | awk '{s+=$1} END {print s}'
2407659
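(If a machine carried more than one ceph-fuse mount, one could
simply loop over its admin sockets instead of hard-coding one; just
a sketch, with the asok path being whatever the client is
configured to use:)
$ for sock in /var/run/ceph/ceph-client.*.asok; do ceph --admin-daemon "$sock" status; done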
2) I've then compared that with the number of inodes the mds had
in its cache (obtained by a perf dump):
"inode_max": 2000000 and "inodes": 2413826
3) I've tried to understand how many clients had a number of
inodes higher than 16384 (the default client_cache_size) and got
$ for i in `cat all.txt | grep inode_count | awk '{print $2}' | sed 's/,//g'`; do if [ $i -ge 16384 ]; then echo $i; fi; done | wc -l
27
4) My conclusion is that the bulk of the inodes is held by a
handful of machines. However, while most of them are running user
jobs, others are not doing anything at all. For example, an idle
machine (which had no users logged in, no jobs running, and whose
updatedb does not index the cephfs filesystem) reported more than
300000 inodes. To regain those inodes, I had to unmount and
remount cephfs on that machine.
5) Based on my previous observations I suspect that there are
still some problems in the ceph-fuse client with releasing these
inodes (or it happens at a very slow rate).
Seems that way. Can you come up with a reproducer for us,
and/or
gather some client+mds debug logs where a client is failing to
respond
to cache pressure?
I think I've nailed this down to a specific user workload. Every
time this user runs, it leaves the client with a huge number of
inodes, normally more than 100000. The workload consists of
generating a large number of analysis files spread over multiple
directories. I am going to try to inject some debug parameters and
see what we come up with (a rough sketch of what I have in mind is
below). Will reply on this thread later on.
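Something along these lines should mimic that workload on a
ceph-fuse mount and raise the debug levels while it runs; the mount
point, file counts and debug values below are just my guesses, not
anything official:
$ # create many small files spread over many directories (hypothetical reproducer)
$ for d in $(seq 1 200); do mkdir -p /cephfs/repro/dir$d; for f in $(seq 1 1000); do echo x > /cephfs/repro/dir$d/f$f; done; done
$ # bump debug on the client and on the active mds via their admin sockets
$ ceph --admin-daemon /var/run/ceph/ceph-client.mount_user.asok config set debug_client 20
$ ceph daemon mds.<id> config set debug_mds 10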
Also, what kernel is in use on the clients? It's possible that the
issue is in FUSE itself (or the way that it responds to
ceph-fuse's attempts to ask for some inodes to be released).
All our clusters run SL6 because the CERN experiments' software is
only certified for that OS flavour. Because of the SL6 restriction,
to enable post-infernalis ceph clients on those machines, we have
to recompile ceph-fuse as well as some of its dependencies which
are not available in SL6. In summary, we recompile ceph-fuse 10.2.2
with gcc 4.8.4 against boost-1.53.0-25 and fuse-2.9.7. The kernel
version on the clients is 2.6.32-642.6.2.el6.x86_64.
Thanks for the explanations about the mds memory usage. I am glad
something is on its way to make the memory usage more effective.
Cheers
Goncalo
However, I also do not completely
understand what is happening on the server
side:
6) The current memory usage of my mds is the following:
PID USER PR NI VIRT RES SHR S %CPU %MEM
TIME+ COMMAND
17831 ceph 20 0 13.667g 0.012t 10048 S 37.5 40.2
1068:47 ceph-mds
The mds cache size is set to 2000000. Running 'ceph daemon
mds.<id> perf dump', I get "inode_max": 2000000 and
"inodes": 2413826. Assuming ~4 KB per inode, one gets ~10 GB. So
why is it taking much more than that?
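(Spelling out the arithmetic behind that estimate, and assuming the
0.012t RES value in the top output above means roughly 12.3 GiB:
2,413,826 inodes x 4 KiB ~= 9.2 GiB (~9.9 GB), so the resident size
is a few GiB beyond the naive per-inode estimate.)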
7) I have been running cephfs for more than a year and, looking at
ganglia, the mds memory never decreases but always increases (even
in cases where we unmount almost all the clients). Why does that
happen?
Coincidentally someone posted about this on ceph-devel just
yesterday.
The answer is that the MDS uses memory pools for allocation, and
it
doesn't (currently) ever bother releasing memory back to the
operating
system because it's doing its own cache size enforcement.
However,
when the cache size limits aren't being enforced (for example
because
of clients failing to release caps) this becomes a problem.
There's a
patch for master (https://github.com/ceph/ceph/pull/12443).
8) I am running 2 mds daemons, in active / standby-replay mode.
The memory usage of the standby-replay is much lower:
PID USER PR NI VIRT RES SHR S %CPU %MEM
TIME+ COMMAND
716 ceph 20 0 6149424 5.115g 8524 S 1.2 43.6
53:19.74 ceph-mds
If I trigger a restart of my active mds, the standby-replay will
start acting as active but will continue with the same amount of
memory. Why can the second mds become active and do the same job
while using so much less memory?
Presumably this also makes sense once you know about the
allocator in use.
9) Finally, I am sending an extract of 'ceph daemon mds.<id> perf
dump' from my active and standby mdses. What exactly is the
meaning of inodes_pin_tail, inodes_expired and inodes_with_caps?
Is the standby mds supposed to show the same numbers? They
don't...
It's not really possible to explain these counters without a
substantial explanation of MDS internals, sorry. I will say
though
that there is absolutely no guarantee of performance counters on
the
standby replay daemon matching those on the active daemon.
John
Thanks in advance for your answers /
suggestions
Cheers
Goncalo
active:
"mds": {
"request": 93941296,
"reply": 93940671,
"reply_latency": {
"avgcount": 93940671,
"sum": 188398.004552299
},
"forward": 0,
"dir_fetch": 309878,
"dir_commit": 1736194,
"dir_split": 0,
"inode_max": 2000000,
"inodes": 2413826,
"inodes_top": 201,
"inodes_bottom": 568,
"inodes_pin_tail": 2413057,
"inodes_pinned": 2413303,
"inodes_expired": 19693168,
"inodes_with_caps": 2409737,
"caps": 2440565,
"subtrees": 2,
"traverse": 113291068,
"traverse_hit": 57822611,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 154708,
"traverse_remote_ino": 1085,
"traverse_lock": 66063,
"load_cent": 9394314733,
"q": 22,
"exported": 0,
"exported_inodes": 0,
"imported": 0,
"imported_inodes": 0
},
standby-replay:
"mds": {
"request": 0,
"reply": 0,
"reply_latency": {
"avgcount": 0,
"sum": 0.000000000
},
"forward": 0,
"dir_fetch": 0,
"dir_commit": 0,
"dir_split": 0,
"inode_max": 2000000,
"inodes": 2000058,
"inodes_top": 0,
"inodes_bottom": 1993207,
"inodes_pin_tail": 6851,
"inodes_pinned": 124135,
"inodes_expired": 10651484,
"inodes_with_caps": 0,
"caps": 0,
"subtrees": 2,
"traverse": 0,
"traverse_hit": 0,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 0,
"traverse_remote_ino": 0,
"traverse_lock": 0,
"load_cent": 0,
"q": 0,
"exported": 0,
"exported_inodes": 0,
"imported": 0,
"imported_inodes": 0
},
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com