Hi,
On 2020/05/08 17:47, Ravishankar N wrote:
On 08/05/20 7:07 pm, Jaco Kroon wrote:
Nope, nothing as of today as far as I know. You can have the locks xlator log everything by setting the locks.trace option to ON though.
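(Presumably that would go through the normal volume set path, something like the following, with <volname> a placeholder; expect very chatty brick logs while it is on:)

    gluster volume set <volname> locks.trace on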
I'm not sure "stuck" is the right word, but looking at the "statistics heal-count" values it goes into a form of "go slow" mode, and ends up adding more entries for heal in some cases at a rate of about 2 every second, sometimes 4 at worst (based on ~10-15 second intervals depending on how long it takes to gather the heal-counts).
My guess currently is that since the log entries that scroll by relates to metadata and entry heals, these are when it's healing files, and when it hits a folder it somehow locks something, then iterates through the folder (very) slowly for some reason, and whilst this is happening, directory listings are outright unusable.
Is there lock profiling code available in glusterfs?
Might be worthwhile to look at building something in. In the meantime, herewith my strace analysis of one of the glusterfsd processes:
strace -p 2827 -f -T 2>/tmp/strace.txt
This pid represents the brick process for zoidberg:/mnt/gluster/a
Processed output (a b c) means "system call c took between [b, b+1) seconds to return, a times during the sample period". I've marked the futex calls (used for contended mutex cases) in bold. The only system calls that ever took longer than 1.000000s were: nanosleep, futex, epoll_wait (hard to say what that represents in terms of operations; it may just be blocking while waiting for client requests), select (same comment as epoll_wait) and restart_syscall (which I'm betting is safe to assume represents continuations of the other syscalls mentioned here).
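(Not the exact processing used, but a minimal awk sketch of how such a (count, seconds, syscall) table can be produced from strace -T output, assuming the capture is in /tmp/strace.txt:)

    awk 'match($0, /<[0-9.]+>$/) {
        t = substr($0, RSTART + 1, RLENGTH - 2)   # elapsed time from the trailing <...>
        call = $0
        sub(/^\[pid +[0-9]+\] */, "", call)       # drop any "[pid N] " prefix
        sub(/\(.*/, "", call)                     # keep only the syscall name
        hist[int(t) " " call]++                   # bucket by whole seconds
    }
    END { for (k in hist) print hist[k], k }' /tmp/strace.txt | sort -k2,2nr -k1,1nr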
5 30 nanosleep   <-- 5 cases of nanosleep sleeping for 30 seconds
2 27 futex
1 27 epoll_wait
1 23 restart_syscall
1 21 restart_syscall
1 21 futex
1 21 epoll_wait
2 20 futex
2 20 epoll_wait
1 18 futex
1 18 epoll_wait
1 16 restart_syscall
2 13 futex
1 12 futex
19 11 futex
2 11 epoll_wait
2 10 restart_syscall
10 10 futex
1 10 epoll_wait
1 8 restart_syscall
4 8 futex
1 8 epoll_wait
2 7 restart_syscall
5 6 futex
2 6 epoll_wait
24 5 nanosleep
2 5 futex
1 4 restart_syscall
9 4 futex
2 4 epoll_wait
19 3 futex
4 3 epoll_wait
1 2 restart_syscall
69 2 futex
12 2 epoll_wait
522 1 select
1 1 restart_syscall
7 1 nanosleep
338 1 futex
43 1 epoll_wait
19630 0 writev
2467 0 unlink
2 0 tls=0x7f1338009480,
1 0 tls=0x7f1318df9480,
32 0 statfs
3882 0 stat
8 0 set_robust_list
8 0 select
4 0 rt_sigqueueinfo
49 0 rt_sigprocmask
16 0 restart_syscall
39018 0 readv
23490 0 readlink
62 0 pwrite64
158 0 pread64
836 0 openat
87 0 mprotect
4 0 mkdir
8 0 madvise
95834 0 lstat
3068 0 lsetxattr
37 0 lseek
16474 0 llistxattr
3067 0 linkat
36345 0 lgetxattr
4 0 getuid
4 0 getpid
111 0 getdents64
399210 0 futex
300 0 fstat
62 0 fsetxattr
2 0 flistxattr
291 0 fgetxattr
19172 0 epoll_wait
19244 0 epoll_ctl
843 0 close
5 0 clone
2 0 chown
So, looking at just the futex calls:
2 27 futex
1 21 futex
2 20 futex
1 18 futex
2 13 futex
1 12 futex
19 11 futex
10 10 futex
4 8 futex
5 6 futex
2 5 futex
9 4 futex
19 3 futex
69 2 futex
338 1 futex
399210 0 futex
That still indicates that the VAST MAJORITY of mutex locks were fast. Performance stats for the same brick (again, I've marked what I find interesting/relevant in bold):
Brick: zoidberg:/mnt/gluster/a
-----------------------------------------
Cumulative Stats:
Block Size:              1b+       4b+       8b+
No. of Reads:              0         0         0
No. of Writes:            23        10        74

Block Size:             16b+      32b+      64b+
No. of Reads:              0       248         1
No. of Writes:           155      1340        32

Block Size:            128b+     256b+     512b+
No. of Reads:              1        27        71
No. of Writes:            63       779       186

Block Size:           1024b+    2048b+    4096b+
No. of Reads:             19        31        43
No. of Writes:           387       797      1490

Block Size:           8192b+   16384b+   32768b+
No. of Reads:            147       254       508
No. of Writes:         41360      3995     34916

Block Size:          65536b+  131072b+
No. of Reads:            941     52418
No. of Writes:          9285     18319
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls   Fop
 ---------   -----------   -----------   -----------   ------------   ----
      0.00       0.00 us       0.00 us       0.00 us         627326   FORGET
      0.00       0.00 us       0.00 us       0.00 us         922811   RELEASE
      0.00       0.00 us       0.00 us       0.00 us          35290   RELEASEDIR
      0.00      10.31 us       5.72 us      14.90 us              2   IPC
      0.00      40.70 us      40.70 us      40.70 us              1   LK
      0.00      65.64 us      65.64 us      65.64 us              1   STAT
      0.00      70.26 us      70.26 us      70.26 us              1   READ
      0.00     176.04 us     176.04 us     176.04 us              1   LINK
      0.00      42.11 us      29.10 us      63.81 us             11   FINODELK
      0.00     282.23 us     192.42 us     372.03 us              2   TRUNCATE
      0.00      65.89 us      20.84 us     316.68 us             11   FLUSH
      0.00     211.73 us      25.36 us     411.40 us              4   READDIRP
      0.00      60.37 us      37.28 us     135.28 us             20   STATFS
      0.00     509.01 us     176.30 us    1213.35 us              4   UNLINK
      0.00     414.76 us      51.19 us    2698.67 us             28   OPENDIR
      0.01     349.84 us     115.94 us    1342.20 us             48   WRITE
      0.04      90.44 us      36.50 us     615.58 us           1199   READLINK
      0.06   25088.70 us    7133.79 us   59634.44 us              7   CREATE
      0.12    7024.50 us      32.80 us   57327.60 us             50   READDIR
      0.16     624.73 us      49.54 us   94746.41 us            751   OPEN
      0.62  167828.35 us     543.79 us  387838.03 us             11   FXATTROP
      0.65     430.52 us      16.83 us  485714.99 us           4526   INODELK
      0.70     618.44 us      14.05 us  178158.97 us           3401   ENTRYLK
      2.12   71297.23 us      16.61 us 2286254.75 us             89   GETXATTR
      2.59    1054.74 us      25.63 us  178938.38 us           7323   LOOKUP
     92.94   88173.46 us     141.60 us  519829.21 us           3149   XATTROP
      0.00       0.00 us       0.00 us       0.00 us           2360   UPCALL
      0.00       0.00 us       0.00 us       0.00 us              2   CI_IATT
      0.00       0.00 us       0.00 us       0.00 us              1   CI_UNLINK
      0.00       0.00 us       0.00 us       0.00 us           2358   CI_FORGET
Duration: 163988 seconds
Data Read: 6998476545 bytes
Data Written: 5064990911 bytes
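(For reference, these per-brick stats are what gluster's profiling emits; presumably gathered with something like the following, <volname> being a placeholder:)

    gluster volume profile <volname> start
    # ... let it run over the period of interest ...
    gluster volume profile <volname> info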
I saw this "issue": https://github.com/gluster/glusterfs/issues/275
There was also some related PR I can't track down now which aims to wrap the pthread mutex_* calls with some other library to time locks etc. Having looked at that code it made sense, but it could trivially introduce extra latency of a few microseconds even in the optimal case, and at worst could change locking behaviour (every _lock first does a _trylock and then re-issues _lock), although I haven't fully analysed the code change. One would also need to modify every location where mutexes are used. Alternatively, LD_PRELOAD hacks are useful, but then obtaining the code locations where locks are initialised and used becomes much, much harder. On the other hand, an LD_PRELOAD wrapper would make it quite trivial to analyse whether locks are contended for extended periods (more than a few milliseconds) or held for extended times, and I'm guessing backtrace() from execinfo could help to at least get an idea of which locks are involved.
This looks like the approach used by mutrace (http://0pointer.de/blog/projects/mutrace.html), so I might just end up using that. Ironically enough, if the author of that project added a little bit more lock-graph tracking code he could turn it into a deadlock detection tool as well.
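(If mutrace does turn out to be usable here, the rough shape would be something like the sketch below. mutrace is LD_PRELOAD-based, so the brick process has to be started under it rather than attached to, and the glusterfsd arguments are elided:)

    # mutrace injects libmutrace.so via LD_PRELOAD and prints a per-mutex
    # contention summary when the wrapped process exits, so the brick would
    # likely need to run in the foreground (-N / --no-daemon):
    mutrace /usr/sbin/glusterfsd -N <usual brick arguments>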
I also see from the profile info in your first email that the XATTROP fop has the highest latency. Can you do an strace of the brick processes to see if there is any syscall (setxattr in particular, for the XATTROP fop) that is taking higher than usual times?

I'm familiar with strace. What am I looking for? How will I identify XATTROP fops and their associated system calls in this trace?
XATTROP is a FOP meant for the index xlator. Index serializes this FOP and sends a setxattr to the posix xlator, and in the callback performs some link/unlink operations inside .glusterfs/indices/. So I was wondering whether the bottleneck is in the setxattr syscall itself (i.e. the time taken for the syscall, which can be found in the strace) or in the serialization inside index. The xattr keys will be trusted.afr*, like so:
[pid 517071] fsetxattr(16, "trusted.afr.dirty", "\0\0\0\1\0\0\0\0\0\0\0", 12, 0) = 0 <0.000135>
[pid 517071] unlink("/home/ravi/bricks/brick1/.glusterfs/indices/xattrop/6ea17fe6-9080-4054-ab98-775d37ea143b") = 0 <0.000086>
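(Something along these lines should isolate those calls in a capture; the key and path patterns follow the example above:)

    # fsetxattr on the AFR changelog keys, plus the index link/unlink churn:
    grep -E 'fsetxattr\(.*trusted\.afr|(un)?link[a-z]*\(.*\.glusterfs/indices' /tmp/strace.txt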
Ok. So all fsetxattr calls were sub-1ms (during a period where an ls took 90s±5s).
The unlinks vary wildly: of the 2467 unlink calls, 1729 were sub-1ms.
21 were in [1,5) ms
17 were in [5,10) ms
24 were in [10,20) ms
112 were in [20,50) ms
368 were in [50,100) ms
196 were in [100,200) ms
27 were in [200,300) ms
6 were >= 300 ms, with the longest clocking in at 347ms.
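(Not necessarily how the numbers above were produced, but a minimal awk sketch for bucketing the unlink() latencies out of the strace -T capture, again assuming /tmp/strace.txt:)

    awk '/ unlink\(/ && match($0, /<[0-9.]+>$/) {
        ms = substr($0, RSTART + 1, RLENGTH - 2) * 1000   # elapsed time in ms
        if      (ms <   1) b["sub-1ms"]++
        else if (ms <   5) b["[1,5) ms"]++
        else if (ms <  10) b["[5,10) ms"]++
        else if (ms <  20) b["[10,20) ms"]++
        else if (ms <  50) b["[20,50) ms"]++
        else if (ms < 100) b["[50,100) ms"]++
        else if (ms < 200) b["[100,200) ms"]++
        else if (ms < 300) b["[200,300) ms"]++
        else               b[">= 300 ms"]++
    }
    END { for (k in b) print b[k], "were", k }' /tmp/strace.txt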
This, in my opinion, does not justify ls stalls of upwards of a minute. Currently it's upwards of 3 minutes and counting.
But anyway, you can check if there are any syscalls (not just setxattr) that take a long time on the drives.
Nothing on the drives, at least as far as I can tell. Unless those epoll and select calls possibly relate, and even then, that was at most 7 epoll_wait calls over 10s, with the rest all lower than that, so nothing that in my opinion can justify this level of badness.
I guess you need to check, when the heal does happen, *which* FOP is slow for the `ls -l` operation. Is it readdirp? stat?

For the slow ls from the client, an ongoing self-heal should not affect it, since readdir, stat etc. are all read-only operations that do not take any locks that can compete with heal, and from what you say there is no hardware bottleneck on the brick side to process requests. Maybe a tcp dump between the client and the server can help find whether the slowness is on the server side, by looking at the request and response times of the FOPs in the dump.

Well, it does. I've sampled a few hundred ls's since my email this morning. All of them were sub 0.5s.
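(Should it come to the tcpdump route, presumably something like this on the client side would do. The port range is an assumption, since bricks listen on 49152 and up by default; wireshark's Gluster dissector can then expose per-FOP request/response timing:)

    tcpdump -i any -s 0 -w /tmp/gluster.pcap 'host zoidberg and portrange 49152-49160'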
Ok, so I'm assuming strace -T on ls itself should be revealing. For the moment the traced ls is OK (clocking in at 3.5s right now).
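(The trace below was presumably produced with something along these lines; the invocation is illustrative, with -T appending each call's elapsed time:)

    strace -T -o /tmp/ls.trace ls

There is one solitary fstat() call in the trace (copying only the relevant calls):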
openat(AT_FDCWD, ".", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3 <1.800382>
fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 <0.000019>
getdents64(3, /* 18 entries */, 131072) = 600 <1.346974>
getdents64(3, /* 18 entries */, 131072) = 584 <0.000174>
getdents64(3, /* 18 entries */, 131072) = 584 <0.000239>
getdents64(3, /* 18 entries */, 131072) = 584 <0.000165>
getdents64(3, /* 18 entries */, 131072) = 560 <0.343684>
getdents64(3, /* 18 entries */, 131072) = 568 <0.000272>
getdents64(3, /* 18 entries */, 131072) = 592 <0.000103>
getdents64(3, /* 18 entries */, 131072) = 600 <0.000096>
getdents64(3, /* 18 entries */, 131072) = 584 <0.000166>
getdents64(3, /* 18 entries */, 131072) = 608 <0.000167>
getdents64(3, /* 1 entries */, 131072) = 40 <0.000160>
getdents64(3, /* 0 entries */, 131072) = 0 <0.000060>
close(3) = 0 <0.000011>
In a second trace, the long calls were:
openat(AT_FDCWD, ".", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3 <1.501978>
fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 <2.526342>
For some reason I'm unable to get really, really long ls calls
right at the moment.
But in all cases it seems that the openat call takes a long time,
followed by either a long(ish) fstat or a long(ish) getdents64
system call.
Doing strace against another process that reliably gets tripped up by this (blocking for up to a minute and a half under current conditions), I'm also not seeing anything that makes sense. So I'm starting to think this is more like death by a million small cuts.
I will keep on digging, but for now I have to switch the SHD off ... I'll be creating a separate mount point in a moment and doing a recursive find . -exec stat {} \; -exec sleep 0.01 \; to see if that helps ...
The only difference is that all three SHDs were killed. Almost without a doubt in my mind there has to be some unforeseen interaction. We've been having a few discussions, and a very, very long time ago we effectively actioned a full heal with "find /path/to/fuse-mount -exec stat {} \;", and at other times with "find /path/to/fuse-mount -print0 | xargs -0 -n50 -P5 -- stat" - and if we recall correctly we saw similar behaviour then. At the time we thought it was the FUSE process causing trouble, and it's entirely possible that that is indeed the case, but we are now wondering whether that too wasn't related to the healing that was happening due to the stats (heal on stat). We generally run with cluster.*-self-heal off nowadays and rely purely on the SHD, for performance reasons.
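(For reference, these are the client-side AFR heal options in question; <volname> is a placeholder:)

    gluster volume set <volname> cluster.data-self-heal off
    gluster volume set <volname> cluster.metadata-self-heal off
    gluster volume set <volname> cluster.entry-self-heal off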
We've had complaints of client-side heal (data/metadata/entry) stealing too much I/O bandwidth, which is why we turned the default values to off in AFR.
I've never had an issue with I/O bandwidth (mbit/s) here, but the additional latency (ms) on pretty much all filesystem access was really bad. I suspect this again relates to our usage pattern: we're using it as a general-purpose filesystem, whereas most folks seem to be using it as a backing store for VM disk images.
I'm not sure how to proceed from here ... If I'm really, really lucky, the above process should finish in a few days at best, but it could still take weeks. What worries me greatly is that I have an upcoming rebalance to add another distribute triplet ... and I'm very, very worried that it will have the same potential impact on client performance.
Kind Regards,
Jaco