Re: Diagnosing Intermittent Performance Problems Possibly Caused by Gremlins

Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> · Thu, 05 Feb 2015 16:44:51 +0530



    On 02/03/2015 11:16 AM, Matt wrote:

    
      Hello List,
      

      So I've been frustrated by intermittent performance problems
      throughout January. The problem occurs on a two node setup running
      3.4.5, 16 gigs of RAM with a bunch of local disk. Sometimes for an
      hour, sometimes for weeks at a time (I have extensive graphs in
      OpenNMS) our Gluster boxes will get their CPUs pegged, and in
      vmstat they'll show extremely high numbers of context switches and
      interrupts. Eventually things calm down. During this time, memory
      usage actually drops. Overall usage on the box goes from between
      6-10 gigs to right around 4 gigs, and stays there. That's what
      really puzzles me.
      

      When performance is problematic, sar shows one device, the
        device corresponding to the glusterfsd process using all the CPU,
        doing lots of little reads, sometimes 70k/second, with a very small
        avg rq size, say 10-12. Afraid I don't have any saved output handy,
        but I can try to capture some next time it happens. I have tons
        of information frankly, but am trying to keep this reasonably
        brief.

        
        There are more than a dozen volumes on this two node setup.
          The CPU usage is pretty much entirely confined to one volume,
          a 1.5 TB volume that is just shy of 70% full. It stores
          uploaded files for a web app. What I hate about this app, and
          so am always suspicious of, is that it stores a directory for
          every user in one level, so under the /data directory in the
          volume, there are 450,000 sub directories at this point.
        

        The only real mitigation step that's been taken so far was
          to turn off the self-heal daemon on the volume, as I thought
          maybe crawling that large directory was getting expensive.
          This doesn't seem to have done anything as the problem still
          occurs.
      
      
      At this point I figure one of two broad sorts of things is
        happening: one, we're running into some sort of bug or
        performance problem with gluster that we should either fix,
        perhaps by upgrading, or tune around; or two, some process
        we're running but not aware of is hammering the file system
        and causing problems.
      

      If it's the latter option, can anyone give me any tips on
        figuring out what might be hammering the system? I can use
        volume top to see what a brick is doing, but I can't figure out
        how to tell what clients are doing what.
      

      Apologies for the somewhat broad nature of the question. Any
        input or thoughts would be much appreciated. I can certainly
        provide more info about some things if it would help, but I've
        tried not to write a novel here.
      

      Thanks,
    
    Could you run 'gluster volume profile <volname> start' to enable
    profiling for this volume?

    The next time this issue happens, keep collecting 'gluster volume
    profile <volname> info' outputs. Mail them and let's see what
    is happening.
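
    If it helps, the collection step could be scripted roughly like
    this. It is only a minimal sketch: it assumes profiling has already
    been started with the command above and that the gluster CLI is on
    the PATH. The helper names, the file naming scheme, and the sampling
    interval are all made up for illustration and are not part of
    gluster.

```python
import subprocess
import time
from datetime import datetime
from pathlib import Path


def profile_info_command(volname):
    """Build the CLI call from this mail: gluster volume profile <volname> info."""
    return ["gluster", "volume", "profile", volname, "info"]


def run_command(cmd):
    """Run a command and return its stdout as text (raises on a non-zero exit)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def collect_profiles(volname, out_dir, samples, interval_s,
                     run=run_command, sleep=time.sleep):
    """Save `samples` snapshots of the profile output, `interval_s` seconds apart.

    `run` and `sleep` are injectable (hypothetical hooks) so the loop can be
    exercised without a real gluster installation.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(samples):
        # The index keeps file names unique; the timestamp is for the reader.
        stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
        path = out_dir / f"{volname}-profile-{i:04d}-{stamp}.txt"
        path.write_text(run(profile_info_command(volname)))
        written.append(path)
        if i < samples - 1:
            sleep(interval_s)
    return written
```

    For example, collect_profiles("myvol", "/tmp/gluster-profile",
    samples=60, interval_s=60) would write one file per minute for about
    an hour ("myvol" and the directory are placeholders). The resulting
    files can then be attached to a mail.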

    
    Pranith

    
      -Matt
      

      _______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
    
    