Re: Fuse memleaks, all versions → OOM-killer

Yannick Perret <yannick.perret@xxxxxxxxxxxxx> · Mon, 29 Aug 2016 12:32:59 +0200



    Hello,

      
      back after holidays. I haven't seen any new replies since this last
      mail; I hope I didn't miss any (too many mails to parse…).

      
      BTW it seems that my problem is very similar to this open bug:
      https://bugzilla.redhat.com/show_bug.cgi?id=1369364

      -> memory usage keeps increasing (here for read ops) until all
      memory and swap are used up, with the fuse client.

      
      Regards,

      --

      Y.

      
      On 02/08/2016 at 19:15, Yannick Perret wrote:

    
      To avoid excessive swap usage I disabled swap on this machine
        (swapoff -a).

        Memory usage kept growing.

        After that I started another memory-hungry program (to speed
        things up) and got the OOM-killer.

        
        Here is the syslog:

        [1246854.291996] Out of memory: Kill process 931 (glusterfs)
        score 742 or sacrifice child

        [1246854.292102] Killed process 931 (glusterfs)
        total-vm:3527624kB, anon-rss:3100328kB, file-rss:0kB

        
        Last VSZ/RSS was: 3527624 / 3097096

        
        Here is the rest of the OOM-killer data:

        [1246854.291847] active_anon:600785 inactive_anon:377188
        isolated_anon:0

         active_file:97 inactive_file:137
        isolated_file:0

         unevictable:0 dirty:0 writeback:1 unstable:0

         free:21740 slab_reclaimable:3309 slab_unreclaimable:3728

         mapped:255 shmem:4267 pagetables:3286 bounce:0

         free_cma:0

        [1246854.291851] Node 0 DMA free:15876kB min:264kB low:328kB
        high:396kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB
        isolated(anon):0kB isolated(file):0kB present:15992kB
        managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB
        shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
        kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB
        free_cma:0kB writeback_tmp:0kB pages_scanned:0
        all_unreclaimable? yes

        [1246854.291858] lowmem_reserve[]: 0 2980 3948 3948

        [1246854.291861] Node 0 DMA32 free:54616kB min:50828kB
        low:63532kB high:76240kB active_anon:1940432kB
        inactive_anon:1020924kB active_file:248kB
        inactive_file:260kB unevictable:0kB
        isolated(anon):0kB isolated(file):0kB present:3129280kB
        managed:3054836kB mlocked:0kB dirty:0kB writeback:0kB
        mapped:760kB shmem:14616kB slab_reclaimable:9660kB
        slab_unreclaimable:8244kB kernel_stack:1456kB pagetables:10056kB
        unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
        pages_scanned:803 all_unreclaimable? yes

        [1246854.291865] lowmem_reserve[]: 0 0 967 967

        [1246854.291867] Node 0 Normal free:16468kB min:16488kB
        low:20608kB high:24732kB active_anon:462708kB
        inactive_anon:487828kB active_file:140kB
        inactive_file:288kB unevictable:0kB
        isolated(anon):0kB isolated(file):0kB present:1048576kB
        managed:990356kB mlocked:0kB dirty:0kB writeback:4kB
        mapped:260kB shmem:2452kB slab_reclaimable:3576kB
        slab_unreclaimable:6668kB kernel_stack:560kB pagetables:3088kB
        unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
        pages_scanned:975 all_unreclaimable? yes

        [1246854.291872] lowmem_reserve[]: 0 0 0 0

        [1246854.291874] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 2*32kB (U)
        3*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R)
        3*4096kB (EM) = 15876kB

        [1246854.291882] Node 0 DMA32: 1218*4kB (UEM) 848*8kB (UE)
        621*16kB (UE) 314*32kB (UEM) 189*64kB (UEM) 49*128kB (UEM)
        2*256kB (E) 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 54616kB

        [1246854.291891] Node 0 Normal: 3117*4kB (UE) 0*8kB 0*16kB
        3*32kB (R) 1*64kB (R) 2*128kB (R) 0*256kB 1*512kB (R) 1*1024kB
        (R) 1*2048kB (R) 0*4096kB = 16468kB

        [1246854.291900] Node 0 hugepages_total=0 hugepages_free=0
        hugepages_surp=0 hugepages_size=2048kB

        [1246854.291902] 4533 total pagecache pages

        [1246854.291903] 0 pages in swap cache

        [1246854.291905] Swap cache stats: add 343501, delete 343501,
        find 7730690/7732743

        [1246854.291906] Free swap  = 0kB

        [1246854.291907] Total swap = 0kB

        [1246854.291908] 1048462 pages RAM

        [1246854.291909] 0 pages HighMem/MovableOnly

        [1246854.291909] 14555 pages reserved

        [1246854.291910] 0 pages hwpoisoned
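A quick cross-check of the report above, as a sketch: the global counters on the first lines are in pages, while the per-zone lines are in kB. Assuming 4 kB pages (an assumption, though it matches the smallest "4kB" block size in the zone lists), the two sets of numbers agree. `pages_to_kb` is a hypothetical helper name.

```shell
#!/bin/sh
# Assumption: 4 kB pages, as suggested by the "*4kB" entries in the zone lists.
PAGE_KB=4

# Convert a page count to kB.
pages_to_kb() {
    echo $(( $1 * PAGE_KB ))
}

# Global active_anon (pages) vs. the sum of the per-zone values (kB).
global_active_anon_kb=$(pages_to_kb 600785)
zone_active_anon_kb=$(( 0 + 1940432 + 462708 ))   # DMA + DMA32 + Normal

# Total RAM reported by the kernel ("1048462 pages RAM"), in kB.
total_ram_kb=$(pages_to_kb 1048462)

echo "active_anon: global=${global_active_anon_kb}kB zones=${zone_active_anon_kb}kB"
echo "total RAM: ${total_ram_kb}kB"
```

Under that assumption the machine has 4193848 kB of RAM, so the 3100328 kB anon-rss of the killed glusterfs process is roughly 74% of it.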

        
        Regards,

        --

        Y.

        
        On 02/08/2016 at 17:00, Yannick Perret wrote:

      
        So here are the dumps, gzip'ed.

          
          What I did:

          1. mounting the volume, removing all its content, umounting it

          2. mounting the volume

          3. performing a cp -Rp /usr/* /root/MNT

          4. performing a rm -rf /root/MNT/*

          5. taking a dump (glusterdump.p1.dump)

          6. re-doing 3, 4 and 5 (glusterdump.p2.dump)

          
          VSZ/RSS are respectively:

          - 381896 / 35688 just after mount

          - 644040 / 309240 after 1st cp -Rp

          - 644040 / 310128 after 1st rm -rf

          - 709576 / 310128 after 1st kill -USR1

          - 840648 / 421964 after 2nd cp -Rp

          - 840648 / 422224 after 2nd rm -rf

          
          I created a small script that performs these actions in an
          infinite loop:

          while /bin/true

          do

            cp -Rp /usr/* /root/MNT/

            + get VSZ/RSS of glusterfs process

            rm -rf /root/MNT/*

            + get VSZ/RSS of glusterfs process

          done
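The loop above is pseudocode. One way it could be made runnable is sketched below; this is an assumed reconstruction, not the script actually used. `sample_mem` and `run_cycle` are hypothetical helper names, the sampling reads the Linux-specific /proc/<pid>/status file, and the `pgrep` lookup assumes a single glusterfs client process on the machine.

```shell
#!/bin/sh
# Print "VSZ RSS" (both in kB) for the given PID, read from /proc (Linux only).
sample_mem() {
    awk '/^VmSize:/ { vsz = $2 }
         /^VmRSS:/  { rss = $2 }
         END        { print vsz, rss }' "/proc/$1/status"
}

# One cp/rm cycle against the mount point, sampling after each step.
# SRC and MNT default to the paths used in this thread.
run_cycle() {
    pid=$1
    cp -Rp "${SRC:-/usr}"/* "${MNT:-/root/MNT}"/
    sample_mem "$pid"
    rm -rf "${MNT:-/root/MNT}"/*
    sample_mem "$pid"
}

# Usage (assumption: exactly one glusterfs client process is running):
#   pid=$(pgrep -x glusterfs)
#   while true; do run_cycle "$pid"; done
```

Each cycle prints two "VSZ RSS" lines, one after the copy and one after the removal.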

          
          At this time here are the values so far:

          971720 533988

          1037256 645500

          1037256 645840

          1168328 757348

          1168328 757620

          1299400 869128

          1299400 869328

          1364936 980712

          1364936 980944

          1496008 1092384

          1496008 1092404

          1627080 1203796

          1627080 1203996

          1692616 1315572

          1692616 1315504

          1823688 1426812

          1823688 1427340

          1954760 1538716

          1954760 1538772

          2085832 1647676

          2085832 1647708

          2151368 1750392

          2151368 1750708

          2282440 1853864

          2282440 1853764

          2413512 1952668

          2413512 1952704

          2479048 2056500

          2479048 2056712
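To put a number on the growth, the change in RSS between consecutive samples can be computed from the list above. A minimal sketch, assuming the samples are available as one "VSZ RSS" pair per line (`rss_deltas` is a hypothetical helper name):

```shell
#!/bin/sh
# Read "VSZ RSS" pairs (kB) on stdin and print the RSS difference
# between each sample and the previous one.
rss_deltas() {
    awk 'NF == 2 {
             if (seen) print $2 - prev
             prev = $2; seen = 1
         }'
}

# First five samples from the list above.
rss_deltas <<'EOF'
971720 533988
1037256 645500
1037256 645840
1168328 757348
1168328 757620
EOF
```

On these first samples it prints 111512, 340, 111508 and 272: the RSS jumps by about 111 MB at every other sample (presumably the cp pass, going by the earlier measurements) and barely moves in between.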

          
          So at this point the glusterfs process uses close to 2 GB of
          resident memory, after doing nothing but repeating 'cp
          -Rp /usr/* /root/MNT' + 'rm -rf /root/MNT/*'.

          
          Swap usage is starting to increase a little, and I haven't seen
          memory usage drop at any point so far.

          I can understand that the kernel may not release the removed files
          (after rm -rf) immediately, but the first 'rm' occurred at ~12:00
          today and it is ~17:00 here, so I can't understand why so much
          memory is still in use.

          I would expect the memory to grow during 'cp -Rp', then reduce
          after 'rm', but it stays the same. Even if it stays the same I
          would expect it to not grow more while cp-ing again.

          
          I'm leaving the cp/rm loop running to see what happens. Feel
          free to ask for other data if it may help.

          
          Please note that I'll be on holiday from the end of this week
          for 3 weeks, so I will mostly not be able to run tests
          during that time (the network connection is too poor where I'm going).

          
          Regards,

          --

          Y.

          
          On 02/08/2016 at 05:11, Pranith Kumar Karampuri wrote:

        
              On Mon, Aug 1, 2016 at 3:40 PM,
                Yannick Perret <yannick.perret@xxxxxxxxxxxxx>
                wrote:

                
                      On 29/07/2016 at 18:39, Pranith Kumar
                        Karampuri wrote:

                      
                          On
                              Fri, Jul 29, 2016 at 2:26 PM, Yannick
                              Perret <yannick.perret@xxxxxxxxxxxxx>
                              wrote:

                            
                            Ok, last try:

                                after investigating more versions I
                                found that FUSE client leaks memory on
                                all of them.

                                I tested:

                                - 3.6.7 client on debian 7 32bit and on
                                debian 8 64bit (with 3.6.7 servers on
                                debian 8 64bit)

                                - 3.6.9 client on debian 7 32bit and on
                                debian 8 64bit (with 3.6.7 servers on
                                debian 8 64bit)

                                - 3.7.13 client on
                                debian 8 64bit (with 3.8.1 servers on
                                debian 8 64bit)

                                - 3.8.1 client on debian 8 64bit (with
                                3.8.1 servers on debian 8 64bit)

                                In all cases compiled from sources,
                                apart from 3.8.1 where .deb packages were used
                                (due to a configure runtime error).

                                For 3.7 it was compiled with
                                --disable-tiering. I also tried to
                                compile with --disable-fusermount (no
                                change).

                                
                                In all of these cases the memory
                                (resident & virtual) of the glusterfs
                                process on the client grows with each activity
                                and never reaches a maximum (and never
                                shrinks).

                                "Activity" for these tests is cp -Rp and
                                ls -lR.

                                The client I let grow the longest
                                exceeded ~4 GB of RAM. On smaller
                                machines it ends with the OOM killer killing
                                the glusterfs process or glusterfs dying due
                                to an allocation error.

                                
                                In 3.6 memory seems to grow continuously,
                                whereas in 3.8.1 it grows in "steps"
                                (430400 kB → 629144 (~1min) → 762324
                                (~1min) → 827860…).

                                
                                All tests were performed on a single test
                                volume used only by my test client.
                                The volume is a basic x2 replica. The only
                                parameters I changed on this volume
                                (without any effect) are
                                diagnostics.client-log-level, set to
                                ERROR, and network.inode-lru-limit, set to
                                1024.

                              
                              Could you attach statedumps of your
                                runs?

                                The following link has steps to capture
                                this (https://gluster.readthedocs.io/en/latest/Troubleshooting/statedump/).
                                We basically need to see which
                                memory types are increasing. If you
                                could help find the issue, we can send
                                the fixes for your workload. There is a
                                3.8.2 release in around 10 days I think.
                                We can probably target this issue for
                                that?
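Finding which memory types are increasing amounts to comparing the per-type sizes in two statedumps. The sketch below assumes the memory-accounting sections of a dump look like a `[<xlator> - usage-type <type> memusage]` header followed by `size=<bytes>` and `num_allocs=<count>` lines; that layout is an assumption to check against the actual dump files, and `dump_growth` is a hypothetical helper name.

```shell
#!/bin/sh
# Assumed statedump layout (verify against your own dump files):
#   [<xlator> - usage-type <type> memusage]
#   size=<bytes>
#   num_allocs=<count>

# Compare two dumps; print "<growth in bytes><TAB><section header>" for every
# memusage section whose size increased, largest growth first.
dump_growth() {
    awk -F= '
        FNR == 1 { section = "" }          # reset state at the start of each file
        /^\[/    { section = ($0 ~ /usage-type.*memusage\]$/) ? $0 : ""; next }
        section != "" && $1 == "size" {
            if (FILENAME == ARGV[1])
                old[section] = $2          # first file: remember the old size
            else if ($2 - old[section] > 0)
                printf "%.0f\t%s\n", $2 - old[section], section
        }
    ' "$1" "$2" | sort -rn
}
```

For example, `dump_growth glusterdump.n1.dump.1470042769 glusterdump.n2.dump.1470043929` would list the sections that grew between the first two dumps, if the files follow the assumed layout.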

                              
                     Here are statedumps.

                      Steps:

                      1. mount -t glusterfs ldap1.my.domain:SHARE
                      /root/MNT/ (here VSZ and RSS are 381896 35828)

                      2. take a dump with kill -USR1
                      <pid-of-glusterfs-process> (file
                      glusterdump.n1.dump.1470042769)

                      3. perform a 'ls -lR /root/MNT | wc -l' (btw
                      result of wc -l is 518396 :)) and a 'cp -Rp /usr/*
                      /root/MNT/boo' (VSZ/RSS are 1301536/711992 at end
                      of these operations)

                      4. take a dump with kill -USR1
                      <pid-of-glusterfs-process> (file
                      glusterdump.n2.dump.1470043929)

                      5. do 'cp -Rp * /root/MNT/toto/', i.e. into another
                      directory (VSZ/RSS are 1432608/909968 at end of
                      this operation)

                      6. take a dump with kill -USR1
                      <pid-of-glusterfs-process> (file
                      glusterdump.n3.dump.)

                    
                Hey,

                
                      Thanks a lot for providing this information.
                  Looking at these steps, I don't see anything wrong with
                  the increase in memory. Both the ls -lR and cp -Rp
                  commands you ran in step 3 add new inodes in
                  memory, which increases memory usage. As
                  long as the kernel thinks these inodes need to be in
                  memory, gluster keeps them in memory. Once the kernel
                  no longer thinks an inode is necessary, it sends
                  'inode-forgets'. At this point the memory starts
                  shrinking. So it kind of depends on the memory pressure
                  the kernel is under. But you said it led to OOM kills
                  on smaller machines, which means there could be some
                  leaks. Could you modify the steps as follows to
                  confirm whether there are leaks? Please run this test on
                  those smaller machines which led to OOM kills.

                
                  Steps:

                    1. mount -t glusterfs ldap1.my.domain:SHARE
                    /root/MNT/ (here VSZ and RSS are 381896 35828)

                    2. perform a 'ls -lR /root/MNT | wc -l' (btw result
                    of wc -l is 518396 :)) and a 'cp -Rp /usr/*
                    /root/MNT/boo' (VSZ/RSS are 1301536/711992 at end of
                    these operations)

                    3. do 'cp -Rp * /root/MNT/toto/', i.e. into another
                    directory (VSZ/RSS are 1432608/909968 at end of this
                    operation)

                  
                 4. Delete all the files and
                    directories you created in steps 2, 3 above

                  
                5. Take statedump with kill -USR1
                    <pid-of-glusterfs-process>

                  
                6. Repeat steps from 2-5

                    
                Attach these two statedumps. I think
                    the statedumps will be even more effective if the
                    mount does not have any data when you start the
                    experiment.

                  
                HTH

                   
                     Dump files are gzip'ed because they are very
                    large.

                    Dump files are here (too big for email):

                    http://wikisend.com/download/623430/glusterdump.n1.dump.1470042769.gz

                    http://wikisend.com/download/771220/glusterdump.n2.dump.1470043929.gz

                    http://wikisend.com/download/428752/glusterdump.n3.dump.1470045181.gz

                    (I'm keeping the files in case someone wants them in another
                    format)

                      
                      Client and servers are installed from .deb files
                      (glusterfs-client_3.8.1-1_amd64.deb and
                      glusterfs-common_3.8.1-1_amd64.deb on client
                      side).

                      They are all Debian 8 64bit. Servers are test
                      machines that serve only one volume to this sole
                      client. The volume is a simple x2 replica. For these
                      tests I only changed the network.inode-lru-limit value to
                      1024. The mount point /root/MNT is only used for these
                      tests.

                      
                      --

                      Y.

                      
              -- 

              
                Pranith

                
        _______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
      
      