Re: possible memory leak in client/fuse mount

Olaf Buitelaar <olaf.buitelaar@xxxxxxxxx> · Thu, 26 Nov 2020 11:30:04 +0100

Hi Ravi,
I could try that, but I can only build a test setup on VMs, and I won't be able to set up an environment like our production environment,
which runs on physical machines, carries actual production load, etc. So the two setups would be quite different.
Personally I think it would be best to debug the actual machines instead of trying to reproduce it, since reproducing the issue on the physical machines is just a matter of swapping the repositories and upgrading the packages.
Let me know what you think.

Thanks Olaf

On Thu, 26 Nov 2020 at 02:43, Ravishankar N <ravishankar@xxxxxxxxxx> wrote:

    On 25/11/20 7:17 pm, Olaf Buitelaar wrote:

      Hi Ravi,

        Thanks for checking. Unfortunately this is our production
          system. What I've done is simply change the yum repo from
          gluster-6 to http://mirror.centos.org/centos/$releasever/storage/$basearch/gluster-7/
          and run a yum upgrade. I restarted the glusterd process
          several times and also tried rebooting the machine. I
          haven't touched the op-version yet, which is still at 60000;
          usually I only bump it when all nodes are upgraded and
          running stable.
        We're running multiple volumes with different
          configurations, but the shd doesn't start for any of the
          volumes on the upgraded nodes.
        Is there anything further I could check or do to get to the
          bottom of this?

    Hi Olaf, like I said, would it be possible to create a test setup
      to see if you can recreate it?

    Regards,

    Ravi

        Thanks Olaf

        On Wed, 25 Nov 2020 at 14:14, Ravishankar N <ravishankar@xxxxxxxxxx> wrote:

            On 25/11/20 5:50 pm, Olaf Buitelaar wrote:

              Hi Ashish,

                Thank you for looking into this. I also suspect it
                  has something to do with the 7.X client, because
                  the issue doesn't really seem to occur on the 6.X
                  clients.
                I would love to update everything to 7.X, but since
                  the self-heal daemons (https://lists.gluster.org/pipermail/gluster-users/2020-November/038917.html)
                  won't start, I halted the full upgrade.

            Olaf, based on your email, I tried upgrading one node
              of a 3-node replica 3 setup from 6.10 to 7.8 on my test
              VMs, and I found that the self-heal daemon (and the bricks)
              came online after I restarted glusterd post-upgrade on
              that node. (I did not touch the op-version.) I did not
              spend further time on it. So I don't think the problem is
              related to the shd mux changes I referred to. But if you
              have a test setup where you can reproduce this, please
              raise a GitHub issue with the details.

            Thanks,

            Ravi

                Hopefully that issue will be addressed in the
                  upcoming release. Once I have everything running on the
                  same version I'll check whether the issue still occurs
                  and reach out if it does.

                Thanks Olaf

                On Wed, 25 Nov 2020 at 10:42, Ashish Pandey <aspandey@xxxxxxxxxx> wrote:

                      Hi,

                      I checked the statedump and found some very
                        high memory allocations.

                      grep -rwn "num_allocs" glusterdump.17317.dump.1605* | cut -d'=' -f2 | sort

                        30003616
                        30003616
                        3305
                        3305
                        36960008
                        36960008
                        38029944
                        38029944
                        38450472
                        38450472
                        39566824
                        39566824
                        4

                        I checked the corresponding lines in the statedump, and it
                        could be happening in protocol/client. However, I did
                        not find anything suspicious in my quick code
                        exploration.
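Because plain `sort` orders the values as text, 3305 lands between 30003616 and 36960008 in the listing above, and the `cut` drops the section each count belongs to. A minimal Python sketch that keeps the section name and sorts numerically could look like the following. It assumes the statedump memory-accounting layout of a `[... memusage]` header followed by `key=value` lines; the sample text and its section names are made up for illustration and are not taken from the attached dumps.

```python
import re
from typing import Dict, List, Tuple

# Assumed layout of a statedump memory-accounting section:
#   [<xlator> - usage-type <type> memusage]
#   size=...
#   num_allocs=...
SECTION_RE = re.compile(r"^\[(?P<name>.+ memusage)\]$")


def parse_num_allocs(dump_text: str) -> Dict[str, int]:
    """Map each memusage section header to its num_allocs value."""
    result: Dict[str, int] = {}
    current = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        match = SECTION_RE.match(line)
        if match:
            current = match.group("name")
        elif line.startswith("["):
            current = None  # some other kind of section; ignore its keys
        elif current and line.startswith("num_allocs="):
            result[current] = int(line.split("=", 1)[1])
    return result


def top_allocators(dump_text: str, limit: int = 5) -> List[Tuple[str, int]]:
    """Return the sections with the most live allocations, largest first."""
    counts = parse_num_allocs(dump_text)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:limit]


# Hypothetical sample; section names and sizes are illustrative only.
SAMPLE = """\
[mount/fuse.fuse - usage-type gf_common_mt_char memusage]
size=120
num_allocs=4
max_size=400
max_num_allocs=9
[protocol/client.docker2-client-0 - usage-type gf_common_mt_memdup memusage]
size=960115712
num_allocs=30003616
[cluster/distribute.docker2-dht - usage-type gf_common_mt_asprintf memusage]
size=52880
num_allocs=3305
"""

if __name__ == "__main__":
    for name, count in top_allocators(SAMPLE):
        print(f"{count:>12}  {name}")
```

Running the same function over two dumps taken an hour apart and comparing the counts per section would show which allocation type is actually growing.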

                      I would suggest upgrading all the nodes to the
                        latest version, then starting your workload and
                        checking whether memory usage is still high.

                      That way it will also be easier to debug this
                        issue.

                      ---

                      Ashish

                      From: "Olaf Buitelaar" <olaf.buitelaar@xxxxxxxxx>
                      To: "gluster-users" <gluster-users@xxxxxxxxxxx>
                      Sent: Thursday, November 19, 2020 10:28:57 PM
                      Subject: possible memory leak in client/fuse mount

                        Dear Gluster Users,

                          I have a glusterfs process which consumes
                            almost all of the machine's memory (~58GB):

                          # ps -faxu|grep 17317

                            root     17317  3.1 88.9 59695516 58479708 ?  Ssl  Oct31 839:36 /usr/sbin/glusterfs --process-name fuse --volfile-server=10.201.0.1 --volfile-server=10.201.0.8:10.201.0.5:10.201.0.6:10.201.0.7:10.201.0.9 --volfile-id=/docker2 /mnt/docker2
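For reference, the VSZ and RSS columns of `ps aux` are reported in KiB, so the resident size above can be converted as in this small Python sketch. The helper names are hypothetical; the sample line is the `ps` output from this message joined onto one line.

```python
def parse_ps_aux_line(line: str) -> dict:
    """Split one `ps aux` row into the fields needed for memory checks."""
    # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    fields = line.split(None, 10)
    return {
        "user": fields[0],
        "pid": int(fields[1]),
        "mem_pct": float(fields[3]),
        "vsz_kib": int(fields[4]),
        "rss_kib": int(fields[5]),
        "command": fields[10],
    }


def kib_to_gib(kib: int) -> float:
    """Convert KiB to GiB (1 GiB = 1024 * 1024 KiB)."""
    return kib / (1024 * 1024)


PS_LINE = (
    "root     17317  3.1 88.9 59695516 58479708 ?  Ssl  Oct31 839:36 "
    "/usr/sbin/glusterfs --process-name fuse --volfile-server=10.201.0.1 "
    "--volfile-id=/docker2 /mnt/docker2"
)

if __name__ == "__main__":
    info = parse_ps_aux_line(PS_LINE)
    print(f"pid {info['pid']}: RSS {kib_to_gib(info['rss_kib']):.1f} GiB, "
          f"{info['mem_pct']}% of RAM")
```

An RSS of 58479708 KiB works out to roughly 55.8 GiB, which matches the 88.9% memory share reported by `ps`.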

                          The gluster version on this machine is
                            7.8, but I'm currently running a mixed
                            cluster of 6.10 and 7.8, because the upgrade
                            is on hold due to the self-heal daemon issue
                            mentioned earlier.

                          The affected volume info looks like;

                          # gluster v info docker2

                          Volume Name: docker2
                          Type: Distributed-Replicate
                          Volume ID: 4e0670a0-3d00-4360-98bd-3da844cedae5
                          Status: Started
                          Snapshot Count: 0
                          Number of Bricks: 3 x (2 + 1) = 9
                          Transport-type: tcp
                          Bricks:
                          Brick1: 10.201.0.5:/data0/gfs/bricks/brick1/docker2
                          Brick2: 10.201.0.9:/data0/gfs/bricks/brick1/docker2
                          Brick3: 10.201.0.3:/data0/gfs/bricks/bricka/docker2 (arbiter)
                          Brick4: 10.201.0.6:/data0/gfs/bricks/brick1/docker2
                          Brick5: 10.201.0.7:/data0/gfs/bricks/brick1/docker2
                          Brick6: 10.201.0.4:/data0/gfs/bricks/bricka/docker2 (arbiter)
                          Brick7: 10.201.0.1:/data0/gfs/bricks/brick1/docker2
                          Brick8: 10.201.0.8:/data0/gfs/bricks/brick1/docker2
                          Brick9: 10.201.0.2:/data0/gfs/bricks/bricka/docker2 (arbiter)
                          Options Reconfigured:
                          performance.cache-size: 128MB
                          transport.address-family: inet
                          nfs.disable: on
                          cluster.brick-multiplex: on

                          The issue seems to be triggered by a
                            program called zammad, which has an init
                            process that runs in a loop. On each cycle it
                            recompiles the Ruby on Rails application.

                          I've attached 2 statedumps, but as I only
                            recently noticed the high memory usage, I
                            believe both statedumps already show an
                            escalated state of the glusterfs process. If
                            dumps from the beginning are also needed,
                            let me know. The dumps were taken
                            about an hour apart.
                          I've also included the glusterd.log. I
                            couldn't include mnt-docker2.log because it's
                            too large; it's littered with: " I
                            [MSGID: 109066]
                            [dht-rename.c:1951:dht_rename]
                            0-docker2-dht"
                          However, I've inspected the log and it
                            contains no Error messages; all are of the
                            Info kind,
                          and look like these:
                          [2020-11-19 03:29:05.406766] I [glusterfsd-mgmt.c:2282:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
                          [2020-11-19 03:29:21.271886] I [socket.c:865:__socket_shutdown] 0-docker2-client-8: intentional socket shutdown(5)
                          [2020-11-19 03:29:24.479738] I [socket.c:865:__socket_shutdown] 0-docker2-client-2: intentional socket shutdown(5)
                          [2020-11-19 03:30:12.318146] I [socket.c:865:__socket_shutdown] 0-docker2-client-5: intentional socket shutdown(5)
                          [2020-11-19 03:31:27.381720] I [socket.c:865:__socket_shutdown] 0-docker2-client-8: intentional socket shutdown(5)
                          [2020-11-19 03:31:30.579630] I [socket.c:865:__socket_shutdown] 0-docker2-client-2: intentional socket shutdown(5)
                          [2020-11-19 03:32:18.427364] I [socket.c:865:__socket_shutdown] 0-docker2-client-5: intentional socket shutdown(5)

                          The rename messages look like these;
                          [2020-11-19 03:29:05.402663] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/95/75f93c20e375c5.tmp.eVcE5D (fe083b7e-b0d5-485c-8666-e1f7cdac33e2) (hash=docker2-replicate-2/cache=docker2-replicate-2) => /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/95/75f93c20e375c5 ((null)) (hash=docker2-replicate-2/cache=<nul>)
                          [2020-11-19 03:29:05.410972] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/0d/86dd25f3d238ff.tmp.AdDTLu (b1edadad-1d48-4bf4-be85-ffbe9d69d338) (hash=docker2-replicate-1/cache=docker2-replicate-1) => /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/0d/86dd25f3d238ff ((null)) (hash=docker2-replicate-2/cache=<nul>)
                          [2020-11-19 03:29:05.420064] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/f2/6e44f76b508fd3.tmp.QKmxul (31f80fcb-977c-433b-9259-5fdfcad1171c) (hash=docker2-replicate-0/cache=docker2-replicate-0) => /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/f2/6e44f76b508fd3 ((null)) (hash=docker2-replicate-0/cache=<nul>)
                          [2020-11-19 03:29:05.427537] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/b0/1d7303d9dfe009.tmp.qLUMec (e2fdf971-731f-4765-80e8-3165433488ea) (hash=docker2-replicate-2/cache=docker2-replicate-2) => /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/b0/1d7303d9dfe009 ((null)) (hash=docker2-replicate-1/cache=<nul>)
                          [2020-11-19 03:29:05.440576] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/bd/952a089e164b36.tmp.4qvl22 (3e0bc6d1-13ac-47c6-b221-1256b4b506ef) (hash=docker2-replicate-2/cache=docker2-replicate-2) => /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/bd/952a089e164b36 ((null)) (hash=docker2-replicate-1/cache=<nul>)
                          [2020-11-19 03:29:05.452407] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/a3/b587dd08f35e2e.tmp.iIweTT (9685b5f3-4b14-4050-9b00-1163856239b5) (hash=docker2-replicate-1/cache=docker2-replicate-1) => /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/a3/b587dd08f35e2e ((null)) (hash=docker2-replicate-0/cache=<nul>)
                          [2020-11-19 03:29:05.460720] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/48/89cfb1b971c025.tmp.0W7jMK (d0a8d0a4-c783-45db-bb4a-68e24044d830) (hash=docker2-replicate-0/cache=docker2-replicate-0) => /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/48/89cfb1b971c025 ((null)) (hash=docker2-replicate-1/cache=<nul>)
                          [2020-11-19 03:29:05.468800] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/d9/759d55e8da66eb.tmp.2yXtHB (e5b61ef5-a3c2-4a2c-aa47-c377a6c090d7) (hash=docker2-replicate-0/cache=docker2-replicate-0) => /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/d9/759d55e8da66eb ((null)) (hash=docker2-replicate-0/cache=<nul>)
                          [2020-11-19 03:29:05.476745] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/1c/f3a658342e36b7.tmp.gSkiEs (17181a40-f9b2-438f-9dfc-7bb159c516e6) (hash=docker2-replicate-2/cache=docker2-replicate-2) => /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/1c/f3a658342e36b7 ((null)) (hash=docker2-replicate-0/cache=<nul>)
                          [2020-11-19 03:29:05.486729] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/f1/6bef7cb6446c7a.tmp.sVT0Dj (cb6b1d52-b1c0-420c-86b7-2ceb8e8e73db) (hash=docker2-replicate-0/cache=docker2-replicate-0) => /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/f1/6bef7cb6446c7a ((null)) (hash=docker2-replicate-1/cache=<nul>)
                          [2020-11-19 03:29:05.495115] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/45/73ba226559961b.tmp.QdPTFa (d8450d9e-62a7-4fd5-9dd2-e072e318d9a0) (hash=docker2-replicate-0/cache=docker2-replicate-0) => /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/45/73ba226559961b ((null)) (hash=docker2-replicate-1/cache=<nul>)
                          [2020-11-19 03:29:05.503424] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/13/29c0df35961ca0.tmp.s1xUJ1 (ffc57a77-8b91-4264-8e2d-a9966f0f37ef) (hash=docker2-replicate-1/cache=docker2-replicate-1) => /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/13/29c0df35961ca0 ((null)) (hash=docker2-replicate-2/cache=<nul>)
                          [2020-11-19 03:29:05.513532] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/be/8d6a07b6a0d6ad.tmp.A5DzQS (5a595a65-372d-4377-b547-2c4e23f7be3a) (hash=docker2-replicate-1/cache=docker2-replicate-1) => /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/be/8d6a07b6a0d6ad ((null)) (hash=docker2-replicate-0/cache=<nul>)
                          [2020-11-19 03:29:05.526885] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/ec/4208216d993cbe.tmp.IMXg0J (2fa99fcd-64f8-4934-aeda-b356816f1132) (hash=docker2-replicate-2/cache=docker2-replicate-2) => /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/ec/4208216d993cbe ((null)) (hash=docker2-replicate-2/cache=<nul>)
                          [2020-11-19 03:29:05.537637] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/57/1527c482cf2d6b.tmp.Y2L0cB (db24d7bf-4a06-4356-a52e-1ab9537d1c3a) (hash=docker2-replicate-0/cache=docker2-replicate-0) => /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/57/1527c482cf2d6b ((null)) (hash=docker2-replicate-1/cache=<nul>)
                          [2020-11-19 03:29:05.547878] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/88/1b60ead8d4c4e5.tmp.u47rss (b12f041b-5bbd-4e3d-b700-8f673830393f) (hash=docker2-replicate-1/cache=docker2-replicate-1) => /corporate/zammad/tmp/init/cache/bootsnap-compile-cache/88/1b60ead8d4c4e5 ((null)) (hash=docker2-replicate-1/cache=<nul>)
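To get a feel for how heavy this rename traffic is without attaching the whole mount log, the entries could be summarised with a short Python sketch like the one below. The regular expression is an assumption derived only from the message layout shown in this excerpt, with each entry joined onto a single line; the function names are hypothetical, and the two sample lines are copied from the messages above.

```python
import posixpath
import re
from collections import Counter
from typing import Iterable, Optional, Tuple

# Pattern inferred from the MSGID 109066 entries quoted in this thread.
RENAME_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] I \[MSGID: 109066\] "
    r"\[dht-rename\.c:\d+:dht_rename\] \S+: renaming "
    r"(?P<src>\S+) \((?P<gfid>[0-9a-f-]+)\) "
    r"\(hash=(?P<src_hash>[^/]+)/cache=(?P<src_cache>[^)]+)\) => "
    r"(?P<dst>\S+) \(.*?\) "
    r"\(hash=(?P<dst_hash>[^/]+)/cache=(?P<dst_cache>[^)]+)\)$"
)


def parse_rename(line: str) -> Optional[dict]:
    """Return the named fields of one rename entry, or None if no match."""
    match = RENAME_RE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines: Iterable[str]) -> Tuple[int, int, Counter]:
    """Count renames, renames whose hashed subvolume changes, and per-dir totals."""
    total = 0
    moved = 0
    by_dir: Counter = Counter()
    for line in lines:
        entry = parse_rename(line)
        if entry is None:
            continue
        total += 1
        if entry["src_hash"] != entry["dst_hash"]:
            moved += 1
        by_dir[posixpath.dirname(entry["dst"])] += 1
    return total, moved, by_dir


CACHE = "/corporate/zammad/tmp/init/cache/bootsnap-compile-cache"
SAMPLE_LINES = [
    "[2020-11-19 03:29:05.402663] I [MSGID: 109066] "
    "[dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming "
    f"{CACHE}/95/75f93c20e375c5.tmp.eVcE5D "
    "(fe083b7e-b0d5-485c-8666-e1f7cdac33e2) "
    "(hash=docker2-replicate-2/cache=docker2-replicate-2) => "
    f"{CACHE}/95/75f93c20e375c5 ((null)) "
    "(hash=docker2-replicate-2/cache=<nul>)",
    "[2020-11-19 03:29:05.410972] I [MSGID: 109066] "
    "[dht-rename.c:1951:dht_rename] 0-docker2-dht: renaming "
    f"{CACHE}/0d/86dd25f3d238ff.tmp.AdDTLu "
    "(b1edadad-1d48-4bf4-be85-ffbe9d69d338) "
    "(hash=docker2-replicate-1/cache=docker2-replicate-1) => "
    f"{CACHE}/0d/86dd25f3d238ff ((null)) "
    "(hash=docker2-replicate-2/cache=<nul>)",
]

if __name__ == "__main__":
    total, moved, by_dir = summarize(SAMPLE_LINES)
    print(f"{total} renames, {moved} with a different hashed subvolume")
```

Fed with the full mnt-docker2.log, the totals per time window would show whether the rename rate lines up with the memory growth between the two statedumps.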

                          If I can provide any more information,
                            please let me know.

                          Thanks Olaf

                        ________

                        Community Meeting Calendar:

                        Schedule -

                        Every 2nd and 4th Tuesday at 14:30 IST / 09:00
                        UTC

                        Bridge: https://meet.google.com/cpu-eiue-hvk

                        Gluster-users mailing list

                        Gluster-users@xxxxxxxxxxx

                        https://lists.gluster.org/mailman/listinfo/gluster-users
