Re: Run away memory with gluster mount

Raghavendra Gowdappa <rgowdapp@xxxxxxxxxx> · Mon, 5 Feb 2018 14:44:10 -0500 (EST)

Hi Dan,

I had a suggestion and a question in my previous response. Please let us know whether the suggestion helps, and tell us about your data-set (roughly how many directories and files there are, and how they are organised) so that we can understand the problem better.
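
If it is useful, something like the following could collect those numbers. This is only a rough sketch, not a Gluster tool: the function name summarize_tree is made up for illustration, and the /var/www path in the usage note is simply the mount point from your mount command. Walking the tree through the mount will itself touch every inode, so it is probably best run while the test script is idle.

```python
import os


def summarize_tree(root):
    """Walk `root` and return counts describing how the data-set is laid out."""
    root = os.path.abspath(root)
    summary = {
        "dirs": 0,                 # directories below root (root itself not counted)
        "files": 0,                # non-directory entries
        "max_depth": 0,            # deepest directory level below root
        "largest_dir": root,       # directory holding the most direct entries
        "largest_dir_entries": 0,
    }
    for dirpath, dirnames, filenames in os.walk(root):
        summary["dirs"] += len(dirnames)
        summary["files"] += len(filenames)
        if dirpath == root:
            depth = 0
        else:
            depth = os.path.relpath(dirpath, root).count(os.sep) + 1
        summary["max_depth"] = max(summary["max_depth"], depth)
        entries = len(dirnames) + len(filenames)
        if entries > summary["largest_dir_entries"]:
            summary["largest_dir"] = dirpath
            summary["largest_dir_entries"] = entries
    return summary
```

For example, print(summarize_tree("/var/www")) would show the totals for the mounted volume; the depth and largest-directory figures are the ones I am most interested in.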

<snip>

>         In the meantime, can you remount glusterfs with the options
>         --entry-timeout=0 and --attribute-timeout=0? This will make sure
>         that the kernel won't cache inodes/attributes of the files and
>         should bring down the memory usage.
>
>         I am curious to know what your data-set is like. Is it a case
>         of too many directories and files present in deep directories? I
>         am wondering whether a significant number of the inodes cached by
>         the kernel are there to hold the dentry structure in the kernel.

</snip>
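
To judge whether the remount helps, the itable counters could be pulled out of successive statedumps and compared. A rough sketch, assuming the dumps contain lines of the form xlator.mount.fuse.itable.lru_size=998508 as in your earlier output; the helper names itable_sizes and itable_sizes_from_file are invented here and are not part of any Gluster tooling.

```python
import re

# Matches lines such as "xlator.mount.fuse.itable.lru_size=998508".
_SIZE_RE = re.compile(
    r"^xlator\.mount\.fuse\.itable\.(active|lru|purge)_size=(\d+)$"
)


def itable_sizes(lines):
    """Return e.g. {"active": n, "lru": n, "purge": n} for counters found in `lines`."""
    sizes = {}
    for line in lines:
        match = _SIZE_RE.match(line.strip())
        if match:
            sizes[match.group(1)] = int(match.group(2))
    return sizes


def itable_sizes_from_file(path):
    """Read one statedump file and return its itable counters."""
    with open(path, errors="replace") as dump:
        return itable_sizes(dump)
```

Calling itable_sizes_from_file() on each dump in turn and comparing the "lru" values should show whether the number of cached inodes actually drops after the remount.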

regards,
Raghavendra

----- Original Message -----
> From: "Dan Ragle" <daniel@xxxxxxxxxxxxxx>
> To: "Nithya Balachandran" <nbalacha@xxxxxxxxxx>
> Cc: "gluster-users" <gluster-users@xxxxxxxxxxx>, "Csaba Henk" <chenk@xxxxxxxxxx>
> Sent: Saturday, February 3, 2018 7:28:15 PM
> Subject: Re:  Run away memory with gluster mount
> 
> 
> 
> On 2/2/2018 2:13 AM, Nithya Balachandran wrote:
> > Hi Dan,
> > 
> > It sounds like you might be running into [1]. The patch has been posted
> > upstream and the fix should be in the next release.
> > In the meantime, I'm afraid there is no way to get around this without
> > restarting the process.
> > 
> > Regards,
> > Nithya
> > 
> > [1]https://bugzilla.redhat.com/show_bug.cgi?id=1541264
> > 
> 
> Much appreciated. Will watch for the next release and retest then.
> 
> Cheers!
> 
> Dan
> 
> > 
> > On 2 February 2018 at 02:57, Dan Ragle <daniel@xxxxxxxxxxxxxx> wrote:
> > 
> > 
> > 
> >     On 1/30/2018 6:31 AM, Raghavendra Gowdappa wrote:
> > 
> > 
> > 
> >         ----- Original Message -----
> > 
> >             From: "Dan Ragle" <daniel@xxxxxxxxxxxxxx>
> >             To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>,
> >             "Ravishankar N" <ravishankar@xxxxxxxxxx>
> >             Cc: gluster-users@xxxxxxxxxxx, "Csaba Henk" <chenk@xxxxxxxxxx>,
> >             "Niels de Vos" <ndevos@xxxxxxxxxx>,
> >             "Nithya Balachandran" <nbalacha@xxxxxxxxxx>
> >             Sent: Monday, January 29, 2018 9:02:21 PM
> >             Subject: Re: Run away memory with gluster mount
> > 
> > 
> > 
> >             On 1/29/2018 2:36 AM, Raghavendra Gowdappa wrote:
> > 
> > 
> > 
> >                 ----- Original Message -----
> > 
> >                     From: "Ravishankar N" <ravishankar@xxxxxxxxxx>
> >                     To: "Dan Ragle" <daniel@xxxxxxxxxxxxxx>,
> >                     gluster-users@xxxxxxxxxxx
> >                     Cc: "Csaba Henk" <chenk@xxxxxxxxxx>,
> >                     "Niels de Vos" <ndevos@xxxxxxxxxx>,
> >                     "Nithya Balachandran" <nbalacha@xxxxxxxxxx>,
> >                     "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> >                     Sent: Saturday, January 27, 2018 10:23:38 AM
> >                     Subject: Re: Run away memory with gluster mount
> > 
> > 
> > 
> >                     On 01/27/2018 02:29 AM, Dan Ragle wrote:
> > 
> > 
> >                         On 1/25/2018 8:21 PM, Ravishankar N wrote:
> > 
> > 
> > 
> >                             On 01/25/2018 11:04 PM, Dan Ragle wrote:
> > 
> >                                 *sigh* trying again to correct
> >                                 formatting ... apologize for the
> >                                 earlier mess.
> > 
> >                                 Having a memory issue with Gluster
> >                                 3.12.4 and not sure how to
> >                                 troubleshoot. I don't *think* this is
> >                                 expected behavior.
> > 
> >                                 This is on an updated CentOS 7 box. The
> >                                 setup is a simple two node
> >                                 replicated layout where the two nodes
> >                                 act as both server and
> >                                 client.
> > 
> >                                 The volume in question:
> > 
> >                                 Volume Name: GlusterWWW
> >                                 Type: Replicate
> >                                 Volume ID: 8e9b0e79-f309-4d9b-a5bb-45d065faaaa3
> >                                 Status: Started
> >                                 Snapshot Count: 0
> >                                 Number of Bricks: 1 x 2 = 2
> >                                 Transport-type: tcp
> >                                 Bricks:
> >                                 Brick1: vs1dlan.mydomain.com:/glusterfs_bricks/brick1/www
> >                                 Brick2: vs2dlan.mydomain.com:/glusterfs_bricks/brick1/www
> >                                 Options Reconfigured:
> >                                 nfs.disable: on
> >                                 cluster.favorite-child-policy: mtime
> >                                 transport.address-family: inet
> > 
> >                                 I had some other performance options in
> >                                 there (increased cache-size, md
> >                                 invalidation, etc.) but stripped them out
> >                                 in an attempt to isolate the issue. I still
> >                                 got the problem without them.
> > 
> >                                 The volume currently contains over 1M
> >                                 files.
> > 
> >                                 When mounting the volume, I get (among
> >                                 other things) a process as such:
> > 
> >                                 /usr/sbin/glusterfs --volfile-server=localhost --volfile-id=/GlusterWWW /var/www
> > 
> >                                 This process begins with little memory,
> >                                 but then as files are
> >                                 accessed in the volume the memory
> >                                 increases. I set up a script that
> >                                 simply reads the files in the volume one
> >                                 at a time (no writes). It's
> >                                 been running on and off about 12 hours
> >                                 now and the resident
> >                                 memory of the above process is already
> >                                 at 7.5G and continues to grow
> >                                 slowly. If I stop the test script the
> >                                 memory stops growing,
> >                                 but does not reduce. Restart the test
> >                                 script and the memory begins
> >                                 slowly growing again.
> > 
> >                                 This is obviously a contrived app
> >                                 environment. With my intended
> >                                 application load it takes about a week
> >                                 or so for the memory to get
> >                                 high enough to invoke the oom killer.
> > 
> > 
> >                             Can you try debugging with the statedump
> >                             (https://gluster.readthedocs.io/en/latest/Troubleshooting/statedump/#read-a-statedump)
> >                             of the fuse mount process and see what member
> >                             is leaking? Take the statedumps in succession,
> >                             maybe once initially during the I/O and once
> >                             the memory gets high enough to hit the OOM mark.
> >                             Share the dumps here.
> > 
> >                             Regards,
> >                             Ravi
> > 
> > 
> >                         Thanks for the reply. I noticed yesterday that
> >                         an update (3.12.5) had
> >                         been posted so I went ahead and updated and
> >                         repeated the test
> >                         overnight. The memory usage does not appear to
> >                         be growing as quickly
> >                         as it was with 3.12.4, but does still appear to
> >                         be growing.
> > 
> >                         I should also mention that there is another
> >                         process beyond my test app
> >                         that is reading the files from the volume.
> >                         Specifically, there is an
> >                         rsync that runs from the second node 2-4 times
> >                         an hour that reads from
> >                         the GlusterWWW volume mounted on node 1. Since
> >                         none of the files in
> >                         that mount are changing it doesn't actually
> >                         rsync anything, but
> >                         nonetheless it is running and reading the files
> >                         in addition to my test
> >                         script. (It's a part of my intended production
> >                         setup that I forgot was
> >                         still running.)
> > 
> >                         The mount process appears to be gaining memory
> >                         at a rate of about 1GB
> >                         every 4 hours or so. At that rate it'll take
> >                         several days before it
> >                         runs the box out of memory. But I took your
> >                         suggestion and made some
> >                         statedumps today anyway, about 2 hours apart, 4
> >                         total so far. It looks
> >                         like there may already be some actionable
> >                         information. These are the
> >                         only registers where the num_allocs have grown
> >                         with each of the four
> >                         samples:
> > 
> >                         [mount/fuse.fuse - usage-type gf_fuse_mt_gids_t memusage]
> >                             ---> num_allocs at Fri Jan 26 08:57:31 2018: 784
> >                             ---> num_allocs at Fri Jan 26 10:55:50 2018: 831
> >                             ---> num_allocs at Fri Jan 26 12:55:15 2018: 877
> >                             ---> num_allocs at Fri Jan 26 14:58:27 2018: 908
> > 
> >                         [mount/fuse.fuse - usage-type gf_common_mt_fd_lk_ctx_t memusage]
> >                             ---> num_allocs at Fri Jan 26 08:57:31 2018: 5
> >                             ---> num_allocs at Fri Jan 26 10:55:50 2018: 10
> >                             ---> num_allocs at Fri Jan 26 12:55:15 2018: 15
> >                             ---> num_allocs at Fri Jan 26 14:58:27 2018: 17
> > 
> >                         [cluster/distribute.GlusterWWW-dht - usage-type gf_dht_mt_dht_layout_t memusage]
> >                             ---> num_allocs at Fri Jan 26 08:57:31 2018: 24243596
> >                             ---> num_allocs at Fri Jan 26 10:55:50 2018: 27902622
> >                             ---> num_allocs at Fri Jan 26 12:55:15 2018: 30678066
> >                             ---> num_allocs at Fri Jan 26 14:58:27 2018: 33801036
> > 
> >                         Not sure the best way to get you the full dumps.
> >                         They're pretty big,
> >                         over 1G for all four. Also, I noticed some
> >                         filepath information in
> >                         there that I'd rather not share. What's the
> >                         recommended next step?
> > 
> > 
> >                 Please run the following queries on the statedump files and
> >                 report the results to us:
> >                 # grep itable <client-statedump> | grep active | wc -l
> >                 # grep itable <client-statedump> | grep active_size
> >                 # grep itable <client-statedump> | grep lru | wc -l
> >                 # grep itable <client-statedump> | grep lru_size
> >                 # grep itable <client-statedump> | grep purge | wc -l
> >                 # grep itable <client-statedump> | grep purge_size
> > 
> > 
> >             Had to restart the test and have been running for 36 hours
> >             now. RSS is
> >             currently up to 23g.
> > 
> >             Working on getting a bug report with a link to the dumps. In
> >             the meantime, I'm including the results of your above queries
> >             for the first dump, the 18 hour dump, and the 36 hour dump:
> > 
> >             # grep itable glusterdump.153904.dump.1517104561 | grep active | wc -l
> >             53865
> >             # grep itable glusterdump.153904.dump.1517169361 | grep active | wc -l
> >             53864
> >             # grep itable glusterdump.153904.dump.1517234161 | grep active | wc -l
> >             53864
> > 
> >             # grep itable glusterdump.153904.dump.1517104561 | grep active_size
> >             xlator.mount.fuse.itable.active_size=53864
> >             # grep itable glusterdump.153904.dump.1517169361 | grep active_size
> >             xlator.mount.fuse.itable.active_size=53863
> >             # grep itable glusterdump.153904.dump.1517234161 | grep active_size
> >             xlator.mount.fuse.itable.active_size=53863
> > 
> >             # grep itable glusterdump.153904.dump.1517104561 | grep lru | wc -l
> >             998510
> >             # grep itable glusterdump.153904.dump.1517169361 | grep lru | wc -l
> >             998510
> >             # grep itable glusterdump.153904.dump.1517234161 | grep lru | wc -l
> >             995992
> > 
> >             # grep itable glusterdump.153904.dump.1517104561 | grep lru_size
> >             xlator.mount.fuse.itable.lru_size=998508
> >             # grep itable glusterdump.153904.dump.1517169361 | grep lru_size
> >             xlator.mount.fuse.itable.lru_size=998508
> >             # grep itable glusterdump.153904.dump.1517234161 | grep lru_size
> >             xlator.mount.fuse.itable.lru_size=995990
> > 
> > 
> >         Around 1 million inodes in the lru table!! These are inodes the
> >         kernel has just cached; no operation is currently in progress on
> >         them. This could be the reason for the high memory usage.
> >         We have a patch being worked on (currently merged on the
> >         experimental branch) [1] that will help in these scenarios. In the
> >         meantime, can you remount glusterfs with the options
> >         --entry-timeout=0 and --attribute-timeout=0? This will make sure
> >         that the kernel won't cache inodes/attributes of the files and
> >         should bring down the memory usage.
> > 
> >         I am curious to know what your data-set is like. Is it a case
> >         of too many directories and files present in deep directories? I
> >         am wondering whether a significant number of the inodes cached by
> >         the kernel are there to hold the dentry structure in the kernel.
> > 
> >         [1] https://review.gluster.org/#/c/18665/
> > 
> > 
> >     OK, remounted with your recommended attributes and repeated the
> >     test. Now the mount process looks like this:
> > 
> >     /usr/sbin/glusterfs --attribute-timeout=0 --entry-timeout=0
> >     --volfile-server=localhost --volfile-id=/GlusterWWW /var/www
> > 
> >     However after running for 36 hours it's again at about 23g (about
> >     the same place it was on the first test).
> > 
> >     A few metrics from the 36 hour mark:
> > 
> >     num_allocs for [cluster/distribute.GlusterWWW-dht - usage-type
> >     gf_dht_mt_dht_layout_t memusage] is 109140094. Seems at least
> >     somewhat similar to the original test, which had 117901593 at the 36
> >     hour mark.
> > 
> >     The dump file at the 36 hour mark had nothing for lru or lru_size.
> >     However, at the dump two hours prior it had:
> > 
> >     # grep itable glusterdump.67299.dump.1517493361 | grep lru | wc -l
> >     998510
> >     # grep itable glusterdump.67299.dump.1517493361 | grep lru_size
> >     xlator.mount.fuse.itable.lru_size=998508
> > 
> >     and the same thing for the dump four hours later. Are these values
> >     only relevant when the ls -R is actually running? I'm thinking the
> >     36 hour dump may have caught the ls -R between runs there (?)
> > 
> >     The data set is multiple Web sites. I know there's some litter there
> >     we can clean up, but I'd guess not more than 200-300k files or so.
> >     The biggest culprit is a single directory that we use as a
> >     multi-purpose file store, with filenames stored as GUIDs and linked
> >     to a DB. That directory currently has 500k+ files. Another directory
> >     serves a similar purpose and has about 66k files in it. The rest is
> >     generally distributed more "normally", i.e., a mixed nesting of
> >     directories and files.
> > 
> >     Cheers!
> > 
> >     Dan
> > 
> > 
> > 
> >             # grep itable glusterdump.153904.dump.1517104561 | grep purge | wc -l
> >             1
> >             # grep itable glusterdump.153904.dump.1517169361 | grep purge | wc -l
> >             1
> >             # grep itable glusterdump.153904.dump.1517234161 | grep purge | wc -l
> >             1
> > 
> >             # grep itable glusterdump.153904.dump.1517104561 | grep purge_size
> >             xlator.mount.fuse.itable.purge_size=0
> >             # grep itable glusterdump.153904.dump.1517169361 | grep purge_size
> >             xlator.mount.fuse.itable.purge_size=0
> >             # grep itable glusterdump.153904.dump.1517234161 | grep purge_size
> >             xlator.mount.fuse.itable.purge_size=0
> > 
> >             Cheers,
> > 
> >             Dan
> > 
> > 
> > 
> >                     I've CC'd the fuse/dht devs to see if these data
> >                     types have potential
> >                     leaks. Could you raise a bug with the volume info
> >                     and a (dropbox?) link
> >                     from which we can download the dumps? You can
> >                     remove/replace the
> >                     filepaths from them.
> > 
> >                     Regards.
> >                     Ravi
> > 
> > 
> >                         Cheers!
> > 
> >                         Dan
> > 
> > 
> >                                 Is there potentially something
> >                                 misconfigured here?
> > 
> >                                 I did see a reference to a memory leak
> >                                 in another thread on this list, but that
> >                                 had to do with the setting of quotas; I
> >                                 don't have any quotas set on my system.
> > 
> >                                 Thanks,
> > 
> >                                 Dan Ragle
> >                                 daniel@xxxxxxxxxxxxxx
> > 
> > 
> > 
> > 
> > 
> >                                     _______________________________________________
> >                                     Gluster-users mailing list
> >                                     Gluster-users@xxxxxxxxxxx
> >                                     <mailto:Gluster-users@xxxxxxxxxxx>
> >                                     http://lists.gluster.org/mailman/listinfo/gluster-users
> >                                     <http://lists.gluster.org/mailman/listinfo/gluster-users>
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users