----- Original Message -----
> From: "Ravishankar N" <ravishankar@xxxxxxxxxx>
> To: "Dan Ragle" <daniel@xxxxxxxxxxxxxx>, gluster-users@xxxxxxxxxxx
> Cc: "Csaba Henk" <chenk@xxxxxxxxxx>, "Niels de Vos" <ndevos@xxxxxxxxxx>, "Nithya Balachandran" <nbalacha@xxxxxxxxxx>, "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> Sent: Saturday, January 27, 2018 10:23:38 AM
> Subject: Re: Run away memory with gluster mount
>
> On 01/27/2018 02:29 AM, Dan Ragle wrote:
> >
> > On 1/25/2018 8:21 PM, Ravishankar N wrote:
> >>
> >> On 01/25/2018 11:04 PM, Dan Ragle wrote:
> >>> *sigh* trying again to correct formatting ... apologize for the earlier mess.
> >>>
> >>> Having a memory issue with Gluster 3.12.4 and not sure how to troubleshoot. I don't *think* this is expected behavior.
> >>>
> >>> This is on an updated CentOS 7 box. The setup is a simple two-node replicated layout where the two nodes act as both server and client.
> >>>
> >>> The volume in question:
> >>>
> >>> Volume Name: GlusterWWW
> >>> Type: Replicate
> >>> Volume ID: 8e9b0e79-f309-4d9b-a5bb-45d065faaaa3
> >>> Status: Started
> >>> Snapshot Count: 0
> >>> Number of Bricks: 1 x 2 = 2
> >>> Transport-type: tcp
> >>> Bricks:
> >>> Brick1: vs1dlan.mydomain.com:/glusterfs_bricks/brick1/www
> >>> Brick2: vs2dlan.mydomain.com:/glusterfs_bricks/brick1/www
> >>> Options Reconfigured:
> >>> nfs.disable: on
> >>> cluster.favorite-child-policy: mtime
> >>> transport.address-family: inet
> >>>
> >>> I had some other performance options in there (increased cache-size, md invalidation, etc.) but stripped them out in an attempt to isolate the issue. Still got the problem without them.
> >>>
> >>> The volume currently contains over 1M files.
> >>>
> >>> When mounting the volume, I get (among other things) a process as such:
> >>>
> >>> /usr/sbin/glusterfs --volfile-server=localhost --volfile-id=/GlusterWWW /var/www
> >>>
> >>> This process begins with little memory, but then as files are accessed in the volume the memory increases. I set up a script that simply reads the files in the volume one at a time (no writes). It's been running on and off for about 12 hours now and the resident memory of the above process is already at 7.5G and continues to grow slowly. If I stop the test script the memory stops growing, but does not reduce. Restart the test script and the memory begins slowly growing again.
> >>>
> >>> This is obviously a contrived app environment. With my intended application load it takes about a week or so for the memory to get high enough to invoke the OOM killer.
> >>
> >> Can you try debugging with the statedump (https://gluster.readthedocs.io/en/latest/Troubleshooting/statedump/#read-a-statedump) of the fuse mount process and see which member is leaking? Take the statedumps in succession, maybe once initially during the I/O and once when the memory gets high enough to hit the OOM mark. Share the dumps here.
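> >>
> >> For reference, a dump of the fuse client is triggered by sending SIGUSR1 to the glusterfs mount process; with the default state-dump location it lands under /var/run/gluster. Something along these lines should do it (the pid and timestamp below are only illustrative):
> >>
> >> # pgrep -f 'volfile-id=/GlusterWWW'
> >> 1234
> >> # kill -USR1 1234
> >> # ls /var/run/gluster/
> >> glusterdump.1234.dump.1516975411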
> >>
> >> Regards,
> >> Ravi
> >
> > Thanks for the reply. I noticed yesterday that an update (3.12.5) had been posted, so I went ahead and updated and repeated the test overnight. The memory usage does not appear to be growing as quickly as it was with 3.12.4, but it does still appear to be growing.
> >
> > I should also mention that there is another process beyond my test app that is reading the files from the volume. Specifically, there is an rsync that runs from the second node 2-4 times an hour and reads from the GlusterWWW volume mounted on node 1. Since none of the files in that mount are changing it doesn't actually rsync anything, but nonetheless it is running and reading the files in addition to my test script. (It's a part of my intended production setup that I forgot was still running.)
> >
> > The mount process appears to be gaining memory at a rate of about 1GB every 4 hours or so. At that rate it'll take several days before it runs the box out of memory. But I took your suggestion and made some statedumps today anyway, about 2 hours apart, 4 total so far. It looks like there may already be some actionable information. These are the only registers where the num_allocs have grown with each of the four samples:
> >
> > [mount/fuse.fuse - usage-type gf_fuse_mt_gids_t memusage]
> > ---> num_allocs at Fri Jan 26 08:57:31 2018: 784
> > ---> num_allocs at Fri Jan 26 10:55:50 2018: 831
> > ---> num_allocs at Fri Jan 26 12:55:15 2018: 877
> > ---> num_allocs at Fri Jan 26 14:58:27 2018: 908
> >
> > [mount/fuse.fuse - usage-type gf_common_mt_fd_lk_ctx_t memusage]
> > ---> num_allocs at Fri Jan 26 08:57:31 2018: 5
> > ---> num_allocs at Fri Jan 26 10:55:50 2018: 10
> > ---> num_allocs at Fri Jan 26 12:55:15 2018: 15
> > ---> num_allocs at Fri Jan 26 14:58:27 2018: 17
> >
> > [cluster/distribute.GlusterWWW-dht - usage-type gf_dht_mt_dht_layout_t memusage]
> > ---> num_allocs at Fri Jan 26 08:57:31 2018: 24243596
> > ---> num_allocs at Fri Jan 26 10:55:50 2018: 27902622
> > ---> num_allocs at Fri Jan 26 12:55:15 2018: 30678066
> > ---> num_allocs at Fri Jan 26 14:58:27 2018: 33801036
> >
> > Not sure the best way to get you the full dumps. They're pretty big, over 1G for all four. Also, I noticed some filepath information in there that I'd rather not share. What's the recommended next step?

Please run the following queries on the statedump files and report the results:

# grep itable <client-statedump> | grep active | wc -l
# grep itable <client-statedump> | grep active_size
# grep itable <client-statedump> | grep lru | wc -l
# grep itable <client-statedump> | grep lru_size
# grep itable <client-statedump> | grep purge | wc -l
# grep itable <client-statedump> | grep purge_size
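If it's easier, a small loop along these lines will run all six queries against each dump in one go (adjust the glob to wherever your dumps were written):

for f in /var/run/gluster/glusterdump.*.dump.*; do
    echo "=== $f ==="
    grep itable "$f" | grep active | wc -l
    grep itable "$f" | grep active_size
    grep itable "$f" | grep lru | wc -l
    grep itable "$f" | grep lru_size
    grep itable "$f" | grep purge | wc -l
    grep itable "$f" | grep purge_size
done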
> I've CC'd the fuse/dht devs to see if these data types have potential leaks. Could you raise a bug with the volume info and a (dropbox?) link from which we can download the dumps? You can remove/replace the filepaths from them.
>
> Regards,
> Ravi
>
> > Cheers!
> >
> > Dan
> >
> >>> Is there potentially something misconfigured here?
> >>>
> >>> I did see a reference to a memory leak in another thread in this list, but that had to do with the setting of quotas; I don't have any quotas set on my system.
> >>>
> >>> Thanks,
> >>>
> >>> Dan Ragle
> >>> daniel@xxxxxxxxxxxxxx

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users