Re: Run away memory with gluster mount

Nithya Balachandran <nbalacha@xxxxxxxxxx> · Fri, 2 Feb 2018 12:43:56 +0530

Hi Dan,
It sounds like you might be running into [1]. The patch has been posted upstream and the fix should be in the next release.
In the meantime, I'm afraid there is no way to get around this without restarting the process.

Regards,
Nithya

[1]https://bugzilla.redhat.com/show_bug.cgi?id=1541264

On 2 February 2018 at 02:57, Dan Ragle <daniel@xxxxxxxxxxxxxx> wrote:

On 1/30/2018 6:31 AM, Raghavendra Gowdappa wrote:

----- Original Message -----

From: "Dan Ragle" <daniel@xxxxxxxxxxxxxx>

To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "Ravishankar N" <ravishankar@xxxxxxxxxx>

Cc: gluster-users@xxxxxxxxxxx, "Csaba Henk" <chenk@xxxxxxxxxx>, "Niels de Vos" <ndevos@xxxxxxxxxx>, "Nithya Balachandran" <nbalacha@xxxxxxxxxx>

Sent: Monday, January 29, 2018 9:02:21 PM

Subject: Re:  Run away memory with gluster mount

On 1/29/2018 2:36 AM, Raghavendra Gowdappa wrote:

----- Original Message -----

From: "Ravishankar N" <ravishankar@xxxxxxxxxx>

To: "Dan Ragle" <daniel@xxxxxxxxxxxxxx>, gluster-users@xxxxxxxxxxx

Cc: "Csaba Henk" <chenk@xxxxxxxxxx>, "Niels de Vos" <ndevos@xxxxxxxxxx>, "Nithya Balachandran" <nbalacha@xxxxxxxxxx>, "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>

Sent: Saturday, January 27, 2018 10:23:38 AM

Subject: Re:  Run away memory with gluster mount

On 01/27/2018 02:29 AM, Dan Ragle wrote:

On 1/25/2018 8:21 PM, Ravishankar N wrote:

On 01/25/2018 11:04 PM, Dan Ragle wrote:

*sigh* trying again to correct formatting ... apologize for the

earlier mess.

Having a memory issue with Gluster 3.12.4 and not sure how to

troubleshoot. I don't *think* this is expected behavior.

This is on an updated CentOS 7 box. The setup is a simple two node

replicated layout where the two nodes act as both server and

client.

The volume in question:

Volume Name: GlusterWWW

Type: Replicate

Volume ID: 8e9b0e79-f309-4d9b-a5bb-45d065faaaa3

Status: Started

Snapshot Count: 0

Number of Bricks: 1 x 2 = 2

Transport-type: tcp

Bricks:

Brick1: vs1dlan.mydomain.com:/glusterfs_bricks/brick1/www

Brick2: vs2dlan.mydomain.com:/glusterfs_bricks/brick1/www

Options Reconfigured:

nfs.disable: on

cluster.favorite-child-policy: mtime

transport.address-family: inet

I had some other performance options in there, (increased

cache-size, md invalidation, etc) but stripped them out in an

attempt to

isolate the issue. Still got the problem without them.

The volume currently contains over 1M files.

When mounting the volume, I get (among other things) a process as such:

/usr/sbin/glusterfs --volfile-server=localhost --volfile-id=/GlusterWWW /var/www

This process begins with little memory, but then as files are

accessed in the volume the memory increases. I set up a script that

simply reads the files in the volume one at a time (no writes). It's

been running on and off about 12 hours now and the resident

memory of the above process is already at 7.5G and continues to grow

slowly. If I stop the test script the memory stops growing,

but does not reduce. Restart the test script and the memory begins

slowly growing again.

This is obviously a contrived app environment. With my intended

application load it takes about a week or so for the memory to get

high enough to invoke the oom killer.
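
One way to put numbers on the growth described above is to sample the resident set size of the mount process at intervals. The following is a minimal sketch, assuming a Linux host where /proc/<pid>/status exposes a VmRSS line; the function names are illustrative and not part of any Gluster tooling.

```python
import re


def parse_vmrss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vmrss_kib(handle.read())


def growth_kib_per_hour(samples):
    """Average growth rate from (unix_seconds, rss_kib) samples, oldest first."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need samples spanning a positive interval")
    return (r1 - r0) * 3600.0 / (t1 - t0)
```

Calling read_rss_kib() on the PID of the glusterfs mount process every few minutes and feeding the samples to growth_kib_per_hour() gives a rough rate that can be compared between test runs.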

Can you try debugging with the statedump

(https://gluster.readthedocs.io/en/latest/Troubleshooting/statedump/#read-a-statedump)

of the fuse mount process and see what member is leaking? Take the

statedumps in succession, maybe once initially during the I/O and

once the memory gets high enough to hit the OOM mark.

Share the dumps here.

Regards,

Ravi
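
For taking the dumps in succession, the statedump documentation linked above describes triggering a client-side dump by sending SIGUSR1 to the mount process. A minimal sketch of doing that on a schedule follows; the helper name and its parameters are illustrative, and where the dump files land (commonly /var/run/gluster) depends on the installation, so treat that as an assumption.

```python
import os
import signal
import time


def take_statedumps(pid, count, interval_s, send=os.kill, sleep=time.sleep):
    """Send SIGUSR1 to `pid` `count` times, waiting `interval_s` between sends.

    `send` and `sleep` are injectable so the schedule can be exercised
    without signalling a real process.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    for i in range(count):
        send(pid, signal.SIGUSR1)
        if i < count - 1:
            sleep(interval_s)
```

For example, take_statedumps(mount_pid, 4, 2 * 3600) would request four dumps two hours apart, where mount_pid is the PID of the /usr/sbin/glusterfs mount process.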

Thanks for the reply. I noticed yesterday that an update (3.12.5) had

been posted so I went ahead and updated and repeated the test

overnight. The memory usage does not appear to be growing as quickly

as it was with 3.12.4, but does still appear to be growing.

I should also mention that there is another process beyond my test app

that is reading the files from the volume. Specifically, there is an

rsync that runs from the second node 2-4 times an hour that reads from

the GlusterWWW volume mounted on node 1. Since none of the files in

that mount are changing it doesn't actually rsync anything, but

nonetheless it is running and reading the files in addition to my test

script. (It's a part of my intended production setup that I forgot was

still running.)

The mount process appears to be gaining memory at a rate of about 1GB

every 4 hours or so. At that rate it'll take several days before it

runs the box out of memory. But I took your suggestion and made some

statedumps today anyway, about 2 hours apart, 4 total so far. It looks

like there may already be some actionable information. These are the

only registers where the num_allocs have grown with each of the four

samples:

[mount/fuse.fuse - usage-type gf_fuse_mt_gids_t memusage]

   ---> num_allocs at Fri Jan 26 08:57:31 2018: 784

   ---> num_allocs at Fri Jan 26 10:55:50 2018: 831

   ---> num_allocs at Fri Jan 26 12:55:15 2018: 877

   ---> num_allocs at Fri Jan 26 14:58:27 2018: 908

[mount/fuse.fuse - usage-type gf_common_mt_fd_lk_ctx_t memusage]

   ---> num_allocs at Fri Jan 26 08:57:31 2018: 5

   ---> num_allocs at Fri Jan 26 10:55:50 2018: 10

   ---> num_allocs at Fri Jan 26 12:55:15 2018: 15

   ---> num_allocs at Fri Jan 26 14:58:27 2018: 17

[cluster/distribute.GlusterWWW-dht - usage-type gf_dht_mt_dht_layout_t memusage]

   ---> num_allocs at Fri Jan 26 08:57:31 2018: 24243596

   ---> num_allocs at Fri Jan 26 10:55:50 2018: 27902622

   ---> num_allocs at Fri Jan 26 12:55:15 2018: 30678066

   ---> num_allocs at Fri Jan 26 14:58:27 2018: 33801036

Not sure the best way to get you the full dumps. They're pretty big,

over 1G for all four. Also, I noticed some filepath information in

there that I'd rather not share. What's the recommended next step?
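
The comparison above (finding the memusage sections whose num_allocs rose in every sample) can be automated so that only the summary needs to be shared rather than the full dumps. This is a rough sketch, assuming each "[... memusage]" header is followed by a "num_allocs=<n>" line as in the statedump documentation; the function names are made up for illustration.

```python
import re

SECTION_RE = re.compile(r"^\[(.+ memusage)\]$")


def parse_num_allocs(dump_text):
    """Map each '[... memusage]' section header to its num_allocs value."""
    result = {}
    section = None
    for line in dump_text.splitlines():
        line = line.strip()
        header = SECTION_RE.match(line)
        if header:
            section = header.group(1)
        elif line.startswith("["):
            section = None  # a different kind of section begins here
        elif section and line.startswith("num_allocs="):
            result[section] = int(line.split("=", 1)[1])
    return result


def steadily_growing(dumps):
    """Return {section: [counts...]} for sections that grow in every dump.

    `dumps` is a list of statedump texts, oldest first.
    """
    if len(dumps) < 2:
        raise ValueError("need at least two dumps to compare")
    parsed = [parse_num_allocs(text) for text in dumps]
    common = set(parsed[0]).intersection(*parsed[1:])
    growing = {}
    for section in sorted(common):
        counts = [p[section] for p in parsed]
        if all(a < b for a, b in zip(counts, counts[1:])):
            growing[section] = counts
    return growing
```

Because the output contains only section names and counters, it avoids exposing the file paths that appear elsewhere in the dumps.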

Please run the following queries on the statedump files and report the

results to us:

# grep itable <client-statedump> | grep active | wc -l

# grep itable <client-statedump> | grep active_size

# grep itable <client-statedump> | grep lru | wc -l

# grep itable <client-statedump> | grep lru_size

# grep itable <client-statedump> | grep purge | wc -l

# grep itable <client-statedump> | grep purge_size
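
When there are many dumps to check, the same counts can be gathered in one pass. The sketch below mirrors the grep pipelines above (lines containing "itable" and the list name, plus the matching "_size" value); the function name and the returned dictionary layout are illustrative assumptions.

```python
def itable_summary(dump_text):
    """Count itable lines per list and read each list's reported size.

    Equivalent in spirit to:
        grep itable <dump> | grep <name> | wc -l
        grep itable <dump> | grep <name>_size
    """
    lines = [line for line in dump_text.splitlines() if "itable" in line]
    summary = {}
    for name in ("active", "lru", "purge"):
        size = None
        for line in lines:
            if f"itable.{name}_size=" in line:
                size = int(line.split("=", 1)[1])
        entries = sum(1 for line in lines if name in line)
        summary[name] = {"entries": entries, "size": size}
    return summary
```

A size of None means the dump had no "_size" line for that list.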

Had to restart the test and have been running for 36 hours now. RSS is

currently up to 23g.

Working on getting a bug report with link to the dumps. In the meantime,

I'm including the results of your above queries for the first

dump, the 18 hour dump, and the 36 hour dump:

# grep itable glusterdump.153904.dump.1517104561 | grep active | wc -l

53865

# grep itable glusterdump.153904.dump.1517169361 | grep active | wc -l

53864

# grep itable glusterdump.153904.dump.1517234161 | grep active | wc -l

53864

# grep itable glusterdump.153904.dump.1517104561 | grep active_size

xlator.mount.fuse.itable.active_size=53864

# grep itable glusterdump.153904.dump.1517169361 | grep active_size

xlator.mount.fuse.itable.active_size=53863

# grep itable glusterdump.153904.dump.1517234161 | grep active_size

xlator.mount.fuse.itable.active_size=53863

# grep itable glusterdump.153904.dump.1517104561 | grep lru | wc -l

998510

# grep itable glusterdump.153904.dump.1517169361 | grep lru | wc -l

998510

# grep itable glusterdump.153904.dump.1517234161 | grep lru | wc -l

995992

# grep itable glusterdump.153904.dump.1517104561 | grep lru_size

xlator.mount.fuse.itable.lru_size=998508

# grep itable glusterdump.153904.dump.1517169361 | grep lru_size

xlator.mount.fuse.itable.lru_size=998508

# grep itable glusterdump.153904.dump.1517234161 | grep lru_size

xlator.mount.fuse.itable.lru_size=995990

Around 1 million inodes in the lru table!! These are inodes the kernel has merely cached; no operation is currently in progress on them. This could be the reason for the high memory usage. We have a patch being worked on (currently merged on the experimental branch) [1] that will help in these scenarios. In the meantime, can you remount glusterfs with the options --entry-timeout=0 and --attribute-timeout=0? This will make sure the kernel won't cache inodes/attributes of the files and should bring down the memory usage.

I am curious to know what your data set is like. Is it a case of many directories, with files present in deep directories? I am wondering whether a significant number of the inodes cached by the kernel are there to hold the dentry structure in the kernel.

[1] https://review.gluster.org/#/c/18665/
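
To answer the data-set question with figures, a small script can rank directories by the number of files they hold directly. This is only a sketch with a hypothetical function name; note that walking the FUSE mount itself generates the same lookup load as the read test.

```python
import os


def rank_directories(walk, top=10):
    """Return the `top` (file_count, dirpath) pairs, largest first.

    `walk` is any iterable of (dirpath, dirnames, filenames) tuples,
    such as the generator returned by os.walk().
    """
    counts = [(len(filenames), dirpath) for dirpath, _dirs, filenames in walk]
    counts.sort(key=lambda item: (-item[0], item[1]))
    return counts[:top]
```

For example, rank_directories(os.walk("/var/www"), top=5) lists the five directories with the most files.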

OK, remounted with your recommended attributes and repeated the test. Now the mount process looks like this:

/usr/sbin/glusterfs --attribute-timeout=0 --entry-timeout=0 --volfile-server=localhost --volfile-id=/GlusterWWW /var/www

However, after running for 36 hours it's again at about 23g (about the same place it was on the first test).

A few metrics from the 36 hour mark:

num_allocs for [cluster/distribute.GlusterWWW-dht - usage-type gf_dht_mt_dht_layout_t memusage] is 109140094. Seems at least somewhat similar to the original test, which had 117901593 at the 36 hour mark.

The dump file at the 36 hour mark had nothing for lru or lru_size. However, at the dump two hours prior it had:

# grep itable glusterdump.67299.dump.1517493361 | grep lru | wc -l

998510

# grep itable glusterdump.67299.dump.1517493361 | grep lru_size

xlator.mount.fuse.itable.lru_size=998508

and the same thing for the dump four hours later. Are these values only relevant when the ls -R is actually running? I'm thinking the 36 hour dump may have caught the ls -R between runs there (?)

The data set is multiple Web sites. I know there's some litter there we can clean up, but I'd guess not more than 200-300k files or so. The biggest culprit is a single directory that we use as a multi-purpose file store, with filenames stored as GUIDs and linked to a DB. That directory currently has 500k+ files. Another directory serves a similar purpose and has about 66k files in it. The rest is generally distributed more "normally", i.e., a mixed nesting of directories and files.

Cheers!

Dan

# grep itable glusterdump.153904.dump.1517104561 | grep purge | wc -l

1

# grep itable glusterdump.153904.dump.1517169361 | grep purge | wc -l

1

# grep itable glusterdump.153904.dump.1517234161 | grep purge | wc -l

1

# grep itable glusterdump.153904.dump.1517104561 | grep purge_size

xlator.mount.fuse.itable.purge_size=0

# grep itable glusterdump.153904.dump.1517169361 | grep purge_size

xlator.mount.fuse.itable.purge_size=0

# grep itable glusterdump.153904.dump.1517234161 | grep purge_size

xlator.mount.fuse.itable.purge_size=0

Cheers,

Dan

I've CC'd the fuse/ dht devs to see if these data types have potential

leaks. Could you raise a bug with the volume info and a (dropbox?) link

from which we can download the dumps? You can remove/replace the

filepaths from them.

Regards.

Ravi

Cheers!

Dan

Is there potentially something misconfigured here?

I did see a reference to a memory leak in another thread in this

list, but that had to do with the setting of quotas; I don't have

any quotas set on my system.

Thanks,

Dan Ragle

daniel@xxxxxxxxxxxxxx


_______________________________________________

Gluster-users mailing list

Gluster-users@xxxxxxxxxxx

http://lists.gluster.org/mailman/listinfo/gluster-users
