On 21 February 2018 at 21:11, Dan Ragle <daniel@xxxxxxxxxxxxxx> wrote:
On 2/3/2018 8:58 AM, Dan Ragle wrote:
On 2/2/2018 2:13 AM, Nithya Balachandran wrote:
Hi Dan,
It sounds like you might be running into [1]. The patch has been posted upstream and the fix should be in the next release.
In the meantime, I'm afraid there is no way to get around this without restarting the process.
Regards,
Nithya
[1]https://bugzilla.redhat.com/show_bug.cgi?id=1541264
Much appreciated. Will watch for the next release and retest then.
Cheers!
Dan
FYI, this looks like it's fixed in 3.12.6. Ran the test setup with repeated ls listings for just shy of 48 hours with no increase in RAM usage. Next will try my production application load for a while to see if it holds steady.
The gf_dht_mt_dht_layout_t memusage num_allocs went quickly up to 105415 and then stayed there for the entire 48 hours.
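(For anyone reproducing this: the repeated-listing test isn't spelled out in the thread. A minimal sketch of that kind of loop, assuming the /var/www mount point used throughout and an arbitrary pause between passes, would be:)

while true; do
    # recursively list the mounted volume so the client keeps issuing lookups/readdirs
    ls -R /var/www > /dev/null 2>&1
    sleep 60
done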
Excellent. Thanks for letting us know.
Nithya
Thanks for the quick response,
Dan
On 2 February 2018 at 02:57, Dan Ragle <daniel@xxxxxxxxxxxxxx> wrote:
On 1/30/2018 6:31 AM, Raghavendra Gowdappa wrote:
----- Original Message -----
From: "Dan Ragle" <daniel@xxxxxxxxxxxxxx>
To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "Ravishankar N" <ravishankar@xxxxxxxxxx>
Cc: gluster-users@xxxxxxxxxxx, "Csaba Henk" <chenk@xxxxxxxxxx>, "Niels de Vos" <ndevos@xxxxxxxxxx>, "Nithya Balachandran" <nbalacha@xxxxxxxxxx>
Sent: Monday, January 29, 2018 9:02:21 PM
Subject: Re: [Gluster-users] Run away memory with gluster mount
On 1/29/2018 2:36 AM, Raghavendra Gowdappa wrote:
----- Original Message -----
From: "Ravishankar N" <ravishankar@xxxxxxxxxx>
To: "Dan Ragle" <daniel@xxxxxxxxxxxxxx>, gluster-users@xxxxxxxxxxx
Cc: "Csaba Henk" <chenk@xxxxxxxxxx>, "Niels de Vos" <ndevos@xxxxxxxxxx>, "Nithya Balachandran" <nbalacha@xxxxxxxxxx>, "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
Sent: Saturday, January 27, 2018 10:23:38 AM
Subject: Re: [Gluster-users] Run away memory with gluster mount
On 01/27/2018 02:29 AM, Dan Ragle wrote:
On 1/25/2018 8:21 PM, Ravishankar N wrote:
On 01/25/2018 11:04 PM, Dan Ragle wrote:
*sigh* trying again to correct formatting ... apologize for the earlier mess.

Having a memory issue with Gluster 3.12.4 and not sure how to troubleshoot. I don't *think* this is expected behavior.

This is on an updated CentOS 7 box. The setup is a simple two-node replicated layout where the two nodes act as both server and client.

The volume in question:

Volume Name: GlusterWWW
Type: Replicate
Volume ID: 8e9b0e79-f309-4d9b-a5bb-45d065faaaa3
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: vs1dlan.mydomain.com:/glusterfs_bricks/brick1/www
Brick2: vs2dlan.mydomain.com:/glusterfs_bricks/brick1/www
Options Reconfigured:
nfs.disable: on
cluster.favorite-child-policy: mtime
transport.address-family: inet

I had some other performance options in there (increased cache-size, md invalidation, etc.) but stripped them out in an attempt to isolate the issue. Still got the problem without them.

The volume currently contains over 1M files.

When mounting the volume, I get (among other things) a process as such:

/usr/sbin/glusterfs --volfile-server=localhost --volfile-id=/GlusterWWW /var/www

This process begins with little memory, but then as files are accessed in the volume the memory increases. I set up a script that simply reads the files in the volume one at a time (no writes). It's been running on and off for about 12 hours now and the resident memory of the above process is already at 7.5G and continues to grow slowly. If I stop the test script the memory stops growing, but does not reduce. Restart the test script and the memory begins slowly growing again.
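(The read script itself wasn't posted; a minimal sketch of that kind of read-only exerciser, using the /var/www mount shown above, might look like this:)

while true; do
    # read every regular file under the mount once and discard the data (no writes)
    find /var/www -type f -exec cat {} + > /dev/null 2>&1
done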
This is obviously a contrived app environment. With my intended application load it takes about a week or so for the memory to get high enough to invoke the oom killer.
Can you try debugging with the statedump (https://gluster.readthedocs.io/en/latest/Troubleshooting/statedump/#read-a-statedump) of the fuse mount process and see what member is leaking? Take the statedumps in succession, maybe once initially during the I/O and once the memory gets high enough to hit the OOM mark. Share the dumps here.
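(For anyone following along: per the statedump documentation linked above, a client statedump can be triggered by sending SIGUSR1 to the fuse mount process, and the dump files normally land under /var/run/gluster. A sketch, assuming the mount command shown earlier; adjust the pgrep pattern if it matches more than one process:)

# kill -USR1 $(pgrep -f 'volfile-id=/GlusterWWW')
# ls /var/run/gluster/glusterdump.*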
Regards,
Ravi
Thanks for the reply. I noticed yesterday that an update (3.12.5) had been posted, so I went ahead and updated and repeated the test overnight. The memory usage does not appear to be growing as quickly as it was with 3.12.4, but does still appear to be growing.

I should also mention that there is another process beyond my test app that is reading the files from the volume. Specifically, there is an rsync that runs from the second node 2-4 times an hour that reads from the GlusterWWW volume mounted on node 1. Since none of the files in that mount are changing it doesn't actually rsync anything, but nonetheless it is running and reading the files in addition to my test script. (It's a part of my intended production setup that I forgot was still running.)

The mount process appears to be gaining memory at a rate of about 1GB every 4 hours or so. At that rate it'll take several days before it runs the box out of memory. But I took your suggestion and made some statedumps today anyway, about 2 hours apart, 4 total so far. It looks like there may already be some actionable information. These are the only registers where num_allocs has grown with each of the four samples:
[mount/fuse.fuse - usage-type gf_fuse_mt_gids_t memusage]
---> num_allocs at Fri Jan 26 08:57:31 2018: 784
---> num_allocs at Fri Jan 26 10:55:50 2018: 831
---> num_allocs at Fri Jan 26 12:55:15 2018: 877
---> num_allocs at Fri Jan 26 14:58:27 2018: 908

[mount/fuse.fuse - usage-type gf_common_mt_fd_lk_ctx_t memusage]
---> num_allocs at Fri Jan 26 08:57:31 2018: 5
---> num_allocs at Fri Jan 26 10:55:50 2018: 10
---> num_allocs at Fri Jan 26 12:55:15 2018: 15
---> num_allocs at Fri Jan 26 14:58:27 2018: 17

[cluster/distribute.GlusterWWW-dht - usage-type gf_dht_mt_dht_layout_t memusage]
---> num_allocs at Fri Jan 26 08:57:31 2018: 24243596
---> num_allocs at Fri Jan 26 10:55:50 2018: 27902622
---> num_allocs at Fri Jan 26 12:55:15 2018: 30678066
---> num_allocs at Fri Jan 26 14:58:27 2018: 33801036
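(To pull a single allocation type out of each dump for comparison, something along these lines should work; the -A window is a guess at the size of each memusage block, and matching on num_allocs will also pick up the max_num_allocs line:)

# for f in glusterdump.153904.dump.*; do echo "== $f"; grep -A6 'gf_dht_mt_dht_layout_t memusage' "$f" | grep num_allocs; done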
Not sure the best way to get you the full dumps. They're pretty big, over 1G for all four. Also, I noticed some filepath information in there that I'd rather not share. What's the recommended next step?
Please run the following queries on the statedump files and report the results to us:
# grep itable <client-statedump> | grep active | wc -l
# grep itable <client-statedump> | grep active_size
# grep itable <client-statedump> | grep lru | wc -l
# grep itable <client-statedump> | grep lru_size
# grep itable <client-statedump> | grep purge | wc -l
# grep itable <client-statedump> | grep purge_size
Had to restart the test and have been running for 36 hours now. RSS is currently up to 23g.
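(A simple way to capture that growth over time is to log the mount process's RSS on an interval; a sketch with an arbitrary 5-minute sample period and a hypothetical log path:)

PID=$(pgrep -f 'volfile-id=/GlusterWWW' | head -n1)   # fuse mount PID; adjust pattern if needed
while true; do
    # timestamp plus resident set size in KB
    echo "$(date +%s) $(ps -o rss= -p "$PID")" >> /tmp/glusterwww-rss.log
    sleep 300
done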
Working on getting a bug report with a link to the dumps. In the meantime, I'm including the results of your above queries for the first dump, the 18 hour dump, and the 36 hour dump:
# grep itable glusterdump.153904.dump.1517104561 | grep active | wc -l
53865
# grep itable glusterdump.153904.dump.1517169361 | grep active | wc -l
53864
# grep itable glusterdump.153904.dump.1517234161 | grep active | wc -l
53864
# grep itable glusterdump.153904.dump.1517104561 | grep active_size
xlator.mount.fuse.itable.active_size=53864
# grep itable glusterdump.153904.dump.1517169361 | grep active_size
xlator.mount.fuse.itable.active_size=53863
# grep itable glusterdump.153904.dump.1517234161 | grep active_size
xlator.mount.fuse.itable.active_size=53863
# grep itable glusterdump.153904.dump.1517104561 | grep lru | wc -l
998510
# grep itable glusterdump.153904.dump.1517169361 | grep lru | wc -l
998510
# grep itable glusterdump.153904.dump.1517234161 | grep lru | wc -l
995992
# grep itable glusterdump.153904.dump.1517104561 | grep lru_size
xlator.mount.fuse.itable.lru_size=998508
# grep itable glusterdump.153904.dump.1517169361 | grep lru_size
xlator.mount.fuse.itable.lru_size=998508
# grep itable glusterdump.153904.dump.1517234161 | grep lru_size
xlator.mount.fuse.itable.lru_size=995990
Around 1 million inodes in the lru table!! These are inodes the kernel has just cached and no operation is currently in progress on them. This could be the reason for the high memory usage.

We have a patch being worked on (currently merged on the experimental branch) [1] that will help in these scenarios. In the meantime, can you remount glusterfs with the options --entry-timeout=0 and --attribute-timeout=0? This will make sure that the kernel won't cache inodes/attributes of the files and should bring down the memory usage.
I am curious to know what your data set is like. Is it a case of too many directories and files present in deep directories? I am wondering whether a significant number of the inodes cached by the kernel are there to hold dentry structures in the kernel.
[1] https://review.gluster.org/#/c/18665/
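(If the volume is mounted via mount(8)/fstab rather than by invoking glusterfs directly, the same knobs should be available as mount options; a hedged example, not verified in this thread:)

# mount -t glusterfs -o attribute-timeout=0,entry-timeout=0 localhost:/GlusterWWW /var/www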
OK, remounted with your recommended attributes and repeated the test. Now the mount process looks like this:

/usr/sbin/glusterfs --attribute-timeout=0 --entry-timeout=0 --volfile-server=localhost --volfile-id=/GlusterWWW /var/www

However after running for 36 hours it's again at about 23g (about the same place it was on the first test).
A few metrics from the 36 hour mark:

num_allocs for [cluster/distribute.GlusterWWW-dht - usage-type gf_dht_mt_dht_layout_t memusage] is 109140094. Seems at least somewhat similar to the original test, which had 117901593 at the 36 hour mark.

The dump file at the 36 hour mark had nothing for lru or lru_size. However, at the dump two hours prior it had:

# grep itable glusterdump.67299.dump.1517493361 | grep lru | wc -l
998510
# grep itable glusterdump.67299.dump.1517493361 | grep lru_size
xlator.mount.fuse.itable.lru_size=998508

and the same thing for the dump four hours later. Are these values only relevant when the ls -R is actually running? I'm thinking the 36 hour dump may have caught the ls -R between runs there (?)
The data set is multiple Web sites. I know there's some litter there we can clean up, but I'd guess not more than 200-300k files or so. The biggest culprit is a single directory that we use as a multi-purpose file store, with filenames stored as GUIDs and linked to a DB. That directory currently has 500k+ files. Another directory serves a similar purpose and has about 66k files in it. The rest is generally distributed more "normally", i.e., a mixed nesting of directories and files.
Cheers!
Dan
# grep itable glusterdump.153904.dump.1517104561 | grep purge | wc -l
1
# grep itable glusterdump.153904.dump.1517169361 | grep purge | wc -l
1
# grep itable glusterdump.153904.dump.1517234161 | grep purge | wc -l
1
# grep itable glusterdump.153904.dump.1517104561 | grep purge_size
xlator.mount.fuse.itable.purge_size=0
# grep itable glusterdump.153904.dump.1517169361 | grep purge_size
xlator.mount.fuse.itable.purge_size=0
# grep itable glusterdump.153904.dump.1517234161 | grep purge_size
xlator.mount.fuse.itable.purge_size=0
Cheers,
Dan
I've CC'd the fuse/dht devs to see if these data types have potential leaks. Could you raise a bug with the volume info and a (dropbox?) link from which we can download the dumps? You can remove/replace the filepaths from them.

Regards,
Ravi
Cheers!
Dan
Is there potentially something misconfigured here?

I did see a reference to a memory leak in another thread in this list, but that had to do with the setting of quotas; I don't have any quotas set on my system.

Thanks,
Dan Ragle
daniel@xxxxxxxxxxxxxx
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users