On Wed, Sep 6, 2017 at 11:16 AM, Csaba Henk <chenk@xxxxxxxxxx> wrote:
Thanks Du, nice bit of info! It made me wonder about the following:
- Could setting vfs_cache_pressure to 100 + x then be the default answer we
give to "glusterfs client high memory usage" type of complaints?
- And then x = ? Was proper performance testing done to see how performance /
memory consumption changes as a function of vfs_cache_pressure?
I had a discussion with Manoj on this. One drawback of the vfs_cache_pressure tunable is that it drives a dynamic algorithm which decides whether to purge from the page cache or the inode cache based on current memory pressure. An obvious problem for glusterfs is that its various caches are not visible to the kernel (memory consumed by glusterfs is reflected neither in the page cache nor in the inode cache). This _might_ result in the algorithm working poorly.
- vfs_cache_pressure is a system-wide tunable. If 100 + x is ideal for
GlusterFS, can we take the courage to propose this? Is there no risk of
trashing other (disk-based) filesystems' performance?
That's a valid point. Behavior of other filesystems would be a concern.
I've not really thought this suggestion of tuning the /proc/sys/vm tunables through, and I am not an expert on which tunables are at our disposal. I just wanted to bring the idea to the notice of a wider audience.
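For anyone who wants to experiment, the tunable lives under /proc/sys/vm and can be adjusted at runtime (requires root; treat the value 200 as an example, not a recommendation):

```shell
# Read the current value (kernel default is 100)
cat /proc/sys/vm/vfs_cache_pressure

# Reclaim dentry/inode caches more aggressively until next reboot
sysctl -w vm.vfs_cache_pressure=200

# Persist the setting across reboots
echo 'vm.vfs_cache_pressure = 200' >> /etc/sysctl.d/90-gluster.conf
```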
Csaba
On Wed, Sep 6, 2017 at 6:57 AM, Raghavendra G <raghavendra@xxxxxxxxxxx> wrote:
> Another parallel effort could be trying to configure the number of
> inodes/dentries cached by kernel VFS using /proc/sys/vm interface.
>
> ==============================================================
>
> vfs_cache_pressure
> ------------------
>
> This percentage value controls the tendency of the kernel to reclaim
> the memory which is used for caching of directory and inode objects.
>
> At the default value of vfs_cache_pressure=100 the kernel will attempt to
> reclaim dentries and inodes at a "fair" rate with respect to pagecache and
> swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to
> prefer
> to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel
> will
> never reclaim dentries and inodes due to memory pressure and this can easily
> lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
> causes the kernel to prefer to reclaim dentries and inodes.
>
> Increasing vfs_cache_pressure significantly beyond 100 may have negative
> performance impact. Reclaim code needs to take various locks to find
> freeable
> directory and inode objects. With vfs_cache_pressure=1000, it will look for
> ten times more freeable objects than there are.
>
> Also, we have an article for sysadmins which has a relevant section:
>
> <quote>
>
> With GlusterFS, many users with a lot of storage and many small files
> easily end up using a lot of RAM on the server side due to
> 'inode/dentry' caching, leading to decreased performance when the kernel
> keeps crawling through data-structures on a 40GB RAM system. Changing
> this value higher than 100 has helped many users to achieve fair caching
> and more responsiveness from the kernel.
>
> </quote>
>
> Complete article can be found at:
> https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Linux%20Kernel%20Tuning/
>
> regards,
>
>
> On Tue, Sep 5, 2017 at 5:20 PM, Raghavendra Gowdappa <rgowdapp@xxxxxxxxxx>
> wrote:
>>
>> +gluster-devel
>>
>> Ashish just spoke to me about the need for GC of inodes due to some state
>> in the inode that is being proposed in EC. Hence adding more people to the
>> conversation.
>>
>> > > On 4 September 2017 at 12:34, Csaba Henk <chenk@xxxxxxxxxx> wrote:
>> > >
>> > > > I don't know, depends on how sophisticated GC we need/want/can get
>> > > > by. I
>> > > > guess the complexity will be inherent, ie. that of the algorithm
>> > > > chosen
>> > > > and
>> > > > how we address concurrency & performance impacts, but once that's
>> > > > got
>> > > > right
>> > > > the other aspects of implementation won't be hard.
>> > > >
>> > > > Eg. would it be good just to maintain a simple LRU list?
>> > > >
>> >
>> > Yes. I was also thinking of leveraging the lru list. We can invalidate
>> > the first "n" inodes from the lru list of the fuse inode table.
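[A minimal, self-contained sketch of that idea, using toy types. In glusterfs proper the purge loop would walk the inode table's lru list and call inode_invalidate() on each victim so the kernel drops its reference; toy_invalidate() below is a placeholder for that step.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a cached inode on a singly linked LRU list.
 * Head = least recently used; tail = most recently used. */
struct toy_inode {
    int ino;
    struct toy_inode *next;  /* towards the MRU end */
};

struct toy_table {
    struct toy_inode *lru_head;
    size_t lru_size;
};

/* Placeholder for reverse invalidation (inode_invalidate() in glusterfs),
 * which would tell the kernel to forget this inode. */
static void toy_invalidate(struct toy_inode *in) {
    (void)in;  /* nothing to notify in this toy model */
}

/* Purge up to n entries from the cold (LRU) end of the list.
 * Returns the number actually purged. */
size_t toy_table_prune(struct toy_table *t, size_t n) {
    size_t purged = 0;
    while (t->lru_head && purged < n) {
        struct toy_inode *victim = t->lru_head;
        t->lru_head = victim->next;
        t->lru_size--;
        toy_invalidate(victim);
        free(victim);
        purged++;
    }
    return purged;
}

/* Insert a freshly used inode at the MRU (tail) end. */
void toy_table_add(struct toy_table *t, int ino) {
    struct toy_inode *in = malloc(sizeof(*in));
    in->ino = ino;
    in->next = NULL;
    if (!t->lru_head) {
        t->lru_head = in;
    } else {
        struct toy_inode *p = t->lru_head;
        while (p->next)
            p = p->next;
        p->next = in;
    }
    t->lru_size++;
}
```

The real implementation would also have to skip inodes that still hold active references, but the "purge first n from the lru list" shape is the same.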
>> >
>> > >
>> > > That might work for starters.
>> > >
>> > > >
>> > > > Csaba
>> > > >
>> > > > On Mon, Sep 4, 2017 at 8:48 AM, Nithya Balachandran
>> > > > <nbalacha@xxxxxxxxxx>
>> > > > wrote:
>> > > >
>> > > >>
>> > > >>
>> > > >> On 4 September 2017 at 12:14, Csaba Henk <chenk@xxxxxxxxxx> wrote:
>> > > >>
>> > > >>> Basically, this is how I see the fuse invalidate calls: as rescuers
>> > > >>> of sanity.
>> > > >>>
>> > > >>> Normally, when you have lot of certain kind of stuff that tends to
>> > > >>> accumulate, the immediate thought is: let's set up some garbage
>> > > >>> collection
>> > > >>> mechanism, that will take care of keeping the accumulation at bay.
>> > > >>> But
>> > > >>> that's what doesn't work with inodes in a naive way, as they are
>> > > >>> referenced
>> > > >>> from kernel, so we have to keep them around until kernel tells us
>> > > >>> it's
>> > > >>> giving up its reference. However, with the fuse invalidate calls
>> > > >>> we can
>> > > >>> take the initiative and instruct the kernel: "hey, kernel, give up
>> > > >>> your
>> > > >>> references to this thing!"
>> > > >>>
>> > > >>> So we are actually free to implement any kind of inode GC in
>> > > >>> glusterfs,
>> > > >>> just have to take care to add the proper callback to
>> > > >>> fuse_invalidate_*
>> > > >>> and
>> > > >>> we are good to go.
>> > > >>>
>> > > >>>
>> > > >> That sounds good and something we need to do in the near future. Is
>> > > >> this
>> > > >> something that is easy to implement?
>> > > >>
>> > > >>
>> > > >>> Csaba
>> > > >>>
>> > > >>> On Mon, Sep 4, 2017 at 7:00 AM, Nithya Balachandran
>> > > >>> <nbalacha@xxxxxxxxxx
>> > > >>> > wrote:
>> > > >>>
>> > > >>>>
>> > > >>>>
>> > > >>>> On 4 September 2017 at 10:25, Raghavendra Gowdappa
>> > > >>>> <rgowdapp@xxxxxxxxxx
>> > > >>>> > wrote:
>> > > >>>>
>> > > >>>>>
>> > > >>>>>
>> > > >>>>> ----- Original Message -----
>> > > >>>>> > From: "Nithya Balachandran" <nbalacha@xxxxxxxxxx>
>> > > >>>>> > Sent: Monday, September 4, 2017 10:19:37 AM
>> > > >>>>> > Subject: Fuse mounts and inodes
>> > > >>>>> >
>> > > >>>>> > Hi,
>> > > >>>>> >
>> > > >>>>> > One of the reasons for the memory consumption in gluster fuse
>> > > >>>>> > mounts
>> > > >>>>> is the
>> > > >>>>> > number of inodes in the table which are never kicked out.
>> > > >>>>> >
>> > > >>>>> > Is there any way to default to an entry-timeout and
>> > > >>>>> attribute-timeout value
>> > > >>>>> > while mounting Gluster using Fuse? Say 60s each so those
>> > > >>>>> > entries
>> > > >>>>> will be
>> > > >>>>> > purged periodically?
>> > > >>>>>
>> > > >>>>> Once the entry timeout expires, inodes won't be purged. The kernel
>> > > >>>>> sends a lookup to revalidate the mapping of path to inode. AFAIK,
>> > > >>>>> reverse invalidation (see inode_invalidate) is the only way to
>> > > >>>>> make the kernel forget inodes/attributes.
>> > > >>>>>
>> > > >>>>> Is that something that can be done from the Fuse mount ? Or is
>> > > >>>>> this
>> > > >>>> something that needs to be added to Fuse?
>> > > >>>>
>> > > >>>>> >
>> > > >>>>> > Regards,
>> > > >>>>> > Nithya
>> > > >>>>> >
>> > > >>>>>
>> > > >>>>
>> > > >>>>
>> > > >>>
>> > > >>
>> > > >
>> > >
>> >
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel@xxxxxxxxxxx
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
>
>
> --
> Raghavendra G
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://lists.gluster.org/mailman/listinfo/gluster-devel
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-devel
--
Raghavendra G