Re: [RFC][PATCH 0/13] Per-container dcache management (and a bit more)

>> No, you don't take into account that once we have the dentry cache shrunk
>> the inode cache can also be shrunk (since there are no objects other than
>> dentries that hold inodes in cache), but not vice versa. That said --
>> if we keep the dentry cache from growing it becomes possible to keep the
>> inode cache from growing.
> 
> That's a fairly naive view of the way the caches interact. Unlike
> dentries, inodes can be dirtied and can't be reclaimed until they
> are clean. That requires IO. 

The dirty set ITSELF poses big problems, regardless of the inode cache.

Besides, a dirty page *can* be cleaned (written back), thus releasing the inode,
while a dentry held by an open file can *not* be dropped.

Moreover, the dirty set cannot grow infinitely, so the amount of these "hard
to reclaim" inodes is limited as well. And the two are typically not even
comparable in scale... It's a matter of policy, not inability.
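
To make the pinning argument concrete, here is a trivial userspace model (plain C,
not kernel code; the names only mimic the kernel ones). Writeback alone does not
make the inode evictable while a dentry still references it; dropping the dentry does:

#include <stdbool.h>
#include <stdio.h>

/* Toy model: a cached dentry holds a reference on its inode. */
struct inode  { int refcount; bool dirty; };
struct dentry { struct inode *d_inode; };

static bool inode_evictable(const struct inode *i)
{
	/* evictable once nobody references it and it is clean */
	return i->refcount == 0 && !i->dirty;
}

static void drop_dentry(struct dentry *d)
{
	/* dropping the dentry releases its inode reference (cf. iput()) */
	d->d_inode->refcount--;
	d->d_inode = NULL;
}

int main(void)
{
	struct inode  ino  = { .refcount = 1, .dirty = true };
	struct dentry dent = { .d_inode = &ino };

	ino.dirty = false;                       /* writeback cleans the inode...          */
	printf("after writeback: %d\n", inode_evictable(&ino));  /* 0: still pinned        */

	drop_dentry(&dent);                      /* ...only killing the dentry unpins it   */
	printf("after dput:      %d\n", inode_evictable(&ino));  /* 1: now evictable       */
	return 0;
}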

> Hence the inode cache can't be
> reclaimed as easily as the dentry cache, nor can controlling the
> size of the dentry cache control the size of the inode
> cache. At best, it's a second order effect.

I'm not talking about control here, I'm talking about whether you will be able to
free the memory *at* *all* or not.

> Effectively, what you have is:
> 
> 	L1 cache = dentry cache
> 	L2 cache = VFS inode cache,
> 			pinned by L1,
> 			pinned by dirty state
> 	L3 cache = 1st level FS inode cache,
> 			pinned by L2
> 			pinned by dirty state
> 
> None of the cache sizes are fixed, and overall size is limited only
> by RAM, so you will always tend to have the L3 cache dominate memory
> usage because:
> 
> 	a) they are the largest objects in the hierarchy; and
> 	b) they are pinned by the L1 and L2 caches and need to be
> 	freed from those caches first.
> 
> If you limit the size of the L2/L3 inode cache, you immediately
> limit the size of the dentry cache for everything but heavy users of
> hard links. If you can't allocate more inodes, you can't allocate a
> new dentry.

Hm! You propose to manage the Linux page cache by managing the number of
cached inodes? Sounds like a plan ;)
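
For concreteness, here is a rough model of the hierarchy quoted above (simplified
fields, made-up size; e.g. ext4's ext4_inode_info embeds the VFS inode in just this
way). It shows why the L3 objects dominate memory and stay put until the dentry lets go:

#include <stdio.h>

struct vfs_inode { int i_count; };                  /* L2 */

struct fs_inode {                                   /* L3 */
	char fs_private[800];         /* fs-specific state dominates (size is made up) */
	struct vfs_inode vfs_inode;   /* the L2 object is embedded in the L3 one       */
};

struct dentry { struct vfs_inode *d_inode; };       /* L1 pins L2, and hence L3 */

int main(void)
{
	printf("L3 object: %zu bytes, embedded L2 part: %zu bytes\n",
	       sizeof(struct fs_inode), sizeof(struct vfs_inode));
	return 0;
}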

>> 1. Can you be more specific on this - which parts of VFS suffer from the
>> LRU being global?
> 
> Performance. It doesn't scale beyond a few CPUs before lock
> contention becomes the limiting factor.

1. The LRU lock is global now, regardless of where the dentries are stored - on the
   global list or on per-sb lists;
2. If we're talking about "the possibility to" optimize, then a per-sb lock is
   significantly less optimization-friendly than a per-abstract-mob one. Create mobs
   per-sb and be happy (again) - a rough sketch of what I mean follows below.
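
Roughly, this is what I mean (declarations only; "dcache_mob" and the per-sb pointer
are made-up names for illustration, not the RFC's actual identifiers):

#include <linux/list.h>
#include <linux/spinlock.h>

struct dcache_mob {
	spinlock_t       lock;      /* private lock instead of one global LRU lock */
	struct list_head lru;       /* dentries charged to this mob                */
	unsigned long    nr;        /* current number of dentries                  */
	unsigned long    limit;     /* configured limit to shrink against          */
};

/* A per-sb LRU then falls out as the special case of one mob per sb,
 * i.e. struct super_block would just carry a pointer like this one: */
struct sb_mob_link {
	struct dcache_mob *s_mob;
};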

>> The only thing I found was the problem with shrinking
>> the dcache for some sb on umount, but in my patch #4 I made both routines
>> doing it work on dentry tree, not the LRU list and thus the global LRU is
>> no longer an issue at this point.
> 
> Actually, it is, because you've still got to remove the dentry from
> the LRU to free it, which means traversing the global lock.

You've still got to remove the dentry from the per-sb LRU to free it, which still
means traversing the global lock. So where's the catch?

>> 2. If for any reason you do need to keep LRU per super block (please share
>> one if you do) we can create mobs per super block :) In other words - with
>> mobs we're much more flexible with how to manage dentry LRU-s rather than
>> with per-sb LRU-s.
> 
> Because of the heirarchical nature of the caches, and the fact that
> we've got to jump through hoops to make sure the superblock doesn't
> go away while we are doing a shrinker walk (the s_umount lock
> problem). Move to a per-sb shrinker means the shrinker callback has
> the same life cycle as the superblock, and we no longer have a big
> mess of locking and lifecycle concerns in memory reclaim.
> 
> On top of that, a single shrinker callout that shrinks the dentry,
> VFS inode and FS inode caches in a single call means we do larger
> chunks of work on each superblock at a time instead of a small
> handful of dentries or inodes per shrinker call as the current
> "proportion across all sbs" code currently works. That will give
> reclaim a smaller CPU cache footprint with higher hit rates, so
> should significantly reduce the CPU usage of shrinking the caches as
> well.
> 
> Not to mention having a per-sb shrinker means that you can call the
> shrinker from inode allocation when you run out of inodes, and it
> will shrink the dentry cache, the VFS inode cache and the FS inode
> cache in the correct order to free up inodes as quickly as possible
> to allow the new inode allocation to occur....

Yet again - this all sounds great, but why didn't you consider my proposal to
create mobs per super block? That would solve not only my problems, but also
all of *yours*.
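
To be explicit about the claim: with one mob per super block, the per-sb reclaim pass
you describe could do the L1 -> L2 -> L3 ordering directly. The three helpers below are
hypothetical names used only to show the ordering (they are not existing kernel
functions), and the shrinker hook that would drive such a pass has a signature that
differs between kernel versions:

#include <linux/fs.h>

/* Hypothetical helpers, for illustration only. */
void prune_sb_dentries(struct super_block *sb, unsigned long nr);
void prune_sb_icache(struct super_block *sb, unsigned long nr);
void prune_fs_icache(struct super_block *sb, unsigned long nr);

void shrink_one_sb(struct super_block *sb, unsigned long nr)
{
	prune_sb_dentries(sb, nr);   /* L1 first: unpins VFS inodes                       */
	prune_sb_icache(sb, nr);     /* then L2: unpins the FS inodes                     */
	prune_fs_icache(sb, nr);     /* then L3: fs-private inode cache, if the fs has one */
}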

>>>> The first 5 patches are preparations for this, descriptive (I hope)
>>>> comments are inside them.
>>>>
>>>> The general idea of this set is -- make the dentries subtrees be
>>>> limited in size and shrink them as they hit the configured limit.
>>>
>>> And what if the inode cache does not shrink with it?
>>
>> Yet again - that's not a big deal. Once we killed dentries, the inodes are
>> no longer pinned in memory and the very first try_to_free_pages can free them.
> 
> See above - the inode cache does not shrink in proportion with the
> dentry cache.

See above - a big set of pinned dentries gives you no chance to free any inodes,
not the other way around.

> <sigh>
> 
> I never implied quotas were for limiting cache usage. I only
> suggested they were the solution to your DOS example by preventing
> unbound numbers of inodes from being created by an unprivileged
> user.

As I've shown above, this does NOT prevent the DoS. This ... trick with quotas
cannot be considered a solution to *any* memory problem.

> To me, it sounds like you overprovision your servers and then

Not *my* servers. But this doesn't matter.

> have major troubles when everyone tries to use what you supplied
> them with simultaneously. There is a simple solution to that. ;)

That's how people use Linux (and Containers) - they do overcommit resources,
and hard-limiting everything to the physical capacity of the host is not
always the best way to go.

> Otherwise, I think you need to directly limit the size of the inode
> caches, not try to do it implicitly via 2nd and 3rd order side
> effects of controlling the size of the dentry cache.

As I stated above - limiting the inode cache size will have an uncontrollable
effect on Linux's ability to manage the page cache.

> Cheers,
> 
> Dave.
