Re: [PATCH v3 0/4] Per-container dcache limitation

On 08/17/2011 10:27 PM, Dave Chinner wrote:
On Wed, Aug 17, 2011 at 11:44:53AM -0700, Glauber Costa wrote:
On 08/16/2011 10:43 PM, Dave Chinner wrote:
On Sun, Aug 14, 2011 at 07:13:48PM +0400, Glauber Costa wrote:
Hello,

This series is just like v2, except it addresses
Eric's comments regarding percpu variables.

Let me know if there are further comments, and
I'll promptly address them as well. Otherwise,
I feel this is ready for inclusion.

Hi David,

I am not answering everything now, since I'm travelling, but let me
get to this one:

Just out of curiosity, one thing I've noticed about dentries is
that in general at any given point in time most dentries are unused.
Under the workloads I'm testing, even when I have a million cached
dentries, I only have roughly 7,000 accounted as used.  That is, most
of the dentries in the system are on a LRU and accounted in
sb->s_nr_dentry_unused of their owner superblock.
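
A rough way to observe this ratio on a running Linux system is the global dentry-state sysctl. This is only a sketch: it assumes the first two fields of /proc/sys/fs/dentry-state are nr_dentry and nr_unused, it reports system-wide counters rather than the per-superblock s_nr_dentry_unused mentioned above, and the variable names are illustrative.

```shell
#!/bin/sh
# Sketch: compare total vs. unused dentries using the global sysctl.
# Assumption: fields 1 and 2 of dentry-state are nr_dentry and nr_unused.
state=/proc/sys/fs/dentry-state

if [ -r "$state" ]; then
    # Remaining fields are ignored here.
    read -r nr_dentry nr_unused _rest < "$state"
    echo "total=$nr_dentry unused=$nr_unused in_use=$((nr_dentry - nr_unused))"
    status=ok
else
    # Non-Linux system or /proc not mounted.
    echo "dentry-state not available on this system"
    status=unavailable
fi
```

The counters are approximate, so the in_use figure is indicative only.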

So rather than introduce a bunch of new infrastructure to track the
number of dentries allocated, why not simply limit the number of
dentries allowed on the LRU? We already track that, and the shrinker
already operates on the LRU, so we don't really need any new
infrastructure.
Because this only works well for cooperative workloads. And we can't
really assume that in the virtualization world. One container can
come up with a bogus workload - not even hard to write - that has
the sole purpose of punishing every other container that shares
resources with it.

Sure, but as I've said before you can prevent the container from
consuming too many dentries (via a hard limit) simply by adding an
inode quota per container.  This is exactly the sort of
uncooperative behaviour filesystem quotas were invented to
prevent.

Perhaps we should separate the DOS case from the normal
(co-operative) use case.

As I mentioned previously, your inode allocation based DOS (while
(1); mkdir x; cd x; done type cases) example is trivial to prevent
with quotas. It was claimed that it was not possible to prevent with
filesystem quotas. I left proving that as an exercise for the
reader, but I feel I need to re-iterate my point with an example.

That is, if you can't create a million inodes in the container, you
can't instantiate a million dentries in the container.  For example,
use project quotas on XFS to create directory tree containers with
hard limits on the number of inodes:

David,

The dentry -> inode relationship is an N:1 relationship. Therefore, it is
hard to believe that your example below would still work if we were trying to fill the cache through link operations, instead of operations like mkdir, that enforce a 1:1 relationship.

Capping the dentry numbers, OTOH, caps the # of inodes as well. Although we *can* have inodes lying around in the caches without an associated dentry at some point in time, we cannot have inodes *pinned* into the cache without an associated dentry. So they will go away soon enough.

So maybe it is the other way around here, and it is the people who want an inode cap who should model it after a dentry cache cap.

$ cat /etc/projects
12345:/mnt/scratch/projects/foo
$ cat /etc/projid
foo:12345
$ sudo mount -o prjquota,delaylog,nobarrier,logbsize=262144,inode64 /dev/vda /mnt/scratch
$ mkdir -p /mnt/scratch/projects/foo
$ sudo xfs_quota -x -c "project -s foo" /mnt/scratch
Setting up project foo (path /mnt/scratch/projects/foo)...
Setting up project foo (path /mnt/scratch/projects/foo)...
Processed 2 (/etc/projects and cmdline) paths for project foo with recursion depth infinite (-1).
$ sudo xfs_quota -x -c "limit -p ihard=1436 foo" /mnt/scratch
$ sudo xfs_quota -c "quota -p foo" /mnt/scratch
$ cd /mnt/scratch/projects/foo/
$ ~/src/fs_mark-3.3/dir-depth
count 1435, err -1, err Disk quota exceeded pwd /mnt/scratch/projects/foo/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/
x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/
x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x/x
$

It stopped at 1435 directories because the container
(/mnt/scratch/projects/foo) ran out of inodes in its quota. No DOS
there.
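
The dir-depth program itself is not shown in this thread. A bounded shell sketch of the same kind of mkdir/cd descent, under the assumption that it simply counts successful mkdir calls and stops at the first error, could look like this. The max_depth bound and the scratch directory are hypothetical additions so that the sketch terminates on a system without quotas.

```shell
#!/bin/sh
# Bounded sketch of a "mkdir x; cd x" descent; stops at the first failure.
# Assumption: this mirrors the idea of dir-depth, not its actual source.
set -u

start=$(mktemp -d)   # scratch directory standing in for the container subtree
max_depth=50         # hypothetical safety bound, not part of the original test
count=0

cd "$start" || exit 1
while [ "$count" -lt "$max_depth" ]; do
    # Under an inode quota, mkdir is where EDQUOT would surface.
    if ! mkdir x 2>/dev/null; then
        echo "mkdir failed at depth $count"
        break
    fi
    cd x || break
    count=$((count + 1))
done

echo "count=$count"

# Leave the tree before removing it.
cd / && rm -rf "$start"
```

Without a quota in place the loop simply reaches max_depth; with an inode hard limit it would stop earlier, at the first mkdir that fails.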

As said above, mkdir enforces a 1:1 relationship (because directories can't be hard linked) that can't be guaranteed in the general case. For the general case, one can have a dentry cache bigger than Linus' ego while instantiating only one inode in the process.
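
A minimal sketch of the link-based case, assuming a filesystem that supports hard links. The scratch directory, file names and the link count of 100 are arbitrary illustration values. It creates one inode and then many names that all point at it, so an inode limit is charged only once while every name can still instantiate its own dentry.

```shell
#!/bin/sh
# Sketch: many names (each a potential dentry) backed by a single inode.
set -eu

dir=$(mktemp -d)     # scratch directory; hypothetical stand-in for a container
nlinks=100           # arbitrary illustration value

: > "$dir/target"    # allocate exactly one inode

i=1
while [ "$i" -le "$nlinks" ]; do
    # A new name for the same inode: no additional inode is allocated.
    ln "$dir/target" "$dir/link$i"
    i=$((i + 1))
done

# Count directory entries and distinct inode numbers.
names=$(ls "$dir" | wc -l | tr -d ' ')
inodes=$(ls -i "$dir" | awk '{print $1}' | sort -u | wc -l | tr -d ' ')

echo "names=$names inodes=$inodes"

rm -rf "$dir"
```

The number of names grows with nlinks while the number of distinct inodes stays at one, which is the N:1 relationship described above.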

And rather than a ENOMEM error (could be caused by
anything) the error is EDQUOT which is a clear indication that a
resource limit has been hit. That's a far better failure from a user
perspective because they know -exactly- why their application
failed - the container resource limits are too low....
Well, if this is the problem, I am happy returning EDQUOT if we fail to find room for more dentries, or anything else we can agree upon instead of ENOMEM.


IOWs, you don't need to touch the dentry cache -at all- to provide
the per-subtree hard resource limiting you are trying to achieve -
filesystem quotas can already achieve that for you. Project quotas
used in this manner (as directory tree quotas) provide exactly the
"per-subtree" hard resource limiting that you were trying to achieve
with your original dentry mobs proposal.
Well, I myself am done for now with the per-subtree proposal. I have been completely fine with per-sb for a while now.


The limiting can be done lazily - we don't need to limit the growth of
dentries until we start to run out of memory. If the superblock
shrinker is aware of the limits, then when it gets called by memory
reclaim it can do all the work of reducing the number of items on
the LRU down to the threshold at that time.

Well, this idea itself can be considered, independent of which path
we're taking. We can, if we want, allow the dentry cache to grow
indefinitely if we're not under memory pressure. But it kinda defies
the purpose of a hard limit...

See my comments above about filesystem quotas providing hard limits.

IOWs, the limit has no impact on performance until memory is scarce,
at which time memory reclaim enforces the limits on LRU size and
clean up happens automatically.

This also avoids all the problems of setting a limit lower than the
number of active dentries required for the workload (i.e. avoids
spurious ENOMEM errors trying to allocate dentries), allows
overcommitment when memory is plentiful (which will benefit
performance) but it brings the caches back to defined limits when
memory is not plentiful (which solves the problem you are having).
No, this is not really the problem we're having.
See above.

About ENOMEM, I don't really see what's wrong with them here.

Your backup program runs inside the container. Filesystem traversal
balloons the dentry cache footprint, and so it is likely to trigger
spurious ENOMEM when trying to read files in the container because
it can't allocate a dentry for random files as it traverses. That'll
be fun when it comes to restoring backups and discovering they
aren't complete....

There's also the "WTF caused the ENOMEM error" problem I mentioned
earlier....

For a container, running out of its assigned kernel memory should be
exactly the same as running out of real physical memory. I do agree
that it changes the feeling of the system a little bit, because it
then happens more often. But it is still right in principle.

The difference is the degree - when the system runs out of memory,
it tries -really hard- before failing the allocation. Indeed, it'll
swap, it'll free memory in other subsystems, it'll back off on disk
congestion, it will try multiple times to free memory, escalating
priority each time it retries. IOWs, it jumps through all sorts of
hoops to free memory before it finally fails. And then the memory,
more often than not, comes from some subsystem other than the dentry
cache, so it is rare that a dentry allocation actually relies on the
dentry cache (and only the dentry cache) being shrunk to provide
memory for the new dentry.

This is an apples-to-oranges comparison. If instead of using the mechanism I proposed, we go for a quota-based mechanism like you mentioned, we'll fail just as often. Just with EDQUOT instead of ENOMEM.

Humm.. while writing that I just looked back on the code, and it seems it will be hard to return anything but ENOMEM, since it is part of the interface contract. OTOH, the inode allocation function goes for the same kind of contract - returning NULL in case of error - meaning that everything of this kind that does not involve the filesystem will end up the same way. Be it in the icache, or dcache.

Your dentry hard limit has no fallback or other mechanisms to try
- if the VFS caches cannot be shrunk immediately, then ENOMEM will
occur.  There's no retries, there's no waiting for disk congestion
to clear, there's no backoff, there's no increase in reclaim
desperation as previous attempts to free dentries fail. This greatly
increases the chances of ENOMEM from _d_alloc() way above what a
normal machine would see because it doesn't have any of the
functionality that memory reclaim has. And, fundamentally, that sort
of complexity doesn't belong in the dentry cache...

I don't see how/why a user application should care. An error means "Hey Mr. Userspace, something wrong happened", not "Hey Mr. Userspace, sorry, we tried really hard, but yet could not do it".

As far as a container is concerned, the only way to mimic the behavior you described would be to allow a single container to use up at most X bytes of general kernel memory. So when the dcache reaches the wall, it can borrow from somewhere else. Not that I am considering this...


Another interesting case to consider is internally fragmented dentry
cache slabs, where the active population of the pages is sparse.
This sort of population density is quite common on machines with
sustained long term multiple workload usage (exactly what you'd
expect on a containerised system). Hence dentry allocation can be
done without increasing memory footprint at all. Likewise, freeing
dentries won't free any memory at all. In this case, what has your
hard limit bought you? An ENOMEM error in a situation where memory
allocation is actually free from a resource consumption perspective.

Again, I don't think those hard limits are about used memory. So if freeing a dentry may not free memory, I'm still fine with that. If the kernel as a whole needs memory later, it can do something to reclaim it. *Unless* I hold a dentry reference. So the solution to me seems to be not allowing more than X to be held in the first place.

Again: I couldn't care less about how much *memory* it is actually using at a certain point in time.

These are the sorts of corner case problems that hard limits on
cache sizes have. That's the problem I see with the hard limit
approach: it looks simple, but it is full of corner cases when you
look more deeply. Users are going to hit these corner cases and
want to fix those "problems". We'll have to try to fix them without
really even being able to reproduce them reliably. We'll end up
growing heuristics to try to detect when problems are about to
happen, and complexity to try to avoid those corner case problems.
We'll muddle along with something that sort of works for the cases
we can reproduce, but ultimately is untestable and unverifiable. In
contrast, a lazy lru limiting solution is simple to implement and
verify and has none of the warts that hard limiting exposes to user
applications.

What you described does not seem to me as a corner case.
"By using this option you can't use more than X entries, if you do, you'll fail" sounds pretty precise to me.


Hence I'd prefer to avoid all the warts of hard limiting by ignoring
the DOS case that leads to requiring a hard limit as it can be
solved by other existing means. Limiting the size of the inactive
cache (generally dominates cache usage) seems like a much lower
impact manner of achieving the same thing.

Again, I understand you, but I don't think we're really solving the same thing.

Like I said previously - I've had people asking me whether limiting
the size of the inode cache is possible for the past 5 years, and
all their use cases are solved by the lazy mechanism I described. I
think that most of the OpenVZ dcache size problems will also go away
with the lazy solution as well, as most workloads with a large
dentry cache footprint don't actively reference (and therefore pin)
the entire working set at the same time....

Except for the malicious ones, of course.

Cheers,

Thank you very much for your time, Dave!

