On 6/9/21 16:56, Eric W. Biederman wrote:
Johannes Weiner <hannes@xxxxxxxxxxx> writes:
On Wed, Jun 09, 2021 at 02:14:16PM -0500, Eric W. Biederman wrote:
"Enrico Weigelt, metux IT consult" <lkml@xxxxxxxxx> writes:
On 03.06.21 13:33, Chris Down wrote:
Hi folks,
Putting stuff in /proc to get around the problem of "some other metric I need
might not be exported to a container" is not a very compelling argument. If
they want it, then export it to the container...
Ultimately, if they're going to have to add support for a new
/proc/self/meminfo file anyway, these use cases should just do it properly
through the already supported APIs.
It's even a bit more complex ...
/proc/meminfo always tells what the *machine* has available, not what a
process can eat up. That has been this way even long before cgroups.
(eg. ulimits).
Even if you want a container to look more like a VM - /proc/meminfo showing
what the container (instead of the machine) has available - just looking
at the calling task's cgroup is also wrong. Because there're cgroups
outside containers (that really shouldn't be affected) and there're even
other cgroups inside the container (that further restrict below the
container's limits).
BTW: applications trying to autotune themselves by looking at
/proc/meminfo are broken-by-design anyways. This never has been a valid
metric on how much memory individual processes can or should eat.
Which brings us to the problem.
Using /proc/meminfo is not valid unless your application can know it has
the machine to itself. Something that is becoming increasingly less
common.
Unless something has changed in the last couple of years, reading values
out of the cgroup filesystem is both difficult (v1 and v2 have some
gratuitous differences) and is actively discouraged.
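To make the v1/v2 difference concrete, here is a minimal sketch of what a program has to do just to locate its memory limit file. The function name is hypothetical; the file names memory.limit_in_bytes (v1) and memory.max (v2) are the kernel's. It assumes the usual "ID:controllers:path" format of /proc/self/cgroup and does not handle cgroup namespaces or unusual mount points.

```python
def memory_limit_file(proc_self_cgroup: str):
    """Return (cgroup_version, limit file path relative to the hierarchy's mount)."""
    v2_path = None
    for line in proc_self_cgroup.splitlines():
        # Each line has the form "hierarchy-ID:controller-list:cgroup-path".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        hier_id, controllers, path = parts
        if "memory" in controllers.split(","):
            # cgroup v1: the memory controller has its own hierarchy.
            return 1, path.rstrip("/") + "/memory.limit_in_bytes"
        if hier_id == "0" and controllers == "":
            # cgroup v2 (unified hierarchy) entry; remember it as a fallback.
            v2_path = path
    if v2_path is not None:
        # In v2 the file holds either a byte count or the string "max".
        return 2, v2_path.rstrip("/") + "/memory.max"
    raise LookupError("no memory controller entry found")
```

The caller still has to find where each hierarchy is mounted and interpret two different value formats, which is the kind of burden being described here.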
So what should applications do?
Alex has found applications that are trying to do something with
meminfo, and the fields that those applications care about. I don't see
anyone making the case that specifically what the applications are
trying to do is buggy.
Alex's suggestion is to have a /proc/self/meminfo that has the information
that applications want, which would be something that would be easy
to switch applications to. The patch to userspace at that point is
as simple as 3 lines of code. I can imagine people take that patch into
their userspace programs.
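As a rough illustration of how small that userspace change could be, the sketch below prefers /proc/self/meminfo and falls back to /proc/meminfo. Note that /proc/self/meminfo is only the proposed file and is assumed here, not an existing interface; the function name is hypothetical, and MemAvailable is just one example of a field an application might read.

```python
import os

def read_mem_available_kib(paths=("/proc/self/meminfo", "/proc/meminfo")):
    """Return MemAvailable (in KiB) from the first candidate file that exists."""
    # Assumption: the proposed per-task file, if present, uses the meminfo format.
    path = next((p for p in paths if os.path.exists(p)), None)
    if path is None:
        raise FileNotFoundError("no meminfo file found")
    with open(path) as f:
        for line in f:
            # Lines look like "MemAvailable:    8000000 kB".
            key, _, rest = line.partition(":")
            if key == "MemAvailable":
                return int(rest.split()[0])
    raise KeyError("MemAvailable not found in " + path)
```

An existing parser would only need the path selection at the top; the rest is what such applications already do with /proc/meminfo.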
But is it actually what applications want?
Not all the information at the system level translates well to the
container level. Things like available memory require a hierarchical
assessment rather than just a look at the local level, since there
could be limits higher up the tree.
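A minimal sketch of such a hierarchical assessment, assuming cgroup v2 and that the caller already knows its cgroup directory and the mount root (both parameters and the function name are hypothetical). It only computes the tightest memory.max along the path; a real "available" figure would also need usage at each level, and ancestors hidden by a cgroup namespace cannot be seen this way at all.

```python
import os

def effective_memory_max(cgroup_dir: str, mount_root: str):
    """Smallest memory.max from cgroup_dir up to mount_root; None if unlimited."""
    limit = None
    current = os.path.abspath(cgroup_dir)
    root = os.path.abspath(mount_root)
    while True:
        try:
            with open(os.path.join(current, "memory.max")) as f:
                value = f.read().strip()
        except FileNotFoundError:
            value = "max"  # e.g. the root cgroup has no memory.max file
        if value != "max":
            limit = int(value) if limit is None else min(limit, int(value))
        # Stop at the mount root, or at "/" if cgroup_dir was not under it.
        if current == root or current == os.path.dirname(current):
            break
        current = os.path.dirname(current)
    return limit
```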
That sounds like a bug in the implementation of /proc/self/meminfo.
It certainly is a legitimate question to ask what are the limits
from my perspective.
Not all items in meminfo have a container equivalent, either.
Not all items in meminfo were implemented.
The familiar format is likely a liability rather than an asset.
It could be. At the same time that is the only format anyone has
proposed, so a good counter proposal would be appreciated if you don't
like the code that has been written.
The simple fact that people are using /proc/meminfo when it doesn't make
sense for anything except system monitoring tools is a pretty solid bug
report on the existing Linux APIs.
I agree that we likely need a better interface for applications to
query the memory state of their container. But I don't think we should
try to emulate a format that is a poor fit for this.
I don't think it is the container that we care about (except for maybe
system management tools). I think the truly interesting case is
applications asking what do I have available to me.
Have heard that the JRE makes assumptions on the number of threads to
use based on memory.
Lots of humans use top and vmstat to try to figure out what is available
in their environment, as do debugging tools trying to figure out why an
application is running poorly.
We would like to not need to mount the cgroup file system into a
container at all and, as Eric stated, to not have processes trying to
differentiate between cgroupv1 and cgroupv2.
We should also not speculate what users intended to do with the
meminfo data right now. There is a surprising amount of misconception
around what these values actually mean. I'd rather have users show up
on the mailing list directly and outline the broader usecase.
We are kernel developers, we can read code. We don't need to speculate.
We can read the userspace code. If things are not clear we can ask
their developers.
Eric