On Tue, Jun 15, 2021 at 01:32:22PM +0200, Christian Brauner wrote: > On Thu, Jun 03, 2021 at 12:43:07PM +0200, legion@xxxxxxxxxx wrote: > > From: Alexey Gladkov <legion@xxxxxxxxxx> > > > > The /proc/meminfo contains information regardless of the cgroups > > restrictions. This file is still widely used [1]. This means that all > > these programs will not work correctly inside container [2][3][4]. Some > > programs try to respect the cgroups limits, but not all of them > > implement support for all cgroup versions [5]. > > > > Correct information can be obtained from cgroups, but this requires the > > cgroups to be available inside container and the correct version of > > cgroups to be supported. > > > > There is lxcfs [6] that emulates /proc/meminfo using fuse to provide > > information regarding cgroups. This patch can help them. > > > > This patch adds /proc/self/meminfo that contains a subset of > > /proc/meminfo respecting cgroup restrictions. > > > > We cannot just create /proc/self/meminfo and make a symlink at the old > > location because this will break the existing apparmor rules [7]. > > Therefore, the patch adds a separate file with the same format. > > Interesting work. Thanks. This is basically a variant of what I > suggested at Plumbers and in [1]. I made the second version of the patch [1], but then I had a conversation with Eric W. Biederman offlist. He convinced me that it is a bad idea to change all the values in meminfo to accommodate cgroups. But we agreed that MemAvailable in /proc/meminfo should respect cgroups limits. This field was created to hide implementation details when calculating available memory. You can see that it is quite widely used [2]. So I want to try to move in that direction. [1] https://git.kernel.org/pub/scm/linux/kernel/git/legion/linux.git/log/?h=patchset/meminfo/v2.0 [2] https://codesearch.debian.net/search?q=MemAvailable%3A > Judging from the patches sent by Waiman Long in [2] to also virtualize > /proc/cpuinfo and /sys/devices/system/cpu this is a larger push to > provide virtualized system information to containers. > > Although somewhere in the thread here this veered off into apparently > just being a way for a process to gather information about it's own > resources. At which point I'm confused why looking at its cgroups > isn't enough. I think it's not enough. As an example: $ mount -t cgroup2 none /sys/fs/cgroup $ echo +memory > /sys/fs/cgroup/cgroup.subtree_control $ mkdir /sys/fs/cgroup/mem0 $ echo +memory > /sys/fs/cgroup/mem0/cgroup.subtree_control $ mkdir /sys/fs/cgroup/mem0/mem1 $ echo $$ > /sys/fs/cgroup/mem0/mem1/cgroup.procs I didn't set a limit and just added the shell to the group. $ cat /proc/self/cgroup 0::/mem0/mem1 $ cat /sys/fs/cgroup/mem0/mem1/memory.max max $ cat /sys/fs/cgroup/mem0/memory.max max In this case we need to use MemAvailable from /proc/meminfo. Another example: $ mount -t cgroup2 none /sys/fs/cgroup $ echo +memory > /sys/fs/cgroup/cgroup.subtree_control $ mkdir /sys/fs/cgroup/mem0 $ echo $(( 3 * 1024 * 1024 )) > /sys/fs/cgroup/mem0/memory.max $ echo +memory > /sys/fs/cgroup/mem0/cgroup.subtree_control $ mkdir /sys/fs/cgroup/mem0/mem1 $ echo $(( 3 * 1024 * 1024 * 1024 * 1024 )) > /sys/fs/cgroup/mem0/mem1/memory.max $ echo $$ > /sys/fs/cgroup/mem0/mem1/cgroup.procs $ head -3 /proc/meminfo MemTotal: 1002348 kB MemFree: 972712 kB MemAvailable: 968100 kB $ cat /sys/fs/cgroup/mem0{,/mem1}/memory.max 3145728 3298534883328 Now, I have cgroup limits, but you can write absolutely any value as a limit. So how much memory is available to shell in this case? To get this value, you need to take the minimum of MemAvailable and **/memory.max. ... or I fundamentally don't understand something. > So /proc/self/meminfo seems to just be the start. And note the two > approaches seem to diverge too. This provides a new file while the other > patchset virtualizes existing proc files/folders. > > In any case it seems you might want to talk since afaict you're all at > the same company but don't seem to be aware of each others work (Which > happens of course.). > > For the sake of history such patchsets have been pushed for before by > the Siteground people. > > Chris and Johannes made a good point that the information provided in > this file can be gathered from cgroups already. So applications should > probably switch to reading those out of their cgroup and most are doing > that already. > > And reading values out of cgroups is pretty straightforward even with > the differences between cgroup v1 and v2. Userspace is doing it all over > the place all of the time and the code has now existed for years so the > cgroup interface is a problem. And with cgroup v2 it keeps growing so > much more useful metrics that looking at meminfo isn't really cutting it > anyway. > > So I think the argument that applications should start looking at their > cgroup info if they want to find out detailed info is a solid argument > that shouldn't be easily brushed aside. > > What might be worth is knowing exactly what applications are looking at > /proc/meminfo and /proc/cpuinfo and make decision based on that info. > None of that is clearly outlined in the thread unfortunately. > > So I immediately see two types of applications that could benefit from > this patchset. The first ones are legacy applications that aren't aware > of cgroups and aren't actively maintained. Introducing such > functionality for these applications seems a weak argument. > > The second type is new and maintained applications that look at global > info such as /proc/meminfo and /proc/cpuinfo. So such applications have > ignored cgroups for a decade now. This makes it very unconvincing that > they will suddenly switch to a newly introduced file. Especially if the > entries in a new file aren't a 1:1 mapping of the old file. > > Johannes made another good point about it not being clear what > applications actually want. And he's very right in that. It seems > straightforward to virtualize things like meminfo but it actually isn't. > And it's something you quite often discover after the fact. We have > extensive experience implementing it in LXCFS in userspace. People kept > and keep arguing what information exactly is supposed to go into > calculating those values based on what best helps their use-case. > > Swap was an especially contentious point. In fact, sometimes users want > to turn of swap even though it exists on the host and there's a command > line switch in LXCFS to control that behavior. > > Another example supporting Johannes worry is virtualizing /proc/cpuinfo > where some people wanted to virtualize cpu counts based on cpu shares. > So we have two modes to virtualize cpus: based on cpuset alone or based > on cpuset and cpu shares. And both modes are actively used. And that all > really depends on application and workload. > > Finally, although LXCFS is briefly referenced in the commit message but > it isn't explained very well and what it does. > > And we should consider it since this is a full existing userspace > solution to the problem solved in this patchset including Dan's JRE > use-case. > > This is a project started in 2014 and it is in production use since 2014 > and it delivers the features of this patchset here and more. > > For example, it's used in the Linux susbystem of Chromebooks, it's used > by Alibaba (see [3]) and it is used for the JRE use-case by Google's > Anthos when migrating such legacy applications (see [4]). > > At first, I was convinced we could make use of /proc/self/meminfo in > LXCFS which is why I held back but we can't. We can't simply bind-mount > it over /proc/meminfo because it's not a 1:1 correspondence between all > fields. We could potentially read some values we now calculate and > display it in /proc/meminfo but we can't stop virtualizing /proc/meminfo > itself. So we don't gain anything from this. When Alex asked me about it > I tried to come up with good ways to integrate this but the gain is just > too little for us. > > Because our experience tells us that applications that want this type of > virtualization don't really care about heir own resources. They care > about a virtualized view of the system's resources. And the system in > question is often a container. But it get's very tricky since we don't > really define what a container is. So what data the user wants to see > depends on the used container runtime, type of container, and workload. > An application container has very different needs than a system > container that boots systemd. LXCFS can be very flexible here and > virtualize according to the users preferences (see the split between > cpuset and cpuset + cpu shares virtualization for cpu counts). > > In any case, LXCFS is a tiny FUSE filesystem which virtualizes various > procfs and sysfs files for a container: > > /proc/cpuinfo > /proc/diskstats > /proc/meminfo > /proc/stat > /proc/swaps > /proc/uptime > /proc/slabinfo > /sys/devices/system/cpu/* > /sys/devices/system/cpu/online > > If you call top in a container that makes use of this it will display > everything virtualized to the container (See [5] for an example of > /proc/cpuinfo and /sys/devices/system/cpu/*.). And JRE will not > overallocate resources. It's actively used for all of that. > > Below at [5] you can find an example where 2 cpus out of 8 have been > assigned to the container's cpuset. The container values are virtualized > as you can see. > > [1]: https://lkml.org/lkml/2020/6/4/951 > [2]: https://lore.kernel.org/lkml/YMe/cGV4JPbzFRk0@xxxxxxxxxxxxxxx > [3]: https://www.alibabacloud.com/blog/kubernetes-demystified-using-lxcfs-to-improve-container-resource-visibility_594109 > [4]: https://cloud.google.com/blog/products/containers-kubernetes/migrate-for-anthos-streamlines-legacy-java-app-modernization > [5]: ## /proc/cpuinfo > #### Host > brauner@wittgenstein|~ > > ls -al /sys/devices/system/cpu/ | grep cpu[[:digit:]] > drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu0 > drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu1 > drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu2 > drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu3 > drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu4 > drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu5 > drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu6 > drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu7 > > #### Container > brauner@wittgenstein|~ > > lxc exec f1 -- ls -al /sys/devices/system/cpu/ | grep cpu[[:digit:]] > drwxr-xr-x 2 nobody nogroup 0 Jun 15 10:22 cpu3 > drwxr-xr-x 2 nobody nogroup 0 Jun 15 10:22 cpu4 > > ## /sys/devices/system/cpu/* > #### Host > brauner@wittgenstein|~ > > grep ^processor /proc/cpuinfo > processor : 0 > processor : 1 > processor : 2 > processor : 3 > processor : 4 > processor : 5 > processor : 6 > processor : 7 > > #### Container > brauner@wittgenstein|~ > > lxc exec f1 -- grep ^processor /proc/cpuinfo > processor : 0 > processor : 1 > > ## top > #### Host > top - 13:16:47 up 15:54, 39 users, load average: 0,76, 0,47, 0,40 > Tasks: 434 total, 1 running, 433 sleeping, 0 stopped, 0 zombie > %Cpu0 : 2,7 us, 2,4 sy, 0,0 ni, 94,5 id, 0,0 wa, 0,0 hi, 0,3 si, 0,0 st > %Cpu1 : 3,3 us, 1,3 sy, 0,0 ni, 95,3 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st > %Cpu2 : 1,6 us, 9,1 sy, 0,0 ni, 89,3 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st > %Cpu3 : 2,3 us, 1,3 sy, 0,0 ni, 96,4 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st > %Cpu4 : 2,7 us, 1,7 sy, 0,0 ni, 95,7 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st > %Cpu5 : 2,9 us, 2,9 sy, 0,0 ni, 94,1 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st > %Cpu6 : 2,3 us, 1,0 sy, 0,0 ni, 96,3 id, 0,0 wa, 0,0 hi, 0,3 si, 0,0 st > %Cpu7 : 3,3 us, 1,3 sy, 0,0 ni, 95,4 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st > > #### Container > top - 11:16:13 up 2:08, 0 users, load average: 0.27, 0.36, 0.36 > Tasks: 24 total, 1 running, 23 sleeping, 0 stopped, 0 zombie > %Cpu0 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu1 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > -- Rgrds, legion