On 9/20/2023 1:07 PM, Michal Hocko wrote:
> On Wed 20-09-23 12:04:48, Jeremi Piotrowski wrote:
>> On 9/20/2023 10:43 AM, Michal Hocko wrote:
>>> On Wed 20-09-23 01:11:01, Jeremi Piotrowski wrote:
>>>> On Sun, Sep 17, 2023 at 09:12:40PM +0200, Greg Kroah-Hartman wrote:
>>>>> 6.1-stable review patch. If anyone has any objections, please let me know.
>>>>>
>>>>> ------------------
>>>>
>>>> Hi Greg/Michal,
>>>>
>>>> This commit breaks userspace, which makes it a bad commit for mainline and
>>>> an even worse commit for stable.
>>>>
>>>> We ingested 6.1.54 into our nightly testing and found that runc fails to
>>>> gather cgroup statistics (when reading kmem.limit_in_bytes). The same code
>>>> is vendored into kubelet, and kubelet fails to start if this operation
>>>> fails. 6.1.53 is fine.
>>>
>>> Could you expand some more on why the file is read? It has not supported
>>> writing for some time, so how does reading it help in any sense?
>>>
>>> Anyway, I do agree that the stable backport should be reverted.
>>>
>>
>> This file is read together with all the other memcg files. Each prefix:
>>
>> 	memory
>> 	memory.memsw
>> 	memory.kmem
>> 	memory.kmem.tcp
>>
>> is combined with these suffixes:
>>
>> 	.usage_in_bytes
>> 	.max_usage_in_bytes
>> 	.failcnt
>> 	.limit_in_bytes
>>
>> and read; the values are then forwarded to other components for scheduling
>> decisions. You want to know the limit when checking the usage (is the usage
>> close to the limit or not).
>
> You know there is no kmem limit, as there has been no way to set it for some
> time (since 5.16, i.e. 2 years ago). I can see that users following old
> kernels could have missed that, though.

I know what you mean, but I think this generally went unnoticed because the
limit file is read unconditionally, but only written when a kmem limit is
explicitly requested for a specific container, which is rarely (if ever) done.

Regarding following old kernels: a majority of Kubernetes users are still on
5.15 and have only started shifting to >=6.1 very recently (this summer). This
is mostly driven by distro vendor policies, which tend to follow the pattern of
"track LTS kernels, but don't switch to the next LTS immediately". I know this
is far from ideal for reporting these kinds of issues; I would love to report
them as soon as a kernel release happens.

>
>> Userspace tolerates MEMCG/MEMCG_KMEM being disabled, but having a single
>> file out of the set missing is an anomaly. So maybe we could keep the dummy
>> file just for the sake of consistency? Cgroupv1 is legacy after all.
>
> What we had was a dummy file. It didn't allow writing any value, so it would
> have always reported unlimited. The reason I decided to remove the file was
> that there were other users unable to handle the write failure, while they
> are just fine not having the file. So we are effectively between a rock and
> a hard place here. Either way something is broken. The other SW got fixed as
> well, but similar to your case it takes some time for the change to be
> absorbed by all 3rd-party users.
>
>>>>> Address this by wiping out the file completely and effectively get back to
>>>>> pre 4.5 era and CONFIG_MEMCG_KMEM=n configuration.
>>>>
>>>> On reads, the runc code checks for MEMCG_KMEM=n by checking
>>>> kmem.usage_in_bytes.
>
> Just one side note. Config options get renamed and their semantics change
> over time, so I would recommend never making any dependency on any specific
> one.
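Before I answer that: to make the file matrix above concrete, the collection
pass boils down to roughly the following sketch (readStat here is a
hypothetical stand-in for runc's parameter-parsing helpers, not actual runc
code):

	package main

	import (
		"fmt"
		"os"
		"path/filepath"
		"strconv"
		"strings"
	)

	// readStat (hypothetical helper): read one memcg file and parse it as
	// an unsigned integer; a missing file surfaces as an error.
	func readStat(dir, file string) (uint64, error) {
		data, err := os.ReadFile(filepath.Join(dir, file))
		if err != nil {
			return 0, err
		}
		return strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
	}

	func main() {
		const dir = "/sys/fs/cgroup/memory" // typical cgroupv1 mount point
		prefixes := []string{"memory", "memory.memsw", "memory.kmem", "memory.kmem.tcp"}
		suffixes := []string{"usage_in_bytes", "max_usage_in_bytes", "failcnt", "limit_in_bytes"}
		for _, p := range prefixes {
			for _, s := range suffixes {
				v, err := readStat(dir, p+"."+s)
				if err != nil {
					// Before 6.1.54, this only happened when a whole
					// controller was missing (e.g. MEMCG_KMEM=n).
					fmt.Printf("%s.%s: %v\n", p, s, err)
					continue
				}
				fmt.Printf("%s.%s = %d\n", p, s, v)
			}
		}
	}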
Right, what I meant is that the logic is this, with the "usage" file checked
first to determine whether the controller is available:

	value, err := fscommon.GetCgroupParamUint(path, usage)
	if err != nil {
		if name != "" && os.IsNotExist(err) {
			// Ignore ENOENT as swap and kmem controllers
			// are optional in the kernel.
			return cgroups.MemoryData{}, nil
		}
		return cgroups.MemoryData{}, err
	}

and if it is, runc proceeds to read "limit_in_bytes" and the others.

>>>> If it is present then runc expects the other cgroup files
>>>> to be there (including kmem.limit_in_bytes). So this change is not
>>>> effectively the same.
>>>>
>>>> Here's a link to the PR that would be needed to handle this change in
>>>> userspace (not merged yet and would need to be propagated through the
>>>> ecosystem):
>>>>
>>>> https://github.com/opencontainers/runc/pull/4018
>>>
>>> Thanks. Does that mean the revert is still necessary for the Linus tree,
>>> or do you expect that the fix can be merged and propagated in a
>>> reasonable time?
>>>
>>
>> We can probably get runc and currently supported Kubernetes versions
>> patched in time, before 6.6 (or the next LTS kernel) hits LTS distros.
>>
>> But there's still a bunch of users running cgroupv1 with unsupported
>> Kubernetes versions who are still taking kernel updates as they come, so
>> this might get reported again next year if it stays in mainline.
>
> I can see how 3rd-party users are hard to get aligned, but having a fix
> available should allow them to apply it. Or is there any actual roadblock
> to them adapting as soon as they hit the issue?
>

The issue with this is that these users are running a frozen set of Kubernetes
(+runc) binaries but still pull kernel updates from the base distro. These
Kubernetes versions are out of maintenance, so the code will not get fixed and
no one will release fixed binaries.

> I mean, normally I would be just fine reverting this API change because
> it is disruptive, but the only way to have the file available and not
> break somebody is to revert 58056f77502f ("memcg, kmem: further
> deprecate kmem.limit_in_bytes") as well. Or to ignore any value written
> there, but that sounds rather dubious. Although one could argue this
> would mimic the nokmem kernel option.
>

I just want to make sure we don't introduce yet another new behavior in this
legacy system. I have not seen breakage due to 58056f77502f.

Mimicking nokmem sounds good, but does this mean "don't enforce limits" (that
should be fine) or "ignore writes to the limit" (i.e. don't even store the
written limit)? The latter might have unintended consequences.
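For reference, the accommodation on the userspace side goes in the direction
of treating the missing file as "no limit" rather than failing the whole stats
collection. A minimal sketch of that idea (not the actual patch from the PR
linked above; getKmemLimit is a made-up name):

	import (
		"math"
		"os"

		"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
	)

	// getKmemLimit (sketch): report a missing memory.kmem.limit_in_bytes
	// as "unlimited" instead of an error that aborts stats collection.
	func getKmemLimit(path string) (uint64, error) {
		value, err := fscommon.GetCgroupParamUint(path, "memory.kmem.limit_in_bytes")
		if err != nil {
			if os.IsNotExist(err) {
				// Kernels that no longer create the file simply
				// have no kmem limit to report.
				return math.MaxUint64, nil
			}
			return 0, err
		}
		return value, nil
	}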