On Wed, Jul 26, 2023 at 04:44:24PM +0800, Abel Wu wrote:
> On 7/26/23 10:56 AM, Roman Gushchin wrote:
> > On Mon, Jul 24, 2023 at 11:47:02AM +0800, Abel Wu wrote:
> > > Hi Roman, thanks for taking the time to have a look!
> > >
> > > > > When in legacy mode, aka cgroupv1, the socket memory is charged
> > > > > into a separate counter memcg->tcpmem rather than ->memory, so
> > > > > the reclaim pressure of the memcg has nothing to do with the
> > > > > socket's pressure at all.
> > > >
> > > > But we still might set memcg->socket_pressure and propagate the
> > > > pressure, right?
> > >
> > > Yes, but pressure coming from memcg->socket_pressure does not mean
> > > pressure on socket memory in cgroupv1, which might lead to premature
> > > reclaim or throttling of socket memory allocation. As the following
> > > example shows:
> > >
> > >              ->memory    ->tcpmem
> > >   limit         10G         10G
> > >   usage          9G          4G
> > >   pressure      true        false
> >
> > Yes, now it makes sense to me. Thank you for the explanation.
>
> Cheers!
>
> >
> > Then I'd organize the patchset in the following way:
> > 1) cgroup v1-only fix to not throttle tcpmem based on the vmpressure
> > 2) a formal code refactoring
>
> OK, I will try to re-organize it in the next version. Thank you!
>
> >
> > > > Overall I think it's a good idea to clean these things up and thank
> > > > you for working on this. But I wonder if we can make the next step
> > > > and leave only one mechanism for both cgroup v1 and v2, instead of
> > > > having this weird setup where memcg->socket_pressure is set
> > > > differently from different paths on cgroup v1 and v2.
> > >
> > > There is some difficulty in unifying the mechanism for both cgroup
> > > designs. Throttling socket memory allocation when the memcg is under
> > > pressure only makes sense when socket memory and other usages share
> > > the same limit, which is not true for cgroupv1. Thoughts?
> >
> > I see... Generally speaking cgroup v1 is considered frozen, so we can
> > leave it as it is, except when it creates unnecessary complexity in
> > the code.
>
> Are you suggesting that the 2nd patch can be dropped and
> ->tcpmem_pressure kept as it is? Or that the 2nd patch be kept, with
> some explanation added around it as you suggested in your last reply?

I suggest splitting the code refactoring (which is not expected to bring
any functional changes) from the actual change of behavior on cgroup v1.

Re the refactoring: I see a lot of value in adding comments and making
the code more readable; I don't see that much value in merging the two
variables. But if it comes organically with the code simplification -
nice.

> >
> > I'm curious, was your work driven by some real-world problem or a
> > desire to clean up the code? Both are valid reasons of course.
>
> We (a cloud service provider) are migrating users to cgroupv2, but
> encountered some problems, among which socket memory really puts us in
> a difficult situation. There is no specific threshold for socket memory
> in cgroupv2, so it relies largely on workloads doing traffic control
> themselves.
>
> Say one workload behaves fine in cgroupv1 with 10G of ->memory and 1G
> of ->tcpmem, but will suck (or even be OOMed) in cgroupv2 with 11G of
> ->memory due to bursty socket memory usage.
>
> It's rational for the workloads to build some traffic control to
> better utilize the resources they bought, but from the kernel's point
> of view it's also reasonable to suppress the allocation of socket
> memory once there is a shortage of free memory, given that performance
> degradation is better than failure.

Yeah, I can see it. But I don't know if it's too workload-specific to
have a single-policy-fits-all-cases approach. E.g. some workloads might
prefer to have a portion of the pagecache reclaimed instead. What do
you think?

>
> Currently the mechanism of net-memcg pressure doesn't work as we
> expected; please check the discussion in [1]. Besides this, we are
> also working on mitigating the priority inversion issue introduced by
> the net protocols' global shared thresholds [2], which is related to
> the net-memcg pressure. This patchset, and maybe some others, are
> byproducts of the above work.
>
> [1] https://lore.kernel.org/netdev/20230602081135.75424-1-wuyun.abel@xxxxxxxxxxxxx/
> [2] https://lore.kernel.org/netdev/20230609082712.34889-1-wuyun.abel@xxxxxxxxxxxxx/

Thanks for the clarification!
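
To make the v1/v2 distinction discussed above concrete, here is a
minimal, self-contained userspace sketch of the decision in question.
The struct and field names are hypothetical stand-ins for the kernel's
memcg state, not the actual mem_cgroup implementation: on cgroupv1
socket memory has its own ->tcpmem counter and limit, so only pressure
on that counter should throttle socket allocations, while on cgroupv2
sockets share the single ->memory limit, so reclaim pressure
(vmpressure) on it is a meaningful signal for sockets too.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified model of a memcg; for illustration only. */
struct memcg_model {
	bool     cgroup_v2;       /* unified hierarchy? */
	uint64_t memory_usage;    /* ->memory usage (bytes) */
	uint64_t memory_limit;    /* ->memory limit (bytes) */
	uint64_t tcpmem_usage;    /* v1-only ->tcpmem usage (bytes) */
	uint64_t tcpmem_limit;    /* v1-only ->tcpmem limit (bytes) */
	bool     vmpressure;      /* reclaim pressure on ->memory */
};

/*
 * Should socket memory allocation be throttled?
 *
 * v2: socket memory is charged to ->memory, so reclaim pressure on
 *     ->memory is also relevant for sockets.
 * v1: socket memory lives in the separate ->tcpmem counter, so only
 *     that counter hitting its limit should throttle sockets; keying
 *     off vmpressure would throttle prematurely (the 9G/4G example
 *     in the table above).
 */
static bool under_socket_pressure(const struct memcg_model *m)
{
	if (m->cgroup_v2)
		return m->vmpressure;
	return m->tcpmem_usage >= m->tcpmem_limit;
}

int main(void)
{
	/* The example from the thread: ->memory 9G/10G, ->tcpmem 4G/10G. */
	struct memcg_model v1 = {
		.cgroup_v2    = false,
		.memory_usage = 9ull << 30,  .memory_limit = 10ull << 30,
		.tcpmem_usage = 4ull << 30,  .tcpmem_limit = 10ull << 30,
		.vmpressure   = true,
	};

	printf("v1 socket pressure: %d\n", under_socket_pressure(&v1));
	return 0;
}

Compiled and run, this reproduces the point of the table quoted earlier:
even though reclaim pressure on ->memory is true, socket allocations are
not throttled in the v1 case because ->tcpmem is well below its limit.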