Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx> wrote:

> On 1/3/19 11:23 AM, Michal Hocko wrote:
>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>
>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>> [...]
>>>>>> Is there any reason for your scripts to be strictly sequential here? In
>>>>>> other words why cannot you offload those expensive operations to a
>>>>>> detached context in _userspace_?
>>>>> I would say it has not to be strictly sequential. The above script is just
>>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern
>>>>> due to the complicated cluster scheduling and container scheduling in the
>>>>> production environment, for example the creation process might be scheduled
>>>>> to the same CPU which is doing force_empty. I have to say I don't know too
>>>>> much about the internals of the container scheduling.
>>>> In that case I do not see a strong reason to implement the offloding
>>>> into the kernel. It is an additional code and semantic to maintain.
>>> Yes, it does introduce some additional code and semantic, but IMHO, it is
>>> quite simple and very straight forward, isn't it? Just utilize the existing
>>> css offline worker. And, that a couple of lines of code do improve some
>>> throughput issues for some real usecases.
>> I do not really care it is few LOC. It is more important that it is
>> conflating force_empty into offlining logic. There was a good reason to
>> remove reparenting/emptying the memcg during the offline. Considering
>> that you can offload force_empty from userspace trivially then I do not
>> see any reason to implement it in the kernel.
>
> Er, I may not articulate in the earlier email, force_empty can not be
> offloaded from userspace *trivially*. IOWs the container scheduler may
> unexpectedly overcommit something due to the stall of synchronous force
> empty, which can't be figured out by userspace before it actually
> happens. The scheduler doesn't know how long force_empty would take. If
> the force_empty could be offloaded by kernel, it would make scheduler's
> life much easier. This is not something userspace could do.

If kernel workqueues are doing more work (i.e. force_empty processing),
then it seems like the time to offline could grow. I'm not sure if that's
important.

I assume that if we make force_empty an async side effect of rmdir, then
the user space scheduler would not be able to immediately assume that the
rmdir'd container's memory is available without subjecting a new container
to direct reclaim. So it seems like user space would still need a mechanism
to wait for reclaim: either the existing sync force_empty, or polling
meminfo/etc. and waiting for free memory to appear.

>>>> I think it is more important to discuss whether we want to introduce
>>>> force_empty in cgroup v2.
>>> We would prefer have it in v2 as well.
>> Then bring this up in a separate email thread please.
>
> Sure. Will prepare the patches later.
>
> Thanks,
> Yang
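
To make the "polling meminfo" option above concrete, here is a rough user space sketch of waiting for memory to show up after an rmdir. It is only an illustration under stated assumptions: the helper names (parse_meminfo_kb, wait_for_memory), the choice of the MemAvailable field, and the timeout/interval values are all hypothetical and not taken from any existing container scheduler.

```python
import time
from typing import Callable


def parse_meminfo_kb(text: str, field: str = "MemAvailable") -> int:
    """Return the value (in kB) of one field from /proc/meminfo-style text."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            # Lines look like "MemAvailable:    2000000 kB".
            return int(rest.split()[0])
    raise KeyError(field)


def read_proc_meminfo() -> str:
    """Read the live /proc/meminfo (Linux only)."""
    with open("/proc/meminfo") as f:
        return f.read()


def wait_for_memory(
    needed_kb: int,
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    read: Callable[[], str] = read_proc_meminfo,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until at least needed_kb is reported available, or time out.

    Returns True if the threshold was reached, False on timeout.
    The threshold and field are assumptions made for this sketch.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if parse_meminfo_kb(read()) >= needed_kb:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A scheduler could call wait_for_memory(needed_kb) after the rmdir and before starting the next container. Note that this only observes system-wide memory, so it cannot tell whether the freed memory came from the removed cgroup, and the caller still has to guess a timeout.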