Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx> · Fri, 4 Jan 2019 08:46:28 -0800

On 1/4/19 12:55 AM, Michal Hocko wrote:
On Thu 03-01-19 20:15:30, Yang Shi wrote:

On 1/3/19 12:01 PM, Michal Hocko wrote:
On Thu 03-01-19 11:49:32, Yang Shi wrote:
On 1/3/19 11:23 AM, Michal Hocko wrote:
On Thu 03-01-19 11:10:00, Yang Shi wrote:
On 1/3/19 10:53 AM, Michal Hocko wrote:
On Thu 03-01-19 10:40:54, Yang Shi wrote:
On 1/3/19 10:13 AM, Michal Hocko wrote:
[...]
Is there any reason for your scripts to be strictly sequential here? In
other words why cannot you offload those expensive operations to a
detached context in _userspace_?
I would say it does not have to be strictly sequential. The above script is
just an example to illustrate the pattern. But sometimes it may hit such a
pattern due to the complicated cluster scheduling and container scheduling in
the production environment; for example, the creation process might be
scheduled to the same CPU which is doing force_empty. I have to say I don't
know too much about the internals of the container scheduling.
In that case I do not see a strong reason to implement the offloading
into the kernel. It is additional code and semantics to maintain.
Yes, it does introduce some additional code and semantics, but IMHO it is
quite simple and very straightforward, isn't it? It just utilizes the existing
css offline worker. And those couple of lines of code do mitigate
throughput issues for some real usecases.
I do not really care that it is a few LOC. It is more important that it is
conflating force_empty into offlining logic. There was a good reason to
remove reparenting/emptying the memcg during the offline. Considering
that you can offload force_empty from userspace trivially then I do not
see any reason to implement it in the kernel.
Er, I may not have articulated this well in the earlier email: force_empty
cannot be offloaded from userspace *trivially*. IOW, the container scheduler
may unexpectedly overcommit something due to the stall of synchronous
force_empty, which can't be figured out by userspace before it actually
happens. The scheduler doesn't know how long force_empty would take. If
force_empty could be offloaded by the kernel, it would make the scheduler's
life much easier. This is not something userspace could do.
What exactly prevents
(
echo 1 > $memcg/force_empty
rmdir $memcg
) &

so that this sequence doesn't really block anything?
We have "restarting the same name job" logic in our usecase (I'm not quite
sure why they do so). Basically, it means to create memcg with the exact
same name right after the old one is deleted, but may have different limit
or other settings. The creation has to wait until rmdir is done. Even though
rmdir is done in background like the above, the stall still exists since
rmdir simply is waiting for force_empty.
OK, I see. This is an important detail you didn't mention previously (or
at least I didn't understand it). One thing is still not clear to me.

Sorry, I should have articulated that in the first place.

"Restarting the same job" sounds as if the memcg itself could be
recycled as well. You are saying that the setting might change but if
that is about limits then we should handle that just fine. Or what other
kind of setting changes wouldn't work properly?

We did try resizing the limit, but that may also incur costly direct reclaim
that blocks something. Other than this, we also want to reset all the
counters/stats to get clearer and cleaner resource isolation, since the
container may run different jobs although they use the same name.

If the recycling is not possible then I would suggest not reusing the
force_empty interface but adding wipe_on_destruction or a similar new knob
which would enforce reclaim on offlining. It seems we have several
people asking for something like that already.

We did have a new knob in our in-house implementation; it just did
force_empty on offlining.

So, you mean adding a new knob that just does force_empty on offlining, and
keeping force_empty's current behavior, right?

Thanks,
Yang