Re: Possible regression with cgroups in 3.11

The migration part is disabled because we had another problem with
this specific plugin. Today I saw a post on the slurm mailing list
that possibly describes the same problem, also with 3.11:
https://groups.google.com/forum/#!topic/slurm-devel/26nTXLcL3yI

Basically, I have many small jobs scheduled with a maximum runtime of 10
seconds, all starting at the same time and therefore also ending at the
same time. This reproduces the problem within seconds on my test node in
the cluster. I hope I can reproduce it on my desktop machine and come
up with a simple script, but this might take a few days.
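(A reproducer along these lines might look like the sketch below. Everything in it is a placeholder - the cgroup path, job count, runtime and allocation size - and it needs root plus a v1 memory controller mounted at /sys/fs/cgroup/memory; it skips quietly if that is not available:)

```shell
#!/bin/sh
# Hypothetical reproducer sketch: start many short jobs in per-job
# memory cgroups so they all begin and end at roughly the same time,
# then remove the per-job groups again.
BASE=/sys/fs/cgroup/memory/repro
NJOBS=20
RUNTIME=2

if [ -d /sys/fs/cgroup/memory ] && [ -w /sys/fs/cgroup/memory ]; then
    mkdir -p "$BASE"
    for i in $(seq 1 "$NJOBS"); do
        (
            mkdir "$BASE/job$i"
            # Writing 0 to tasks moves the writing task into the group.
            echo 0 > "$BASE/job$i/tasks"
            # Touch some memory so the group accumulates charges.
            head -c 1M /dev/zero > /dev/null
            sleep "$RUNTIME"
        ) &
    done
    wait
    # All jobs exited together; now remove the groups in a burst.
    for i in $(seq 1 "$NJOBS"); do
        rmdir "$BASE/job$i"
    done
    rmdir "$BASE"
else
    echo "v1 memory cgroup not usable here; skipping"
fi
```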


On Thu, Nov 28, 2013 at 6:05 PM, Michal Hocko <mhocko@xxxxxxx> wrote:
> On Tue 26-11-13 22:05:47, Markus Blank-Burian wrote:
>> > OK, this would suggest that some charges were accounted to a different
>> > group than the one on whose LRUs the corresponding pages sit, or that
>> > the charge cache (stock) is b0rked (the latter can be checked easily by
>> > making refill_stock a noop - see the patch below - I am skeptical that
>> > would help, though).
>>
>> You were right, still no change.
>>
>> > Let's rule out some usual suspects while I am staring at the
>> > code. Are the tasks migrated between groups? What is the value of
>> > memory.move_charge_at_immigrate?  Have you seen any memcg oom messages
>> > in the log?
>>
>> - I don't see anything about migration, but there is a part that sets
>> "memory.force_empty". I did not see the corresponding trace output,
>> but I will recheck this. (See
>> https://github.com/SchedMD/slurm/blob/master/src/plugins/jobacct_gather/cgroup/jobacct_gather_cgroup_memory.c)
>
>         if (xcgroup_create(&memory_ns, &memory_cg, "", 0, 0)
>          == XCGROUP_SUCCESS) {
>                 xcgroup_set_uint32_param(&memory_cg, "tasks", getpid());
>                 xcgroup_destroy(&memory_cg);
>                 xcgroup_set_param(&step_memory_cg, "memory.force_empty", "1");
>         }
>
> So the current task is moved to memory_cg which is probably root and
> then it tries to free memory by writing to force_empty.
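(For reference, in raw cgroup-v1 terms that sequence amounts to roughly the sketch below; the step-cgroup path is a placeholder, it needs root, and the guard only makes the sketch safe to run standalone:)

```shell
# Hypothetical shell equivalent of the xcgroup_* calls above, assuming
# the v1 memory hierarchy is mounted at /sys/fs/cgroup/memory.
STEP=/sys/fs/cgroup/memory/slurm/job_1/step_0   # placeholder path

if [ -w /sys/fs/cgroup/memory/tasks ] && [ -d "$STEP" ]; then
    # Writing 0 to the root group's tasks file moves the writing task
    # out of the step group ...
    echo 0 > /sys/fs/cgroup/memory/tasks
    # ... so that force_empty can then drop the step group's charges.
    echo 1 > "$STEP/memory.force_empty"
else
    echo "v1 memory hierarchy not present; skipping"
fi
```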
>
>> - the only interesting part of the release_agent is the removal of the
>> cgroup hierarchy (mountdir is /sys/fs/cgroup/memory): flock -x
>> ${mountdir} -c "rmdir ${rmcg}"
>
> OK, so only a single group is removed at a time.
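(A release_agent doing that could be sketched as below. The kernel invokes it with the emptied group's path, relative to the mount point, as its first argument; the guard is only there so the sketch is safe to run standalone:)

```shell
#!/bin/sh
# Hypothetical release_agent sketch for the v1 memory hierarchy. The
# kernel runs it once per emptied group; flock on the mount point
# serializes the rmdir against other tools touching the hierarchy.
mountdir=/sys/fs/cgroup/memory
if [ -n "$1" ] && [ -d "$mountdir" ]; then
    rmcg="$mountdir$1"
    flock -x "$mountdir" -c "rmdir \"$rmcg\""
else
    echo "usage: $0 /relative/cgroup/path"
fi
```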
>
>> - memory.move_charge_at_immigrate is "0"
>
> OK, so the pages of the moved process stay in the original group. This
> rules out races of charge with move.
>
> I have checked the charging paths and we always commit (set memcg to
> page_cgroup) to the charged memcg. The only more complicated case is
> swapin but you've said you do not have any swap active.
>
> I am getting clueless :/
>
> Is your setup easily replicable so that I can play with it?
>
>> - I never saw any OOM messages related to this problem. Before
>> reporting the first time, I explicitly checked whether this might
>> somehow be OOM-related.
>
> --
> Michal Hocko
> SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



