On Mon, Feb 5, 2024 at 1:16 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Mon 05-02-24 12:47:47, T.J. Mercier wrote:
> > On Mon, Feb 5, 2024 at 12:36 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> [...]
> > > Think of something like
> > >     timeout $TIMEOUT echo $TARGET > $MEMCG_PATH/memory.reclaim
> > > where timeout acts as a stop gap if the reclaim cannot finish in
> > > TIMEOUT.
> >
> > Yeah I get the desired behavior, but using sc->nr_reclaimed to achieve
> > it is what's bothering me.
>
> I am not really happy about this subtlety. If we have a better way then
> let's do it. Better in its own patch, though.
>
> > It's already wired up that way though, so if you want to make this
> > change now then I can try to test for the difference using really
> > large reclaim targets.
>
> Yes, please. If you want it in a separate patch then no objection from
> me of course. If you do not like the nr_to_reclaim bailout then maybe
> we can go with a simple break out flag in scan_control.
>
> Thanks!

It's a bit difficult to test under the too_many_isolated check, so I
moved the fatal_signal_pending check outside of it and tried with that.

Performing full reclaim on the /uid_0 cgroup with a 250ms delay before
SIGKILL, I got an average of 16ms better latency with sc->nr_to_reclaim
than with SWAP_CLUSTER_MAX across 20 runs (ignoring one 1s outlier in
the SWAP_CLUSTER_MAX runs). The return values from memory_reclaim are
different, since with sc->nr_to_reclaim we "succeed" and never reach
the signal_pending check that returns -EINTR, but I don't think it
matters since the return code of the killed process is 137 (SIGKILL)
in both cases.

With SWAP_CLUSTER_MAX there was an outlier at nearly 1s, and in general
the latency numbers were noisier: 2% RSD vs 13% RSD. I'm guessing that's
a function of nr_to_scan occasionally being much less than
SWAP_CLUSTER_MAX, causing nr[lru] to drain slowly. But it could also
simply have been scheduled out more often at the cond_resched in
shrink_lruvec, which would help explain the 1s outlier. I don't have
enough debug info on the outlier to say much more.

With sc->nr_to_reclaim, the largest sc->nr_reclaimed value I saw was
about 2^53 for a sc->nr_to_reclaim of 2^51, but for large memcg
hierarchies I think it's possible to get more than that. There were
only 15 cgroups under /uid_0. This is the only thing that gives me
pause, since we could touch more than 2k cgroups in shrink_node_memcgs,
each one adding 4 * 2^51, potentially overflowing sc->nr_reclaimed.
Looks testable, but I didn't get to it.
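
To spell out the arithmetic behind that last concern: sc->nr_reclaimed
is an unsigned long, so on 64-bit it wraps at 2^64. If each memcg
visited in shrink_node_memcgs can contribute up to 4 * 2^51 = 2^53
(presumably one bailout return per evictable LRU list), then
2^64 / 2^53 = 2^11 = 2048 memcgs are enough to wrap the counter, which
is why the more-than-2k-cgroups case is the part that worries me.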
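
For reference, the test modification was along these lines. This is
just a sketch from memory rather than the exact diff I ran, but the two
relevant pieces are that fatal_signal_pending is checked on every call
instead of only under the too_many_isolated throttle, and that the
return value on bailout is the thing being compared (sc->nr_to_reclaim
vs SWAP_CLUSTER_MAX):

static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
		struct lruvec *lruvec, struct scan_control *sc,
		enum lru_list lru)
{
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
	bool stalled = false;

	/*
	 * Test hack: check for a fatal signal unconditionally rather
	 * than only while throttled under too_many_isolated(), so the
	 * bailout path is exercised even when isolation limits aren't
	 * hit. Returning sc->nr_to_reclaim here (rather than
	 * SWAP_CLUSTER_MAX) is what lets the nr_reclaimed >=
	 * nr_to_reclaim checks in the callers terminate the whole
	 * reclaim walk quickly.
	 */
	if (fatal_signal_pending(current))
		return sc->nr_to_reclaim;

	while (unlikely(too_many_isolated(pgdat, is_file_lru(lru), sc))) {
		if (stalled)
			return 0;

		/* wait a bit for the reclaimer. */
		stalled = true;
		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED);
	}

	/* ... rest of shrink_inactive_list() unchanged ... */
}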
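
And in case it helps, here is roughly what I imagine the scan_control
break-out flag you mention could look like. The field name
(abort_reclaim) and where it gets set are just illustrative guesses on
my part, not anything that exists in the tree today:

struct scan_control {
	/* ... existing fields ... */
	unsigned long nr_to_reclaim;
	unsigned long nr_reclaimed;

	/*
	 * Set (e.g. where fatal_signal_pending() fires in
	 * shrink_inactive_list()) when the reclaim walk should stop as
	 * soon as possible.
	 */
	bool abort_reclaim;
};

static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
{
	struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
	struct mem_cgroup *memcg;

	memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
	do {
		/*
		 * Bail out via an explicit flag instead of inflating
		 * sc->nr_reclaimed past sc->nr_to_reclaim.
		 */
		if (sc->abort_reclaim) {
			mem_cgroup_iter_break(target_memcg, memcg);
			break;
		}

		/* ... shrink_lruvec(), shrink_slab(), etc. ... */
	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
}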