Re: [PATCH] psi: reduce min window size to 50ms

Suren Baghdasaryan <surenb@xxxxxxxxxx> · Mon, 27 Feb 2023 11:51:26 -0800

On Mon, Feb 27, 2023 at 11:19 AM Josh Hunt <johunt@xxxxxxxxxx> wrote:
>
>
>
> On 2/27/23 9:49 AM, Suren Baghdasaryan wrote:
> > On Mon, Feb 27, 2023 at 5:34 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> >>
> >> On Fri 24-02-23 13:07:57, Suren Baghdasaryan wrote:
> >>> On Fri, Feb 24, 2023 at 4:47 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> >>>>
> >>>> On Tue 14-02-23 11:34:30, Suren Baghdasaryan wrote:
> >>>> [...]
> >>>>> Your suggestion to have this limit configurable sounds like obvious
> >>>>> solution. I would like to get some opinions from other maintainers.
> >>>>> Johannes, WDYT? CC'ing Michal to chime in as well since this is mostly
> >>>>> related to memory stalls.
> >>>>
> >>>> I do not think that making this configurable helps much. Many users will
> >>>> be bound to distribution config and also it would be hard to experiment
> >>>> with a recompile cycle every time. This seems just too impractical.
> >>>>
> >>>> Is there any reason why we shouldn't allow any timeout? Shorter
> >>>> timeouts could be restricted to a priviledged context to avoid an easy
> >>>> way to swamp system by too frequent polling.
> >>>
> >>> Hmm, ok. Maybe then we just ensure that only privileged users can set
> >>> triggers and remove the min limit (use a >0 check)?
> >>
> >> This could break existing userspace which is not privileged. I would
> >> just go with CAP_SYS_NICE or similar with small (sub min) timeouts.
> >
> > Yeah, that's what I meant. /proc/pressure/* files already check for
> > CAP_SYS_RESOURCE
> > (https://urldefense.com/v3/__https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c*L1440__;Iw!!GjvTz_vk!WtI61poYlZk9kg5P1sX19RdYnUNGvBJRjnOpu8hL6IOZ_NKhuw2qZ_tAdNRwzZoQVlO4jEObYN6x$ )
> > but per-cgroup pressure files do not have this check. I think the
> > original patch which added this check
> > (https://urldefense.com/v3/__https://lore.kernel.org/all/20210402025833.27599-1-johunt@xxxxxxxxxx/__;!!GjvTz_vk!WtI61poYlZk9kg5P1sX19RdYnUNGvBJRjnOpu8hL6IOZ_NKhuw2qZ_tAdNRwzZoQVlO4jAVqIVDv$ )
> > missed the cgroup ones. This should be easy to add but I wonder if
> > that was left that way intentionally.
> >
> > CC'ing the author. Josh, Johannes is that inconsistency between system
> > pressure files and cgroup-specific ones intentional? Can we change
> > them all to check for CAP_SYS_RESOURCE?
>
> No, this was just an oversight in the original patch at least from my
> end, and did not come up during code review. Fine with me to change them
> all to use CAP_SYS_RESOURCE.

Thanks for the confirmation! Will get this fixed.

>
> Josh
>
> >
> >>
> >>>> Btw. it seems that there is is only a limit on a single trigger per fd
> >>>> but no limits per user so it doesn't sound too hard to end up with too
> >>>> much polling even with a larger timeouts. To me it seems like we need to
> >>>> contain the polling thread to be bound by the cpu controller.
> >>>
> >>> Hmm. We have one "psimon" thread per cgroup (+1 system-level one) and
> >>> poll_min_period for each thread is chosen as the min() of polling
> >>> periods between triggers created in that group. So, a bad trigger that
> >>> causes overly aggressive polling and polling thread being throttled,
> >>> might affect other triggers in that cgroup.
> >>
> >> Yes, and why that would be a problem?
> >
> > If unprivileged processes are allowed to add new triggers then a
> > malicious process can add a bad trigger and affect other legit
> > processes. That sounds like a problem to me.
> > Thanks,
> > Suren.
> >
> >> --
> >> Michal Hocko
> >> SUSE Labs