On 22.03.22 04:12, CGEL wrote:
> On Mon, Mar 21, 2022 at 04:45:40PM +0100, David Hildenbrand wrote:
>> On 20.03.22 07:13, CGEL wrote:
>>> On Fri, Mar 18, 2022 at 09:24:44AM +0100, David Hildenbrand wrote:
>>>> On 18.03.22 02:41, CGEL wrote:
>>>>> On Thu, Mar 17, 2022 at 11:05:22AM +0100, David Hildenbrand wrote:
>>>>>> On 17.03.22 10:48, CGEL wrote:
>>>>>>> On Thu, Mar 17, 2022 at 09:17:13AM +0100, David Hildenbrand wrote:
>>>>>>>> On 17.03.22 03:03, CGEL wrote:
>>>>>>>>> On Wed, Mar 16, 2022 at 03:56:23PM +0100, David Hildenbrand wrote:
>>>>>>>>>> On 16.03.22 14:34, cgel.zte@xxxxxxxxx wrote:
>>>>>>>>>>> From: Yang Yang <yang.yang29@xxxxxxxxxx>
>>>>>>>>>>>
>>>>>>>>>>> Delay accounting does not track the delay of KSM COW. When tasks
>>>>>>>>>>> have many KSM pages, they may spend a considerable amount of time
>>>>>>>>>>> waiting for KSM COW.
>>>>>>>>>>>
>>>>>>>>>>> To capture the impact of KSM COW on tasks, measure the delay when
>>>>>>>>>>> KSM COW happens. This could help users decide whether to use KSM
>>>>>>>>>>> or not.
>>>>>>>>>>>
>>>>>>>>>>> Also update tools/accounting/getdelays.c:
>>>>>>>>>>>
>>>>>>>>>>> / # ./getdelays -dl -p 231
>>>>>>>>>>> print delayacct stats ON
>>>>>>>>>>> listen forever
>>>>>>>>>>> PID     231
>>>>>>>>>>>
>>>>>>>>>>> CPU         count     real total  virtual total    delay total  delay average
>>>>>>>>>>>              6247     1859000000     2154070021     1674255063        0.268ms
>>>>>>>>>>> IO          count    delay total  delay average
>>>>>>>>>>>                 0              0            0ms
>>>>>>>>>>> SWAP        count    delay total  delay average
>>>>>>>>>>>                 0              0            0ms
>>>>>>>>>>> RECLAIM     count    delay total  delay average
>>>>>>>>>>>                 0              0            0ms
>>>>>>>>>>> THRASHING   count    delay total  delay average
>>>>>>>>>>>                 0              0            0ms
>>>>>>>>>>> KSM         count    delay total  delay average
>>>>>>>>>>>              3635      271567604            0ms
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> TBH I'm not sure how particularly helpful this is and if we want
>>>>>>>>>> this.
>>>>>>>>>>
>>>>>>>>> Thanks for replying.
>>>>>>>>>
>>>>>>>>> Users may use KSM by calling madvise(, , MADV_MERGEABLE) when they
>>>>>>>>> want to save memory; it's a tradeoff that accepts delay on KSM COW.
>>>>>>>>> Users can find out how much memory KSM saved by reading
>>>>>>>>> /sys/kernel/mm/ksm/pages_sharing, but they don't know the cost of
>>>>>>>>> the KSM COW delay, and this matters for delay-sensitive tasks. If
>>>>>>>>> users know both the saved memory and the KSM COW delay, they can
>>>>>>>>> make better use of madvise(, , MADV_MERGEABLE).
>>>>>>>>
>>>>>>>> But that happens after the effects, no?
>>>>>>>>
>>>>>>>> IOW a user already called madvise(, , MADV_MERGEABLE) and then gets
>>>>>>>> the results.
>>>>>>>>
>>>>>>> Imagine users developing or porting their applications on an
>>>>>>> experimental machine; they could take those measurements as feedback
>>>>>>> to adjust whether to use madvise(, , MADV_MERGEABLE), or on which
>>>>>>> ranges.
>>>>>>
>>>>>> And why can't they run it with and without and observe performance
>>>>>> using existing metrics (or even application-specific metrics?)?
>>>>>>
>>>>>>
>>>>> I think the reason we need this patch is the same reason we need the
>>>>> swap, reclaim and thrashing getdelays information. When the system is
>>>>> complex, it's hard to tell precisely which kernel activity impacts the
>>>>> observed performance or application-specific metrics: preemption?
>>>>> cgroup throttling? swap? reclaim? IO?
>>>>>
>>>>> So if we can get precise impact data for one factor, tuning that
>>>>> factor (for this patch, KSM) becomes more efficient.
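Side note, since the madvise(, , MADV_MERGEABLE) + pages_sharing workflow
keeps coming up in this thread: the userspace side looks roughly like the
minimal, untested sketch below. It's not taken from the patch and assumes
CONFIG_KSM with /sys/kernel/mm/ksm/run set to 1.

/* ksm-sketch.c: opt an anonymous mapping into KSM and peek at the
 * system-wide sharing counter. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64 * 1024 * 1024;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	long pages_sharing = 0;
	FILE *f;

	if (buf == MAP_FAILED)
		return 1;
	/* Identical page content gives ksmd something to merge. */
	memset(buf, 0x55, len);

	/* Mark the range as mergeable; ksmd scans and merges lazily. */
	if (madvise(buf, len, MADV_MERGEABLE))
		perror("madvise");

	/* System-wide (not per-task) indicator of how much is deduplicated. */
	f = fopen("/sys/kernel/mm/ksm/pages_sharing", "r");
	if (f) {
		fscanf(f, "%ld", &pages_sharing);
		fclose(f);
	}
	printf("pages_sharing: %ld\n", pages_sharing);
	return 0;
}

Every later write to a page that ksmd merged triggers exactly the KSM COW
being discussed here.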
>>>>
>>>> I'm not convinced that we want to make our write-fault handler more
>>>> complicated for such a corner case with an unclear, eventual use case.
>>>
>>> IIRC, KSM was designed for VMs. But recently we found KSM also works
>>> well for systems with many containers (saving about 10%~20% of total
>>> memory), and container technology is more popular today, so KSM may see
>>> wider use.
>>>
>>> To reduce the impact on the write-fault handler, we could move this job
>>> into a new function with #ifdef CONFIG_KSM inside?
>>
>> Maybe we just want to catch the impact of the write-fault handler when
>> copying more generally?
>>
> We know the kernel has different kinds of COW, some of which are
> transparent to the user. For example, a child process may cause COW, and
> the user shouldn't have to care about that performance impact: it's a
> kernel-internal mechanism the user can hardly do anything about. But KSM
> is different, the user can do the policy tuning in userspace. If we
> measure all COW, wouldn't it be noise?

Only to some degree I think. The other delays (e.g., SWAP, RECLAIM) are
also not completely transparent to the user, no? I mean, user space might
affect them to some degree with some tunables, but it's not completely
transparent for the user either.

IIRC, we have these sources of COW that result in a r/w anon page
(-> MAP_PRIVATE):

(1) R/O-mapped, (possibly) shared anonymous page (fork() or KSM)
(2) R/O-mapped, shared zeropage (e.g., KSM, read-only access to
    unpopulated page in MAP_ANON)
(3) R/O-mapped, shared file/device/... page that requires a private copy
    on modifications (e.g., MAP_PRIVATE !MAP_ANON)

Note that your current patch won't catch when KSM placed the shared
zeropage (use_zero_page). Tracking the overall overhead might be of value
I think, and it would still allow for determining how much KSM is
involved by measuring with and without KSM enabled.

>>>
>>>> IIRC, whenever using KSM you're already agreeing to eventually pay a
>>>> performance price, and the price heavily depends on other factors in
>>>> the system. Simply looking at the number of write-faults might already
>>>> give an indication what changed with KSM being enabled.
>>>>
>>> Regarding "you're already agreeing to pay a performance price": I think
>>> this shortcoming of KSM is what's putting off its wider use. It's not
>>> easy for a user/app to decide how to use madvise(, , MADV_MERGEABLE).
>>
>> ... and my point is that the metric you're introducing might absolutely
>> not be expressive for such users playing with MADV_MERGEABLE. IMHO
>> people will look at actual application performance to figure out what
>> "harm" will be done, no?
>>
>> But I do see value in capturing how many COWs we have in general --
>> either via a counter or via a delay as proposed by you.
>>
> Thanks for the affirmation. As described above, should we instead add a
> vm counter, KSM_COW?

As I'm messing with the COW logic lately (e.g., [1]) I'd welcome vm
counters for all the different kinds of COW-related events, especially:

(1) COW of an anon, !KSM page
(2) COW of a KSM page
(3) COW of the shared zeropage
(4) Reuse instead of COW

I used some VM counters myself to debug/test some of my latest changes.
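Completely untested sketch of what I have in mind; the event names and
the exact placement in wp_page_copy() are made up, not from any posted
patch:

/* include/linux/vm_event_item.h: new events next to the existing ones */
enum vm_event_item {
	/* ... existing items ... */
	COW_ANON,	/* (1) COW of an anon, !KSM page */
	COW_KSM,	/* (2) COW of a KSM page */
	COW_ZERO,	/* (3) COW of the shared zeropage */
	COW_REUSE,	/* (4) reuse instead of COW */
	NR_VM_EVENT_ITEMS
};

/* mm/memory.c: count in the path that decided to copy */
static vm_fault_t wp_page_copy(struct vm_fault *vmf)
{
	/* ... */
	if (!vmf->page)
		count_vm_event(COW_ZERO);	/* zeropage was mapped r/o */
	else if (PageKsm(vmf->page))
		count_vm_event(COW_KSM);
	else
		count_vm_event(COW_ANON);
	/* ... allocate + copy + remap as before ... */
}

with a count_vm_event(COW_REUSE) in the wp_page_reuse() path accordingly.
The counters would then simply show up in /proc/vmstat.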
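And if we rather (or additionally) go the delay-accounting route for COW
in general, I'd assume it would follow the existing
delayacct_thrashing_*() helpers -- again just a sketch with made-up names
("wpcopy" for write-protect copy), nothing I actually implemented:

/* include/linux/delayacct.h */
static inline void delayacct_wpcopy_start(void)
{
	if (current->delays)
		__delayacct_wpcopy_start();
}

static inline void delayacct_wpcopy_end(void)
{
	if (current->delays)
		__delayacct_wpcopy_end();
}

/* mm/memory.c: wrap the actual copying in wp_page_copy(), independent
 * of what kind of page we are copying from */
	delayacct_wpcopy_start();
	/* ... allocate new page, copy contents, remap writable ... */
	delayacct_wpcopy_end();

That would measure any COW, and KSM's share could still be derived by
comparing runs with and without MADV_MERGEABLE.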
>>>
>>> Is there an easier way to use KSM, enjoying the memory savings while
>>> minimizing the performance price for containers? We think it's
>>> possible, and we are working on a new patch: provide a knob for a
>>> cgroup to enable/disable KSM for all tasks in that cgroup. So if your
>>> container is delay-sensitive, just leave it alone, and if not, you can
>>> easily enable KSM without modifying app code.
>>>
>>> Before using the new knob, users might want to know the precise impact
>>> of KSM. I think write-faults are an indirect metric; if indirect
>>> metrics were good enough, why would we need taskstats and PSI? By the
>>> way, getdelays supports container statistics.
>>
>> Would anything speak against making this more generic and capturing the
>> delay for any COW, not just for KSM?
>>
> I think we'd better export data to userspace that is meaningful to the
> user. Users may not need data about kernel-internal mechanisms.

Reading Documentation/accounting/delay-accounting.rst I wonder what we
best put in there:

"Tasks encounter delays in execution when they wait for some kernel
resource to become available."

I mean, in any COW event we are waiting for the kernel to create a copy.

This could be of value even if we add separate VM counters.

[1] https://lore.kernel.org/linux-mm/20220315104741.63071-2-david@xxxxxxxxxx/T/

-- 
Thanks,

David / dhildenb