On 21.05.22 20:50, Chih-En Lin wrote:
> On Sat, May 21, 2022 at 06:07:27PM +0200, David Hildenbrand wrote:
>> On 19.05.22 20:31, Chih-En Lin wrote:
>>> When creating a user process, the kernel usually uses the
>>> Copy-On-Write (COW) mechanism to save memory and the time cost of
>>> copying. COW defers the work of copying private memory and shares it
>>> across the processes as read-only. If either process wants to write
>>> to this memory, it will page fault and copy the shared memory, so the
>>> process then gets its own private copy; this is called breaking COW.
>>
>> Yes. Lately we've been dealing with advanced COW+GUP pinnings (which
>> resulted in PageAnonExclusive, which should hit upstream soon), and
>> hearing about COW of page tables (and wondering how it will interact
>> with the mapcount, refcount, PageAnonExclusive of anonymous pages)
>> makes me feel a bit uneasy :)
>
> I saw the patch series for this and know how complicated handling COW
> of the physical pages is [1][2][3][4]. So the COW page table tends to
> restrict the sharing to the page table only. This means any
> modification to a physical page will trigger breaking COW of the page
> table.
>
> Presently, the implementation only accounts the physical page
> information to the RSS of the owner process of the COW PTE. Generally,
> the owner is the parent process. The state of the page, like refcount
> and mapcount, does not change under the COW page table.
>
> But if any situation leads to the COW page table needing to consider
> the state of the physical page, it might be fretful. ;-)

I haven't looked into the details of how GUP deals with these COW page
tables. But I suspect there might be problems with page pinning:
skipping copy_present_page() even for R/O pages is usually problematic
with R/O pinnings of pages. I might just be wrong.

>
>>>
>>> Presently, this kind of technique is only applied to the mapped
>>> memory; the entire page table still needs to be copied from the
>>> parent. Copying each page table can cost a lot of time and memory
>>> when the parent already has many page tables allocated. For example,
>>> here is the state before and after forking a process that mapped
>>> 1 GB of memory:
>>>
>>>              mmap before fork    mmap after fork
>>> MemTotal:        32746776 kB        32746776 kB
>>> MemFree:         31468152 kB        31463244 kB
>>> AnonPages:        1073836 kB         1073628 kB
>>> Mapped:             39520 kB           39992 kB
>>> PageTables:          3356 kB            5432 kB
>>
>> I'm missing the most important point: why do we care, and why should
>> we care to make our COW/fork implementation even more complicated?
>>
>> Yes, we might save some page tables and we might reduce the fork()
>> time; however, which specific workload really benefits from this, and
>> why do we really care about that workload? Without even hearing about
>> an example user in this cover letter (unless I missed it), I naturally
>> wonder about the relevance in practice.
>>
>> I assume it really only matters if we fork() relatively large
>> processes, like databases for snapshotting. However, fork() is already
>> a pretty severe performance hit due to COW, and there are alternatives
>> being developed as a replacement for such use cases (e.g., uffd-wp).
>>
>> I'm also missing a performance evaluation: I'd expect that some simple
>> workloads that use fork() might be even slower after fork() with this
>> change.
>
> The paper mentions a list of benchmarks of the time cost of on-demand
> fork. For example, on Redis, the mean fork time when taking a snapshot:
> default fork() got 7.40 ms; on-demand fork (COW PTE table) got 0.12 ms.
> But some other cases, like the response latency distribution of the
> Apache HTTP Server, do not see significant benefits from on-demand
> fork.

Thanks. I expected that snapshotting would pop up and be one of the most
prominent users that could benefit.

However, for that specific use case I am convinced that uffd-wp is the
better choice, and fork() is just the old way of doing it, having had
nothing better at hand. QEMU already implements snapshotting of VMs that
way, and I remember that Redis also intended to implement support for
uffd-wp. I'm not sure what happened with that and whether anything is
missing to make it work.
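
Just to illustrate the kind of thing I mean, here is a rough, untested
userspace sketch (the helper names are made up, error handling is
omitted, and it assumes a kernel with UFFD_FEATURE_PAGEFAULT_FLAG_WP,
i.e. 5.7+ for anonymous memory): instead of fork()ing, the region gets
write-protected, and pages are copied lazily when a monitor thread
observes WP faults.

#include <fcntl.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Set or clear uffd write-protection on a range. */
static int uffd_wp_range(int uffd, void *addr, size_t len, bool wp)
{
        struct uffdio_writeprotect wp_args = {
                .range = { .start = (unsigned long)addr, .len = len },
                .mode  = wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0,
        };

        return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp_args);
}

/* Start a snapshot of [addr, addr + len): register + write-protect. */
int snapshot_start(void *addr, size_t len)
{
        struct uffdio_api api = {
                .api = UFFD_API,
                .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
        };
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)addr, .len = len },
                .mode  = UFFDIO_REGISTER_MODE_WP,
        };
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

        if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
            ioctl(uffd, UFFDIO_REGISTER, &reg) ||
            uffd_wp_range(uffd, addr, len, true))
                return -1;
        /*
         * A monitor thread now read()s struct uffd_msg from uffd; for a
         * pagefault message with UFFD_PAGEFAULT_FLAG_WP set, it saves
         * the old page contents into the snapshot and resolves the
         * fault by clearing write-protection via
         * uffd_wp_range(..., false).
         */
        return uffd;
}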

>
> For the COW page table from this patch, I also used perf to analyze
> the time cost. But it does not look different from the default fork.

Interesting, thanks for sharing.

>
> Here is the report; mmap-sfork is the COW page table version:
>
>  Performance counter stats for './mmap-fork' (100 runs):
>
>            373.92 msec task-clock        #    0.992 CPUs utilized    ( +- 0.09% )
>                 1      context-switches  #    2.656 /sec             ( +- 6.03% )
>                 0      cpu-migrations    #    0.000 /sec
>               881      page-faults       #    2.340 K/sec            ( +- 0.02% )
>     1,860,460,792      cycles            #    4.941 GHz              ( +- 0.08% )
>     1,451,024,912      instructions      #    0.78  insn per cycle   ( +- 0.00% )
>       310,129,843      branches          #  823.559 M/sec            ( +- 0.01% )
>         1,552,469      branch-misses     #    0.50% of all branches  ( +- 0.38% )
>
>          0.377007 +- 0.000480 seconds time elapsed  ( +- 0.13% )
>
>  Performance counter stats for './mmap-sfork' (100 runs):
>
>            373.04 msec task-clock        #    0.992 CPUs utilized    ( +- 0.10% )
>                 1      context-switches  #    2.660 /sec             ( +- 6.58% )
>                 0      cpu-migrations    #    0.000 /sec
>               877      page-faults       #    2.333 K/sec            ( +- 0.08% )
>     1,851,843,683      cycles            #    4.926 GHz              ( +- 0.08% )
>     1,451,763,414      instructions      #    0.78  insn per cycle   ( +- 0.00% )
>       310,270,268      branches          #  825.352 M/sec            ( +- 0.01% )
>         1,649,486      branch-misses     #    0.53% of all branches  ( +- 0.49% )
>
>          0.376095 +- 0.000478 seconds time elapsed  ( +- 0.13% )
>
> So, COW of the page table may reduce the time of forking, but it does
> so by transferring the copy work to the operations that later modify
> the physical pages.

Right.

>
>> I have tons of questions regarding rmap, accounting, GUP, page table
>> walkers, OOM situations in page walkers, but at this point I am not
>> (yet) convinced that the added complexity is really worth it. So I'd
>> appreciate some additional information.
>
> It seems like I have a lot of work to do. ;-)

Messing with page tables and COW is usually like opening a can of
worms :)

-- 
Thanks,

David / dhildenb