On 20.11.24 16:58, Nikita Kalyazin wrote:
> On 20/11/2024 15:13, David Hildenbrand wrote:
>> Hi!
> Hi! :)
>>> Results:
>>> - MAP_PRIVATE: 968 ms
>>> - MAP_SHARED: 1646 ms
>>
>> At least here it is expected to some degree: as soon as the page cache
>> is involved, map/unmap gets slower, because we are effectively
>> maintaining two data structures (page tables + page cache) instead of
>> only a single one (page tables).
>>
>> Can you make sure that THP/large folios don't interfere in your
>> experiments (e.g., madvise(MADV_NOHUGEPAGE))?
> I was using the transparent_hugepage=never command line argument in my
> testing.
> $ cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
> Is that sufficient to exclude the THP/large folio factor?
Yes!
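
For completeness: a per-mapping opt-out also works, independent of the
global setting. A minimal userspace sketch (disable_thp_for_mapping and
addr/len are just placeholders standing in for your benchmark's mapping):

#include <stdio.h>
#include <sys/mman.h>

static void disable_thp_for_mapping(void *addr, size_t len)
{
        /* Ask the kernel not to back this range with huge pages. */
        if (madvise(addr, len, MADV_NOHUGEPAGE))
                perror("madvise(MADV_NOHUGEPAGE)");
}

With transparent_hugepage=never this is effectively a no-op, but it keeps
the benchmark independent of the system-wide configuration.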
>>> While this logic is intuitive, its performance effect is more
>>> significant than I would expect.
>>
>> Yes. How much of the performance difference would remain if you hack out
>> the atomic op just to play with it? I suspect there will still be some
>> difference.
> I have tried that, but could not see any noticeable difference in the
> overall results.
> It looks like a big portion of the bottleneck has moved from
> shmem_get_folio_gfp/folio_mark_uptodate to
> finish_fault/__pte_offset_map_lock somehow. I have no good explanation
> for why:
That's what I assumed. The profiling results can be rather fuzzy and
misleading with micro-benchmarks. :(
> Orig:
>   - 69.62% do_fault
>      + 44.61% __do_fault
>      + 20.26% filemap_map_pages
>      + 3.48% finish_fault
>
> Hacked:
>   - 67.39% do_fault
>      + 32.45% __do_fault
>      + 21.87% filemap_map_pages
>      + 11.97% finish_fault
>
> Orig:
>   - 3.48% finish_fault
>      - 1.28% set_pte_range
>           0.96% folio_add_file_rmap_ptes
>      - 0.91% __pte_offset_map_lock
>           0.54% _raw_spin_lock
>
> Hacked:
>   - 11.97% finish_fault
>      - 8.59% __pte_offset_map_lock
>         - 6.27% _raw_spin_lock
>              preempt_count_add
>        1.00% __pte_offset_map
>      - 1.28% set_pte_range
>         - folio_add_file_rmap_ptes
>              __mod_node_page_state
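
FWIW, by "hack out the atomic op" I meant something along these lines
(untested sketch against include/linux/page-flags.h, assuming the cost in
question is the atomic set_bit() in folio_mark_uptodate(); obviously not
something to ever send upstream):

static __always_inline void folio_mark_uptodate(struct folio *folio)
{
        /*
         * Keep the barrier: the stores that brought the folio uptodate
         * must be visible before PG_uptodate can be observed.
         */
        smp_wmb();
        /* Non-atomic variant, purely to measure the cost of the RMW. */
        __set_bit(PG_uptodate, folio_flags(folio, 0));
}

The non-atomic variant is of course racy against concurrent flag updates;
it's only good enough for playing with the numbers.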
>> Note that we might improve allocation times with guest_memfd when
>> allocating larger folios.
> I suppose it may not always be an option, depending on the requirements
> for consistency of the allocation latency. E.g. if a large folio isn't
> available at the time, the performance would degrade to the base case
> (please correct me if I'm missing something).
Yes, there are cons to that.
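
To make the tradeoff concrete, the kind of allocation path I have in mind
would look roughly like this (hypothetical sketch, not actual guest_memfd
code; gmem_alloc_folio is a made-up name, and it assumes the mapping
supports large folios and that index is aligned to the requested order):

static struct folio *gmem_alloc_folio(struct address_space *mapping,
                                      pgoff_t index, unsigned int order)
{
        struct folio *folio;
        gfp_t gfp = mapping_gfp_mask(mapping);

        /* Try the large order first; this can fail under fragmentation. */
        folio = filemap_alloc_folio(gfp, order);
        if (!folio) {
                /* Fall back to order-0: latency is back to the base case. */
                folio = filemap_alloc_folio(gfp, 0);
                if (!folio)
                        return NULL;
        }

        /* Insert into the page cache; caller handles -EEXIST and friends. */
        if (filemap_add_folio(mapping, folio, index, gfp)) {
                folio_put(folio);
                return NULL;
        }
        return folio;
}

So yes, when the large allocation fails you are back to the base-case
latency; the win is only there as long as memory isn't too fragmented.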
>> Heh, now I spot that your comment was sent as a reply to a series.
> Yeah, sorry if it wasn't obvious.
>> If your ioctl is supposed to do more than "allocating memory" like
>> MAP_POPULATE/MADV_POPULATE+* ... then POPULATE is a suboptimal choice.
>> Because for allocating memory, we would want to use fallocate() instead.
>> I assume you want to "allocate+copy"?
> Yes, the ultimate use case is "allocate+copy".
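
Right, and just to make the split concrete: the pure "allocate" half is
already covered by fallocate() on the guest_memfd fd today; it's only the
"copy" half that needs something new. Userspace sketch (gmem_preallocate
is a made-up helper, gmem_fd a descriptor returned by
KVM_CREATE_GUEST_MEMFD):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

static int gmem_preallocate(int gmem_fd, off_t offset, off_t len)
{
        /* mode 0: preallocate the backing pages for the range. */
        if (fallocate(gmem_fd, 0, offset, len)) {
                perror("fallocate(guest_memfd)");
                return -1;
        }
        return 0;
}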
>> I'll note that, as we're moving in the direction of moving
>> guest_memfd.c into mm/guestmem.c, we'll likely want to avoid "KVM_*"
>> ioctls and think about something generic.
> Good point, thanks. Are we at the stage where some concrete API has
> been proposed yet? I might have missed that.
People are working on it, and we're figuring out some remaining details
(e.g., page_type to intercept folio_put()). I assume we'll see a new
RFC soonish (famous last words), but it's not been proposed yet.
>> Any clue how your new ioctl will interact with the WIP support for
>> shared memory as part of guest_memfd? For example, could it be
>> reasonable to "populate" the shared memory first (via a VMA) and then
>> convert that "allocated+filled" memory to private?
> No, I can't immediately see why it shouldn't work. My main concern
> would probably still be about the latency of the population stage, as I
> can't see why it would improve compared to what we have now, because my
> feeling is this is linked to the "sharedness" property of guest_memfd.
If the problem is the "pagecache" overhead, then yes, it will be a
harder nut to crack. But maybe there are some low-hanging fruit to
optimize? Finding the main cause of the added overhead would be
interesting.
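
To spell out the flow I was thinking of (rough userspace sketch; mmap()
of guest_memfd shared memory is still WIP, and populate_then_make_private
as well as vm_fd/gmem_fd/gpa/payload are placeholders):

#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <linux/kvm.h>

static int populate_then_make_private(int vm_fd, int gmem_fd, __u64 gpa,
                                      off_t offset, size_t size,
                                      const void *payload)
{
        /* 1) Fill the memory while the range is still shared. */
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                       gmem_fd, offset);
        if (p == MAP_FAILED)
                return -1;
        memcpy(p, payload, size);
        munmap(p, size);

        /* 2) Flip the (already populated) range to private. */
        struct kvm_memory_attributes attr = {
                .address = gpa,
                .size = size,
                .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
        };
        return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr);
}

Whether the fault/population cost of step 1 can be reduced is of course
exactly the open question above.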
--
Cheers,
David / dhildenb