On 20.08.21 22:25, Peter Xu wrote:
Hi, Tiberiu,
On Fri, Aug 20, 2021 at 05:10:20PM +0000, Tiberiu Georgescu wrote:
Currently, the missing information for shmem is this:
1. Difference between is_swap(pte) and is_none(pte).
* is_swap(pte) is always false;
* is_none(pte) is true when is_swap() should have been;
* is_present(pte) is fine.
2. swp_entry(pte)
Particularly, swp_type() and swp_offset().
3. SOFT_DIRTY_BIT
This is not always missing for shmem.
Once 4 is written to clear_refs, if the page is dirtied, the bit is fine as long as it
is still in memory. If the page is swapped out, the bit is lost. Then, if the page is
brought back into memory, the bit is still lost.
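For reference, here is a minimal sketch of how userspace decodes a single 64-bit pagemap entry today, to make items 1-3 concrete. The bit positions follow Documentation/admin-guide/mm/pagemap.rst; the helper names (decode_pagemap_entry, read_pagemap_entry) are made up for illustration and are not part of any patch in this thread. For swapped-out shmem pages the kernel currently reports neither the swap bit nor the swap entry, so this decoder would classify them as "none".

```python
import os
import struct

# Bit layout of one pagemap entry (Documentation/admin-guide/mm/pagemap.rst).
PM_PFN_MASK = (1 << 55) - 1          # bits 0-54: PFN if present
PM_SWAP_TYPE_MASK = (1 << 5) - 1     # bits 0-4: swap type if swapped
PM_SWAP_OFFSET_MASK = (1 << 50) - 1  # bits 5-54: swap offset if swapped
PM_SOFT_DIRTY = 1 << 55
PM_SWAP = 1 << 62
PM_PRESENT = 1 << 63


def decode_pagemap_entry(entry):
    """Split a raw 64-bit pagemap entry into the fields discussed above."""
    present = bool(entry & PM_PRESENT)
    swapped = bool(entry & PM_SWAP)
    info = {
        "present": present,
        "swapped": swapped,
        # For shmem, "none" currently also covers swapped-out pages.
        "none": not present and not swapped,
        "soft_dirty": bool(entry & PM_SOFT_DIRTY),
    }
    if swapped:
        info["swap_type"] = entry & PM_SWAP_TYPE_MASK
        info["swap_offset"] = (entry >> 5) & PM_SWAP_OFFSET_MASK
    elif present:
        # The PFN reads as 0 without CAP_SYS_ADMIN.
        info["pfn"] = entry & PM_PFN_MASK
    return info


def read_pagemap_entry(pid, vaddr):
    """Hypothetical helper: fetch the raw entry for one virtual address."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    fd = os.open(f"/proc/{pid}/pagemap", os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, (vaddr // page_size) * 8)
    finally:
        os.close(fd)
    return struct.unpack("=Q", raw)[0]
```

Usage would be along the lines of decode_pagemap_entry(read_pagemap_entry(os.getpid(), addr)) for a page-aligned addr inside a mapping.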
For 1, you mentioned how lseek() and madvise() can be used to get this
information [2], and I proposed a different method with a little help from
the current pagemap[3]. They have slightly different output and applications, so
the difference should be taken into consideration.
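As a rough illustration of the lseek() side of this (a sketch only, not necessarily the exact procedure from [2]), the loop below walks a file with SEEK_DATA/SEEK_HOLE and returns its data extents. The assumption here is that on tmpfs a data extent covers pages that are either resident or swapped out, while holes are pages that were never populated; combined with the present bit from the pagemap, that would separate "swapped" from "none". The function name data_extents is made up for this example.

```python
import errno
import os


def data_extents(fd, size):
    """Return [(start, end), ...] byte ranges of fd that are not holes."""
    extents = []
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as e:
            # ENXIO: no more data at or after this offset.
            if e.errno == errno.ENXIO:
                break
            raise
        # Every file has an implicit hole at EOF, so this always succeeds
        # and returns an offset greater than start.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        extents.append((start, end))
        offset = end
    return extents
```

Note that the result is per file offset, not per virtual address, so the caller still has to translate through the mapping's file offset, and the answer can be stale by the time it is used.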
For 2, if anyone knows of a clean way to retrieve the missing information,
please let us know.
As for 3, AFAIK, we will need to leverage Peter's special PTE marker mechanism
and implement it in another patch.
[2]: https://lore.kernel.org/lkml/5766d353-6ff8-fdfa-f8f9-764e8de9b5aa@xxxxxxxxxx/
[3]: https://lore.kernel.org/lkml/B130B700-B3DB-4D07-A632-73030BCBC715@xxxxxxxxxxx/
============================
For completeness, I would like to mention Peter's RFC[4] and my own patch[5],
which deal with adding missing functionality to the pagemap when pages are
shmem/tmpfs.
Peter's patch[4] adds the missing information at 1 to the pagemap, with very little performance overhead. AFAIK, it is still WIP.
My patch[5] fixes both 1 and 2, at the expense of a significant loss in performance
when dealing with swapped out shared pages. This performance loss can be
reduced with batching, for use cases when high performance matters. Also, this
patch on top of Peter's RFC yields better performance[6], although it is still 2x as
slow on average compared to pre-patch.
Peter's patch has a config flag, and I intend to add one to mine in the next
version. So I wanted to propose that, if no sufficient alternative exists yet (that
is, mincore, lseek, map_files or otherwise fall short), we upstream our patches
once they are ready, so that users can toggle them on or off depending on whether
they need the extra functionality. And, of course, we document their usage.
If neither sounds like a particularly useful/convenient option, we might need to
look into designs of retrieving the missing information via another mechanism
(sysfs, ioctl, netlink, etc.).
That is, unless we find that we can/should still place this info in the pagemap, for
the sake of correctness and completeness. For that, though, we should agree
on what we expect the pagemap to do in the end. Is shmem/tmpfs out of
bounds for it or not?
[4]: https://lore.kernel.org/lkml/20210807032521.7591-1-peterx@xxxxxxxxxx/
[5]: https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@xxxxxxxxxxx/
[6]: https://lore.kernel.org/lkml/C0DB3FED-F779-4838-9697-D05BE96C3514@xxxxxxxxxxx/
Thanks for summarizing the issues.
Before going further, I really would like to understand a few questions that I
already raised in the other thread here:
https://lore.kernel.org/lkml/YR%2F+gfL8RCP8XoB1@t490s/
They're:
(1) Does mincore() already suit your needs?
(2) What would you like to do with swap entries in pagemap?
I'm more interested in question (2) because I never figured it out before, and
I really don't see how it would work even if the kernel can share the swap format
with userspace. E.g., right after you decide to "zero copy" that page, the page
can be faulted in just before live migration finishes, and it can be dirtied
again. Then the page on the shared network storage will be stale, and so will
the swap entry you just scanned.
I wonder if one should much rather try using shared file-backed memory
located on a network storage instead of hacking into swap here.
--
Thanks,
David / dhildenb