On 29.01.24 23:46, Mike Snitzer wrote:
On Mon, Jan 29 2024 at 5:12P -0500,
Dave Chinner <david@xxxxxxxxxxxxx> wrote:
On Mon, Jan 29, 2024 at 12:19:02PM -0500, Mike Snitzer wrote:
While I'm sure this legacy application would love to not have to
change its code at all, I think we can all agree that we need to just
focus on how best to advise applications that have mixed workloads to
accomplish efficient mmap+read of both sequential and random.
To that end, I heard Dave clearly suggest 2 things:
1) update MADV/FADV_SEQUENTIAL to set file->f_ra.ra_pages to
bdi->io_pages, not bdi->ra_pages * 2
2) Have the application first issue MADV_SEQUENTIAL to convey that the
following MADV_WILLNEED is for a sequential file load (so it is
desirable to use larger ra_pages)
This overrides the default of bdi->ra_pages and _should_ provide the
required per-file duality of control for readahead, correct?
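The call ordering from suggestion 2) can be sketched as below. This is
only an illustration of the userspace side: advise_sequential_load() is
a hypothetical helper name, and whether the kernel then uses a larger
readahead window depends on change 1) above, which is a proposal in this
thread and not something the sketch can guarantee.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: mark a mapped range as sequentially accessed,
 * then ask for it to be read ahead. addr must be page-aligned, as
 * madvise() requires. Returns 0 on success, -1 with errno set.
 */
static int advise_sequential_load(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);

    /* First convey the access pattern for the range... */
    if (madvise(addr, len, MADV_SEQUENTIAL) != 0)
        return -1;

    /* ...then hint that the range will be needed soon. */
    return madvise(addr, len, MADV_WILLNEED);
}
```

A caller would pass the address and length returned by mmap() of the
file to be loaded; both calls remain hints and give no population
guarantee.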
I just discovered MADV_POPULATE_READ - see my reply to Ming
up-thread about that. The application should use that instead of
MADV_WILLNEED because it gives cache population guarantees that
WILLNEED doesn't. Then we can look at optimising the performance of
MADV_POPULATE_READ (if needed), as its constrained scope lets us
optimise in ways that we cannot with WILLNEED.
Nice find! Given commit 4ca9b3859dac ("mm/madvise: introduce
MADV_POPULATE_(READ|WRITE) to prefault page tables"), I've cc'd David
Hildenbrand just so he's in the loop.
Thanks for CCing me.
MADV_POPULATE_READ is indeed different; it doesn't give hints (not
"might be a good idea to read some pages" like MADV_WILLNEED documents),
it forces swapin/read/.../.
In a sense, MADV_POPULATE_READ is similar to simply reading one byte
from each PTE, triggering page faults, but without actually reading
from the target pages.
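To make the comparison concrete, here is a minimal sketch that tries
MADV_POPULATE_READ and falls back to touching one byte per page.
populate_range() is a hypothetical helper name. The fallback define of
22 is an assumption matching the generic uapi header, for libc headers
that predate the flag; treating EINVAL as "flag not supported" is also
an assumption, since EINVAL can equally mean a misaligned address.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22   /* assumed value from asm-generic uapi */
#endif

/*
 * Hypothetical helper: prefault [addr, addr + len) readable.
 * addr must be page-aligned. Returns 0 on success, -1 with errno set.
 */
static int populate_range(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);

    /* Preferred path: the kernel is told the exact range to populate. */
    if (madvise(addr, len, MADV_POPULATE_READ) == 0)
        return 0;
    if (errno != EINVAL)
        return -1;

    /*
     * Fallback (assumes EINVAL means the kernel lacks the flag):
     * read one byte per page so that each page fault populates it.
     * Here the kernel sees only individual faults, not the range.
     */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    volatile const unsigned char *p = addr;

    for (size_t off = 0; off < len; off += page)
        (void)p[off];
    return 0;
}
```

Unlike the fallback loop, the madvise() path reports errors through its
return value instead of delivering a signal on a failed fault.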
MADV_POPULATE_READ has a conceptual benefit: we know exactly how much
memory user space wants to have populated (which range). In contrast,
page faults contain no such hints and we have to guess based on
historical behavior. One could use that range information to *not* do
any faultaround/readahead when we come via MADV_POPULATE_READ, and
really only populate the range of interest.
Further, one can use that range information to allocate larger folios,
without having to guess where placement of a large folio is reasonable,
and which size we should use.
FYI, I proactively raised feedback and questions to the reporter of
this issue:
CONTEXT: madvise(WILLNEED) doesn't convey the nature of the access,
sequential vs random, just the range that may be accessed.
Indeed. The "problem" with MADV_SEQUENTIAL/MADV_RANDOM is that it will
fragment/split VMAs. So applying it to smaller chunks (like one would do
with MADV_WILLNEED) is likely not a good option.
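The VMA splitting can be observed from user space with a rough sketch
like the following, which counts lines in /proc/self/maps before and
after advising a single page in the middle of a mapping. Both function
names are hypothetical, and the use of an anonymous mapping is just to
keep the example self-contained. On a typical Linux kernel one would
expect the count to grow, since the advised page ends up in a VMA of
its own, but the exact number is not something the sketch guarantees.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count VMAs as lines in /proc/self/maps; returns -1 on error. */
static long count_vmas(void)
{
    char buf[4096];
    long lines = 0;
    ssize_t n;
    int fd = open("/proc/self/maps", O_RDONLY);

    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] == '\n')
                lines++;
    close(fd);
    return n < 0 ? -1 : lines;
}

/*
 * Map npages anonymous pages, apply `advice` to the second page only,
 * and return how many VMAs that added (>= 0), or -1 on error.
 */
static long vmas_added_by_subrange_advice(size_t npages, int advice)
{
    assert(npages >= 3);

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * page;
    char *base = mmap(NULL, len, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    long before = count_vmas();
    int rc = madvise(base + page, page, advice);
    long after = count_vmas();

    munmap(base, len);
    if (rc != 0 || before < 0 || after < 0)
        return -1;
    return after - before;
}
```

Calling vmas_added_by_subrange_advice(8, MADV_RANDOM) gives a feel for
the cost of applying the advice per chunk instead of per mapping.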
--
Cheers,
David / dhildenb