On 2/2/23 07:15, David Howells wrote:
David Hildenbrand <david@xxxxxxxxxx> wrote:
At first, I wondered if that's related to shared anonymous pages getting
pinned R/O that would trigger COW-unsharing ... but I don't even see where we
are supposed to use FOLL_PIN vs. FOLL_GET here? IOW, we're not even supposed
to access user space memory (neither FOLL_GET nor FOLL_PIN) but still end up
with a change in behavior.
I'm not sure that using FOLL_PIN is part of the problem here.
I agree. It's really not.
sendfile() is creating a transient buffer attached to a pipe, doing a DIO read
into it (which uses iov_iter_extract_pages() to populate a bio) and then feeds
the pages from the pipe one at a time using a BVEC-type iterator to a buffered
write.
Note that iov_iter_extract_pages() does not pin/get pages when accessing a
BVEC, KVEC, XARRAY or PIPE iterator. However, in this case, this should not
matter as the pipe is holding refs on the pages in the buffer.
I have found that the issue goes away if I add an extra get_page() into
iov_iter_extract_pipe_pages() and don't release it (the page is then leaked).
This makes me think that the problem might be due to the pages getting
recycled from the pipe before DIO has finished writing to them - but that
shouldn't be the case as the splice has to be synchronous - we can't mark data
in the pipe as 'produced' until we've finished reading it.
So I thought about this for a while, and one really big glaring point
that stands out for me is: before this commit, we had this:
iov_iter_get_pages()
__iov_iter_get_pages_alloc()
pipe_get_pages()
get_page()
But now, based on the claim from various folks that "pipe cases don't
require a get_page()", we have boldly--too boldy, I believe-- moved
directly into case that doesn't do a get_page():
iov_iter_extract_pipe_pages()
...(nothing)
And your testing backs this up: adding the get_page() back hides the
failure.
So that's as far as I've got, but I am really suspecting that this is
where the root cause is: pipe references are not as locked down as we
think they are. At least in this case.
thanks,
--
John Hubbard
NVIDIA