On Fri, Sep 09, 2016 at 12:55:21AM +0100, Al Viro wrote: > On Fri, Sep 09, 2016 at 07:38:35AM +1000, Dave Chinner wrote: > > > It's not an XFS specific problem: any filesystem that supports hole > > punch and it's fallocate() friends needs this high level splice IO > > exclusion as well. > > How is hole punch different from truncate()? My reading of the situation > is that we don't need exclusion between that and insertion into pipe; > only for "gather uptodate page references" part. If some page gets > evicted afterwards... how is that different from having that happen > right after we'd finished with ->splice_read()? Am I missing something > subtle in there? generic_file_splice_read() gathers pages into spd.pages[], taking a refernce to them. The pages are not locked. truncate does things in this order: move EOF, invalidate page cache, free disk space So if we race iwth a truncate, the pages in spd.pages[] that are beyond the new EOF may or may not have been removed from the page cache. The splice code handles this specific race condition by again checking the uptodate page against the current EOF before updating the spd to include it: fill_it: /* * i_size must be checked after PageUptodate. */ isize = i_size_read(mapping->host); end_index = (isize - 1) >> PAGE_SHIFT; if (unlikely(!isize || index > end_index)) break; At this point, if the page is inside isize we know it has good data in it, and we can hand it off to whoever. The problem with hole punch or an extent shift is that the size does not change and so the invalidated page is still within the valid range of the file. Hence if we race with invalidation here, it does not get caught and what we put into the buffer does not reflect the data in the file at the time the pipe buffer is built. This isn't specific to splice - it's the same issue for all page cache lookup and validation checks. This issue is one of the reasons why XFS has a MMAPLOCK similar to the IOLOCK - we can't take the IOLOCK in the page fault path, but we still need to protect page faults against racing page invalidations within EOF from operations like hole punch. > Again, what I propose is a new iov_iter flavour. Backed by pipe_buffer array, > used only for reads (i.e. copy to, not copy from). Three states for element: > pagecache one, copied data, empty. Semantics: > * copy_page_to_iter(): grab a reference to page and stick it into > the next element (making it a pagecache one) with offset and len coming > directly from arguments. > * copy_to_iter(): if the last element is a 'copied data' with empty > space remaining - copy to the end. Otherwise allocate a new page and stick > it into the next element (making it 'copied data'), then copy into it. If > still not all data copied, do the same for the next element, etc. Of course, > if there's no elements left, we are done copying. > * zero_iter(): ditto, with s/copy/fill with zeroes/ > * iov_iter_get_pages(): allocate pages, stick them into the next > slots (making those 'copied data'). That might need some changes, though - > I'm still looking through the users. The tricky part is decision when to > update the lengths. > * iov_iter_get_pages_alloc(): not sure, hadn't really looked yet. > * iov_iter_alignment(): probably just returns 0. > * iov_iter_advance(): probably like bvec variant. > Sounds reasonable, but the iter stuff makes my head hurt so I haven't thought about it that deeply yet. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs