On Sat, May 16, 2015 at 03:38:04PM -0700, David Lang wrote:
> On Fri, 15 May 2015, Mel Gorman wrote:
>
> >On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote:
> >>
> >>
> >>On 05/15/2015 01:09 AM, Mel Gorman wrote:
> >>>On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
> >>>>On 05/14/2015 08:06 PM, Daniel Phillips wrote:
> >>>>>>The issue is that things like ptrace, AIO, infiniband
> >>>>>>RDMA, and other direct memory access subsystems can take
> >>>>>>a reference to page A, which Tux3 clones into a new page B
> >>>>>>when the process writes it.
> >>>>>>
> >>>>>>However, while the process now points at page B, ptrace,
> >>>>>>AIO, infiniband, etc will still be pointing at page A.
> >>>>>>
> >>>>>>This causes the process and the other subsystem to each
> >>>>>>look at a different page, instead of at shared state,
> >>>>>>causing ptrace to do nothing, AIO and RDMA data to be
> >>>>>>invisible (or corrupted), etc...
> >>>>>
> >>>>>Is this a bit like page migration?
> >>>>
> >>>>Yes. Page migration will fail if there is an "extra"
> >>>>reference to the page that is not accounted for by
> >>>>the migration code.
> >>>
> >>>When I said it's not like page migration, I was referring to the fact
> >>>that a COW on a pinned page for RDMA is a different problem to page
> >>>migration. The COW of a pinned page can lead to lost writes or
> >>>corruption depending on the ordering of events.
> >>
> >>I see the lost writes case, but not the corruption case,
> >
> >Data corruption can occur depending on the ordering of events and the
> >application's expectations. If a process starts IO, RDMA pins the page
> >for read and forks are combined with writes from another thread then when
> >the IO completes the reads may not be visible. The application may take
> >improper action at that point.
>
> if tux3 forks the page and writes the copy while the original page
> is being modified by other things, this means that some of the
> changes won't be in the version written (and this could catch
> partial writes with 'interesting' results if the forking happens at
> the wrong time)
>

Potentially yes. There is likely to be some elevated memory usage but I
imagine that can be controlled.

> But if the original page gets re-marked as needing to be written out
> when it's changed by one of the other things that are accessing it,
> there shouldn't be any long-term corruption.
>
> As far as short-term corruption goes, any time you have a page
> mmapped it could get written out at any time, with only some of the
> application changes applied to it, so this sort of corruption could
> happen anyway couldn't it?
>

That becomes the responsibility of the application. It's up to it to sync
appropriately when it knows updates are complete.

> >Users of RDMA are typically expected to use MADV_DONTFORK to avoid this
> >class of problem.
> >
> >You can choose to not define this as data corruption because the kernel
> >is not directly involved and that's your call.
> >
> >>Do you
> >>mean corruption by changing a page already in writeout? If so,
> >>don't all filesystems have that problem?
> >>
> >
> >No, the problem is different. Backing devices requiring stable pages will
> >block the write until the IO is complete. For those that do not require
> >stable pages it's ok to allow the write as long as the page is dirtied so
> >that it'll be written out again and no data is lost.
>
> so if tux3 is prevented from forking the page in cases where the
> write would be blocked, and will get forked again for follow-up
> writes if it's modified again otherwise, won't this be the same
> thing?
>

Functionally and from a correctness point of view, it *might* be equivalent.
It depends on the implementation and the page life cycle, particularly
the details of how the writeback and dirty state are coordinated between
the user-visible pages and the page being written back. I've read none of
the code or background so I cannot answer whether it's really equivalent
or not. Just be aware that it's not the same problem as page migration
and that it's not the same as how writeback and dirty state are handled
today.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
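
For readers following the thread, here is a minimal userspace sketch of
the two application-side measures mentioned above: marking a buffer with
MADV_DONTFORK so fork() cannot trigger COW on pages pinned for DMA, and
calling msync() on a shared mapping once an update is complete. It is
Linux-specific. The helper names alloc_dma_buffer and write_and_sync are
hypothetical and exist only for illustration; real RDMA code would also
register the buffer with its verbs library, which is not shown, and none
of this reflects how Tux3 itself is implemented.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: allocate an anonymous buffer intended to be pinned
 * for DMA and exclude it from fork(). With MADV_DONTFORK the range is not
 * mapped in the child, so a later write in the parent does not COW the
 * pages away from whatever holds a pin on them.
 * Returns NULL on failure.
 */
void *alloc_dma_buffer(size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    if (madvise(buf, len, MADV_DONTFORK) != 0) {
        munmap(buf, len);
        return NULL;
    }
    return buf;
}

/*
 * Hypothetical helper: write len bytes (len > 0) through a shared file
 * mapping and only then ask for writeback. Until msync() returns, the
 * kernel may write the page out at any moment with a partial update, so
 * the application syncs once it knows the update is complete.
 * Returns 0 on success, -1 on failure.
 */
int write_and_sync(int fd, const char *data, size_t len)
{
    if (len == 0 || ftruncate(fd, (off_t)len) != 0)
        return -1;

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    memcpy(map, data, len);               /* the complete update */
    int rc = msync(map, len, MS_SYNC);    /* block until written back */

    munmap(map, len);
    return rc;
}
```

A caller would typically invoke alloc_dma_buffer() before handing the
buffer to the DMA-capable subsystem and before any fork(), and would call
write_and_sync() (or msync() directly) at the points where the on-disk
contents need to reflect a consistent update.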