On Fri, Sep 18, 2020 at 01:59:41PM -0700, Linus Torvalds wrote:

> Honestly, if we had a completely *reliable* sign of "this page is
> pinned", then I think the much nicer option would be to just say
> "pinned pages will not be copied at all". Kind of an implicit
> VM_DONTCOPY.

It would be simpler to implement, but it makes the programming model
really sketchy. For instance O_DIRECT is using FOLL_PIN, so imagine
this program:

CPU0                                     CPU1
 a = malloc(1024);
 b = malloc(1024);
 read(fd, a, 1024); // FD is O_DIRECT
 ...                                     fork()
                                         *b = ...
 read completes

Here a and b got lucky and both come from the same page due to the
allocator.

In this case the fork() child on CPU1 would be very surprised that
'b' was not mapped into the fork.

Similarly, CPU0 would have silent data corruption if the read didn't
deposit data into 'a' - which is a bug we have today. In this race the
COW break of *b might steal the physical page to the child, and *a
won't see the data.

For this reason, John is right, fork needs to eventually do this for
O_DIRECT as well.

The copy on fork nicely fixes all of this weird oddball stuff.

Jason
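
To make the "a and b come from the same page" premise concrete, here is a minimal userspace sketch. It is only an illustration, not part of the proposal: the helper names same_page() and alloc_pair_on_one_page() are hypothetical, and posix_memalign() is used to force the layout that malloc() produces only by luck in the program above. It assumes the page size is at least 2048 bytes so that both buffers fit.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Nonzero if p and q lie in the same page_size-aligned page. */
static int same_page(const void *p, const void *q, uintptr_t page_size)
{
    assert(page_size != 0);
    return (uintptr_t)p / page_size == (uintptr_t)q / page_size;
}

/*
 * Hypothetical helper: carve two 1024-byte buffers out of a single
 * page-aligned allocation, so that 'a' and 'b' are guaranteed to share
 * a page (an assumption standing in for allocator luck).
 * Returns 0 on success, -1 on allocation failure.
 * The caller releases both buffers with a single free(*a).
 */
static int alloc_pair_on_one_page(size_t page_size, char **a, char **b)
{
    void *page = NULL;

    if (posix_memalign(&page, page_size, page_size) != 0)
        return -1;

    *a = page;                  /* would be the O_DIRECT read target */
    *b = (char *)page + 1024;   /* unrelated data on the same page */
    return 0;
}
```

With buffers laid out like this, pinning the page for the read into 'a' also covers 'b', which is why a "do not copy pinned pages" rule would make 'b' vanish from the child as well.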