On Thu, 18 Dec 2008 16:29:52 +0100 Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote: > On Wed, Nov 19, 2008 at 05:58:19PM +0100, Andrea Arcangeli wrote: > > On Wed, Nov 19, 2008 at 03:25:59PM +1100, Nick Piggin wrote: > > > The solution either involves synchronising forks and get_user_pages, > > > or probably better, to do copy on fork rather than COW in the case > > > that we detect a page is subject to get_user_pages. The trick is in > > > the details :) > > > From: Andrea Arcangeli <aarcange@xxxxxxxxxx> > Subject: fork-o_direct-race > > Think a thread writing constantly to the last 512bytes of a page, while another > thread read and writes to/from the first 512bytes of the page. We can lose > O_DIRECT reads, the very moment we mark any pte wrprotected because a third > unrelated thread forks off a child. > > This fixes it by never wprotecting anon ptes if there can be any direct I/O in > flight to the page, and by instantiating a readonly pte and triggering a COW in > the child. The only trouble here are O_DIRECT reads (writes to memory, read > from disk). Checking the page_count under the PT lock guarantees no > get_user_pages could be running under us because if somebody wants to write to > the page, it has to break any cow first and that requires taking the PT lock in > follow_page before increasing the page count. > > The COW triggered inside fork will run while the parent pte is read-write, this > is not usual but that's ok as it's only a page copy and it doesn't modify the > page contents. > > In the long term there should be a smp_wmb() in between page_cache_get and > SetPageSwapCache in __add_to_swap_cache and a smp_rmb in between the > PageSwapCache and the page_count() to remove the trylock op. > > Fixed version of original patch from Nick Piggin. > > Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx> Confirmed this fixes the problem. Hmm, but, fork() gets slower. Result of cost-of-fork() on ia64. == size of memory before after Anon=1M , 0.07ms, 0.08ms Anon=10M , 0.17ms, 0.22ms Anon=100M , 1.15ms, 1.64ms Anon=1000M , 11.5ms, 15.821ms == fork() cost is 135% when the process has 1G of Anon. test program is below. (used "/usr/bin/time" for measurement.) == #include <stdlib.h> #include <sys/types.h> #include <sys/wait.h> int main(int argc, char *argv[]) { int size, i, status; char *c; size = atoi(argv[1]) * 1024 * 1024; c = malloc(size); memset(c, 0,size); for (i = 0; i < 5000; i++) { if (!fork()) { exit(0); } wait(&status); } } == -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html