Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> CC: Steven Whitehouse <swhiteho@xxxxxxxxxx> Subject: Re: [RFC][PATCH] Disabling read-ahead makes I/O of large reads small Reply-To: In-Reply-To: <87aax18xms.fsf@xxxxxxxxxxxxxxxxx> Hi Quentin, Quentin Barnes <qbarnes+nfs@xxxxxxxxxxxxx> writes: > Adding the posix_fadvise(..., POSIX_FADV_RANDOM) call sets Have you tried w/o POSIX_FADV_RANDOM (ie. comment out the fadivse call)? It should be able to achieve the same good performance. The heuristic readahead logic should detect that the application is doing random reads. > ra_pages=0. This has a very odd side-effect in the kernel. Once > read-ahead is disabled, subsequent calls to read(2) are now done in > the kernel via ->readpage() callback doing I/O one page at a time! > > Pouring through the code in mm/filemap.c I see that the kernel has > commingled read-ahead and plain read implementations. The algorithms > have much in common, so I can see why it was done, but it left this > anomaly of severely pimping read(2) calls on file descriptors with > read-ahead disabled. > > > For example, with a read(2) of 98K bytes of a file opened with > O_DIRECT accessed over NFSv3 with rsize=32768, I see: > ========= > V3 ACCESS Call (Reply In 249), FH:0xf3a8e519 > V3 ACCESS Reply (Call In 248) > V3 READ Call (Reply In 321), FH:0xf3a8e519 Offset:0 Len:32768 > V3 READ Call (Reply In 287), FH:0xf3a8e519 Offset:32768 Len:32768 > V3 READ Call (Reply In 356), FH:0xf3a8e519 Offset:65536 Len:32768 > V3 READ Reply (Call In 251) Len:32768 > V3 READ Reply (Call In 250) Len:32768 > V3 READ Reply (Call In 252) Len:32768 > ========= > I would expect three READs issued of size 32K, and that's exactly > what I see. > > > For the same file without O_DIRECT but with read-ahead disabled > (its ra_pages=0), I see: > ========= > V3 ACCESS Call (Reply In 167), FH:0xf3a8e519 > V3 ACCESS Reply (Call In 166) > V3 READ Call (Reply In 172), FH:0xf3a8e519 Offset:0 Len:4096 > V3 READ Reply (Call In 168) Len:4096 > V3 READ Call (Reply In 177), FH:0xf3a8e519 Offset:4096 Len:4096 > V3 READ Reply (Call In 173) Len:4096 > V3 READ Call (Reply In 182), FH:0xf3a8e519 Offset:8192 Len:4096 > V3 READ Reply (Call In 178) Len:4096 > [... READ Call/Reply pairs repeated another 21 times ...] > ========= > Now I see 24 READ calls of 4K each! Good catch, Thank you very much! > A workaround for this kernel problem is to hack the app to do a > readahead(2) call prior to the read(2), however, I would think a > better approach would be to fix the kernel. I came up with the > included patch that once applied restores the expected read(2) > behavior. For the latter test case above of a file with read-ahead > disabled but now with the patch below applied, I now see: > ========= > V3 ACCESS Call (Reply In 1350), FH:0xf3a8e519 > V3 ACCESS Reply (Call In 1349) > V3 READ Call (Reply In 1387), FH:0xf3a8e519 Offset:0 Len:32768 > V3 READ Call (Reply In 1421), FH:0xf3a8e519 Offset:32768 Len:32768 > V3 READ Call (Reply In 1456), FH:0xf3a8e519 Offset:65536 Len:32768 > V3 READ Reply (Call In 1351) Len:32768 > V3 READ Reply (Call In 1352) Len:32768 > V3 READ Reply (Call In 1353) Len:32768 > ========= > Which is what I would expect -- back to just three 32K READs. > > After this change, the overall performance of the application > increased by 313%! And awesome improvements! > > I have no idea if my patch is the appropriate fix. I'm well out of > my area in this part of the kernel. It solves this one problem, but > I have no idea how many boundary cases it doesn't cover or even if > it is the right way to go about addressing this issue. > > Is this behavior of shorting I/O of read(2) considered a bug? And > is this approach for a fix approriate? The approach is mostly OK for the bug. However one issue is missed -- the ra_pages is somehow overloaded. I try to fix the problems in the two patches just posted. Will that solve your problem? Thanks, Fengguang > > --- linux-2.6.32.2/mm/filemap.c 2009-12-18 16:27:07.000000000 -0600 > +++ linux-2.6.32.2-rapatch/mm/filemap.c 2009-12-24 13:07:07.000000000 -0600 > @@ -1012,9 +1012,13 @@ static void do_generic_file_read(struct > find_page: > page = find_get_page(mapping, index); > if (!page) { > - page_cache_sync_readahead(mapping, > - ra, filp, > - index, last_index - index); > + if (ra->ra_pages) > + page_cache_sync_readahead(mapping, > + ra, filp, > + index, last_index - index); > + else > + force_page_cache_readahead(mapping, filp, > + index, last_index - index); > page = find_get_page(mapping, index); > if (unlikely(page == NULL)) > goto no_cached_page; -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html