On 10/28/20 4:26 PM, Jens Axboe wrote:
> On 10/28/20 4:22 PM, Andrew Morton wrote:
>> On Tue, 27 Oct 2020 13:35:51 +0000 Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>>
>>> On Sun, Oct 25, 2020 at 03:08:17PM -0700, akpm@xxxxxxxxxxxxxxxxxxxx wrote:
>>>> The patch titled
>>>>      Subject: mm/filemap/c: freak generic_file_buffered_read up into multiple functions
>>>> has been added to the -mm tree. Its filename is
>>>>      fs-break-generic_file_buffered_read-up-into-multiple-functions.patch
>>>
>>> Can we back this out? It really makes the THP patchset unhappy. I think
>>> we can do something like this afterwards, but doing it this way round is
>>> ridiculously hard.
>>
>> Two concerns:
>>
>> : On my test box, 4k buffered random reads go from ~150k to ~250k iops,
>> : and the improvements to big sequential reads are even bigger.
>>
>> That's a big improvement! We want that improvement. Throwing it away
>> on behalf of an as-yet-unmerged feature patchset hurts. Can we expect that
>> this improvement will be available post-that-patchset? And when?
>>
>> (This improvment is rather hard to believe, really - more details on the
>> test environment would be useful. Can we expect that people will in
>> general see similar benefits or was there something special about the
>> testing?)
>
> I did see some wins when I tested this. I'll try and run some testing
> tomorrow and report back. If there's something specifically you want to
> see tested, let me know.

I did some testing, unfortunately it's _very_ hard to produce somewhat
consistent and good numbers as it quickly becomes a game of kswapd.
Here's a basic case of 4 threads doing 32k random reads:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  462 root      20   0       0      0      0 R  65.5   0.0   0:08.02 kswapd0
 2287 axboe     20   0 1303448   2176   1072 R  46.6   0.0   0:05.35 fio
 2289 axboe     20   0 1303456   2196   1092 D  46.6   0.0   0:05.34 fio
 2290 axboe     20   0 1303460   2216   1112 D  46.6   0.0   0:05.37 fio
 2288 axboe     20   0 1303452   2224   1120 R  45.9   0.0   0:05.33 fio

Sad face...

Unfortunately once kswapd kicks in, performance also plummets. This box
only has 32G of RAM, and you can fill that in less than 10 seconds doing
buffered reads like that.

I ran 4k and 32k tests, using 1 and 4 threads. But given the above
sadness, it quickly ends up looking the same for me. What I noticed in
my initial testing on Kent's patches (which was focused on correctness)
was that a read+write verify workload had consistently better read
throughput.

-- 
Jens Axboe