On 11/11/24 6:04 AM, Stefan Metzmacher wrote:
> Hi Jens,
>
>> If the same test case is run with RWF_UNCACHED set for the buffered read,
>> the output looks as follows:
>>
>> Reading bs 65536, uncached 0
>>   1s: 153144MB/sec
>>   2s: 156760MB/sec
>>   3s: 158110MB/sec
>>   4s: 158009MB/sec
>>   5s: 158043MB/sec
>>   6s: 157638MB/sec
>>   7s: 157999MB/sec
>>   8s: 158024MB/sec
>>   9s: 157764MB/sec
>>  10s: 157477MB/sec
>>  11s: 157417MB/sec
>>  12s: 157455MB/sec
>>  13s: 157233MB/sec
>>  14s: 156692MB/sec
>>
>> which is just chugging along at ~155GB/sec of read performance. Looking
>> at top, we see:
>>
>>    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>   7961 root      20   0  267004      0      0 S  3180   0.0   5:37.95 uncached
>>   8024 axboe     20   0   14292   4096      0 R   1.0   0.0   0:00.13 top
>>
>> where just the test app is using CPU, no reclaim is taking place outside
>> of the main thread. Not only is performance 65% better, it's also using
>> half the CPU to do it.
>
> Do you have numbers of similar code using O_DIRECT just to
> see the impact of the memcpy from the page cache to the userspace
> buffer...

I don't, but I can surely generate those. I didn't consider them that
interesting for this comparison, which is why I didn't do them. O_DIRECT
reads for bigger block sizes (or even smaller block sizes, if using
io_uring + registered buffers) will definitely have lower overhead than
uncached and buffered IO. Copying 160GB/sec isn't free :-)

For writes it's a bit more complicated to do an apples to apples
comparison, as uncached IO isn't synchronous like O_DIRECT is. It only
kicks off the IO, doesn't wait for it.

-- 
Jens Axboe
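
For anyone wanting to try the buffered vs. O_DIRECT read comparison asked about above, here is a minimal sketch. It is not the test program used in this thread, and it does not exercise RWF_UNCACHED at all. The helper name `read_throughput` and the 64KB default block size are illustrative assumptions. It also assumes Linux and a file whose size is a multiple of the block size, so that O_DIRECT offsets stay aligned.

```python
import mmap
import os
import time


def read_throughput(path, block_size=65536, direct=False):
    """Read `path` sequentially; return (bytes_read, seconds).

    Hypothetical helper for illustration only. Returns None when
    direct=True and O_DIRECT is unavailable or rejected by the
    filesystem (some filesystems refuse it).
    """
    flags = os.O_RDONLY
    if direct:
        o_direct = getattr(os, "O_DIRECT", None)  # Linux-only constant
        if o_direct is None:
            return None
        flags |= o_direct

    try:
        fd = os.open(path, flags)
    except OSError:
        if direct:
            return None
        raise

    # An anonymous mmap is page-aligned, which satisfies the buffer
    # alignment that O_DIRECT typically requires.
    buf = mmap.mmap(-1, block_size)
    total = 0
    offset = 0
    start = time.perf_counter()
    try:
        while True:
            try:
                n = os.preadv(fd, [buf], offset)
            except OSError:
                if direct:
                    return None  # e.g. EINVAL from an alignment rule
                raise
            if n == 0:  # end of file
                break
            total += n
            offset += n
        elapsed = time.perf_counter() - start
    finally:
        buf.close()
        os.close(fd)
    return total, elapsed


if __name__ == "__main__":
    import sys

    for direct in (False, True):
        result = read_throughput(sys.argv[1], direct=direct)
        label = "O_DIRECT" if direct else "buffered"
        if result is None:
            print(f"{label}: not supported here")
        else:
            nbytes, secs = result
            print(f"{label}: {nbytes / max(secs, 1e-9) / 1e6:.0f} MB/sec")
```

A single-threaded Python loop will come nowhere near the numbers quoted above. The point is only to show the two read paths side by side: the buffered path copies from the page cache into the user buffer, while the O_DIRECT path bypasses the page cache and so avoids that copy.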