Re: [PATCH 08/13] fs: add read support for RWF_UNCACHED

Jens Axboe <axboe@xxxxxxxxx> · Mon, 11 Nov 2024 07:10:28 -0700

On 11/11/24 6:04 AM, Stefan Metzmacher wrote:
> Hi Jens,
> 
>> If the same test case is run with RWF_UNCACHED set for the buffered read,
>> the output looks as follows:
>>
>> Reading bs 65536, uncached 1
>>    1s: 153144MB/sec
>>    2s: 156760MB/sec
>>    3s: 158110MB/sec
>>    4s: 158009MB/sec
>>    5s: 158043MB/sec
>>    6s: 157638MB/sec
>>    7s: 157999MB/sec
>>    8s: 158024MB/sec
>>    9s: 157764MB/sec
>>   10s: 157477MB/sec
>>   11s: 157417MB/sec
>>   12s: 157455MB/sec
>>   13s: 157233MB/sec
>>   14s: 156692MB/sec
>>
>> which is just chugging along at ~155GB/sec of read performance. Looking
>> at top, we see:
>>
>>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>> 7961 root      20   0  267004      0      0 S  3180   0.0   5:37.95 uncached
>> 8024 axboe     20   0   14292   4096      0 R   1.0   0.0   0:00.13 top
>>
>> where just the test app is using CPU, no reclaim is taking place outside
>> of the main thread. Not only is performance 65% better, it's also using
>> half the CPU to do it.
> 
> Do you have numbers of similar code using O_DIRECT just to
> see the impact of the memcpy from the page cache to the userspace
> buffer...

I don't, but I can surely generate those. I didn't consider them that
interesting for this comparison, which is why I didn't run them. O_DIRECT
reads for bigger block sizes (or even smaller block sizes, if using
io_uring + registered buffers) will definitely have lower overhead than
uncached and buffered IO. Copying 160GB/sec isn't free :-)
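
For reference, a minimal sketch of what the O_DIRECT side of such a
comparison could look like. This is not the test app from the series;
read_file() and the 4096-byte ALIGN value are assumptions made up for
illustration, and the real alignment requirement depends on the device
and filesystem. With use_direct set, the data skips the page cache and
the copy into the user buffer that buffered and uncached reads pay for.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT in <fcntl.h> on Linux */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define ALIGN 4096              /* assumed alignment, not taken from the thread */

/*
 * Hypothetical helper: read 'path' sequentially in 'bs'-byte chunks.
 * If 'use_direct' is non-zero the file is opened with O_DIRECT.
 * Returns the total number of bytes read, or -1 on error.
 */
static long long read_file(const char *path, size_t bs, int use_direct)
{
    /* O_DIRECT generally wants buffer, offset and length aligned */
    assert(bs != 0 && bs % ALIGN == 0);

    int fd = open(path, O_RDONLY | (use_direct ? O_DIRECT : 0));
    if (fd < 0)
        return -1;

    void *buf = NULL;
    if (posix_memalign(&buf, ALIGN, bs) != 0) {
        close(fd);
        return -1;
    }

    long long total = 0;
    off_t off = 0;
    ssize_t n;

    /* pread() returns 0 at end of file and -1 on error */
    while ((n = pread(fd, buf, bs, off)) > 0) {
        total += n;
        off += n;
    }

    free(buf);
    close(fd);
    return n < 0 ? -1 : total;
}
```

Calling read_file(path, 65536, 1) and read_file(path, 65536, 0) on the
same file, with timing around each call, would give a rough O_DIRECT vs
buffered data point. Some filesystems (tmpfs, for example) may reject
O_DIRECT at open time.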

For writes it's a bit more complicated to do an apples-to-apples
comparison, as uncached IO isn't synchronous like O_DIRECT is. It only
kicks off the IO; it doesn't wait for it.
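
To illustrate why the wait matters, here is a rough sketch (again not
from the series) that times a plain buffered write with and without
waiting for the data to reach storage. timed_write() is a hypothetical
helper, and using fdatasync() as the "wait" step is an assumption about
how one might level the comparison, not something the patches prescribe.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: write 'len' bytes from 'buf' to 'path' and return
 * the elapsed time in seconds, or -1.0 on error. If 'wait' is non-zero,
 * the timed region also includes fdatasync(), so the IO has completed
 * before the clock stops; otherwise only the submission is measured.
 */
static double timed_write(const char *path, const void *buf, size_t len,
                          int wait)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1.0;

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    size_t done = 0;
    while (done < len) {
        ssize_t n = pwrite(fd, (const char *)buf + done, len - done,
                           (off_t)done);
        if (n <= 0) {           /* treat a zero-length write as an error */
            close(fd);
            return -1.0;
        }
        done += (size_t)n;
    }

    if (wait && fdatasync(fd) < 0) {
        close(fd);
        return -1.0;
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    close(fd);

    return (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

With wait set to 0 the number mostly reflects how fast the data can be
handed to the kernel, which is why comparing it directly against a
synchronous O_DIRECT write would be misleading.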

-- 
Jens Axboe