On 12/11/19 6:29 PM, Jens Axboe wrote:
> On 12/11/19 6:22 PM, Linus Torvalds wrote:
>> On Wed, Dec 11, 2019 at 5:11 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>
>>> 15K is likely too slow to really show an issue, I'm afraid. The 970
>>> is no slouch, but your crypt setup will likely hamper it a lot. You
>>> don't have a non-encrypted partition on it?
>>
>> No. I normally don't need all that much disk, so I've never upgraded
>> my ssd from the 512G size.
>>
>> Which means that it's actually half full or so, and I never felt like
>> "I should keep an unencrypted partition for IO testing", since I don't
>> generally _do_ any IO testing.
>>
>> I can get my load up with "numjobs=8" and get my iops up to the 100k
>> range, though.
>>
>> But kswapd doesn't much seem to care, the CPU percentage actually goes
>> _down_ to 0.39% when I try that. Probably simply because now my CPUs
>> are busy, so they are running at 4.7GHz instead of the 800MHz "mostly
>> idle" state ...
>>
>> I guess I should be happy. It does mean that the situation you see
>> isn't exactly the normal case. I understand why you want to do the
>> non-cached case, but the case I think is the worrisome one is the
>> regular buffered one, so that's what I'm testing (not even trying the
>> noaccess patches).
>>
>> So from your report I went "uhhuh, that sounds like a bug". And it
>> appears that it largely isn't - you're seeing it because you're pushing
>> the IO subsystem by another order of magnitude (and then I agree that
>> "under those kinds of IO loads, caching just won't help")
>
> I'd very much argue that it IS a bug; it just may not show on your
> system. My test box is a pretty standard 2-socket system, 24 cores / 48
> threads, 2 nodes. The last numbers I sent were 100K IOPS, so nothing
> crazy, and granted that's only 10% kswapd CPU time, but that still seems
> very high for those kinds of rates. I'm surprised you see essentially no
> kswapd time for the same data rate.
>
> We'll keep poking here, I know Johannes is spending some time looking
> into the reclaim side.

Out of curiosity, I just tried it on my laptop, which also has a Samsung
drive. Using 8 jobs, I get around 100K IOPS too, and this is my top
listing:

23308 axboe     20   0  623156   1304      8 D  10.3  0.0  0:03.81 fio
23309 axboe     20   0  623160   1304      8 D  10.3  0.0  0:03.81 fio
23311 axboe     20   0  623168   1304      8 D  10.3  0.0  0:03.82 fio
23313 axboe     20   0  623176   1304      8 D  10.3  0.0  0:03.82 fio
23314 axboe     20   0  623180   1304      8 D  10.3  0.0  0:03.81 fio
  162 root      20   0       0      0      0 S   9.9  0.0  0:12.97 kswapd0
23307 axboe     20   0  623152   1304      8 D   9.9  0.0  0:03.84 fio
23310 axboe     20   0  623164   1304      8 D   9.9  0.0  0:03.81 fio
23312 axboe     20   0  623172   1304      8 D   9.9  0.0  0:03.80 fio

kswapd is between 9-11% the whole time, and the profile looks very
similar to what I saw on my test box:

    35.79%  kswapd0  [kernel.vmlinux]  [k] xas_create
     9.97%  kswapd0  [kernel.vmlinux]  [k] free_pcppages_bulk
     9.94%  kswapd0  [kernel.vmlinux]  [k] isolate_lru_pages
     7.78%  kswapd0  [kernel.vmlinux]  [k] shrink_page_list
     3.78%  kswapd0  [kernel.vmlinux]  [k] xas_clear_mark
     3.08%  kswapd0  [kernel.vmlinux]  [k] workingset_eviction
     2.48%  kswapd0  [kernel.vmlinux]  [k] __isolate_lru_page
     2.06%  kswapd0  [kernel.vmlinux]  [k] page_mapping
     1.95%  kswapd0  [kernel.vmlinux]  [k] __remove_mapping

So now I'm even more puzzled why your (desktop?) machine doesn't show
it; it must be more potent than my x1 laptop. But for me, the laptop and
the 2-socket test box show EXACTLY the same behavior, the laptop is just
too slow to make it really pathological.

-- 
Jens Axboe
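
For anyone trying to reproduce the numbers above, a fio job along these
lines should approximate the 8-job buffered workload being discussed.
The thread itself never includes the actual job file, so the IO engine,
block size, file path, and sizes below are assumptions, not Jens's real
settings:

; buffered-read.fio -- sketch of the 8-job buffered workload from the
; thread; engine, path, size, and runtime are assumed, not confirmed
[global]
; plain buffered read(2), no O_DIRECT, so the page cache fills up and
; kswapd has reclaim work to do
ioengine=psync
rw=randread
bs=4k
direct=0
; per-job working set; pick something larger than free RAM so reclaim
; actually kicks in
size=10g
runtime=60
time_based

[reader]
; matches the "numjobs=8" runs mentioned in the thread
numjobs=8
; assumed location on the NVMe drive under test
filename=/data/fio-testfile

With the working set larger than free memory, the page cache fills and
kswapd starts reclaiming; watching top alongside something like
"perf top -p $(pgrep kswapd0)" while this runs should show whether
xas_create dominates the reclaim path as in the profile above.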