On 13 Feb 2019, Kai Krakow verbalised:
> On Wed, 13 Feb 2019 at 01:22, Nix <nix@xxxxxxxxxxxxx> wrote:
>> > Here's my branch:
>> > https://github.com/kakra/linux/compare/master...kakra:rebase-4.20/bcache-updates
>>
>> Looks to be fixed there. Maybe you found a later version of the patches
>> than I did :) I derived mine from ewheelerinc's
>> for-4.10-block-bcache-updates, but even
>> bcache-updates-linux-block-for-4.13 seems to have the same bug, as does
>> bcache-updates-linux-block-for-next.
>>
>> Which branch did you rebase from? Maybe I should respin from the same
>> one (or probably just use your branch :) ).
>
> I used the same base but I'm carrying around those patches since then,
> rebased through several kernel versions. I think Eric also jumped in
> once and commented on some corrections that should be made. I just
> followed what I was reading.
>
> Feel free to use that branch, it also has some fixes that are queued for 5.1.

I will probably switch...

>> > There's still a problem with bcache doing writebacks very very slowly,
>> > at only 4k/s. My system generates more than 4k/s writes thus it will
>> > eventually never finish writing back dirty data.
>>
>> That seems... very bad.
>
> It can be. It has downsides: on a busy system, writeback should kick
> in only when idle so as not to delay read IO.

Yeah, except if you're emitting huge quantities of I/O (vapoursynth video
processing, I'm looking at you: that's the only thing I've ever done that
emits a terabyte of data at once and then reads it straight back in
again). Things like *that*, plus copious object files etc. from builds,
are why I have an uncached, unjournalled RAID-0 ext4 fs on the fastest
250GiB of each disk in the array, bind-mounted in as needed for transient
fs operations: if fsck finds problems on it, it just gets automatically
re-mkfsed.

> For optimally ordered IOs I've seen 800 MB/s here, but usually it
> peaks at around 60-80 MB/s for writes when doing Steam downloads (tho

It definitely sounds like your writes mostly come from the Internet:
mine are, ah, endogenously generated (compiles, massive text files with
awk or readelf output being chewed over by scripts, vapoursynth video
transcodes, etc.), so they can be almost arbitrarily huge and fast.

> Setup: bcache 400 GB SSD + 4x HDD btrfs RAID-0.

Mine is 350GiB of SSD devoted to this (and a bunch of the same SSD
devoted to other stuff), plus 6x 8TiB in a RAID-6, about two-thirds of
which is bcached (but that is still mostly empty, because, well, even
after the RAID-6 overhead there's 14TiB of it!)

>> Mine is still only 8GiB used out of 340. I think I might boost the
>> bypass figures -- perhaps setting it identical to the RAID stripe size
>> was a bad idea? (Though I thought there was a preference for full-stripe
>> *writes*, not reads, even if XFS does know about the RAID topology.)
>
> I'm not sure if XFS could really discover the lower-layers topology
> through bcache...

Indeed it can't. So you have to tell it at mkfs time:

mkfs.xfs -m rmapbt=1,reflink=1 \
         -d agcount=17,sunit=$((128*8)),swidth=$((384*8)) \
         -l logdev=/dev/sde3,size=521728b \
         -i sparse=1,maxpct=25 /dev/main/root

There are very few reasons to hand-specify parameters to mkfs.xfs these
days, unless you want to flip on experimental features or something (as
rmapbt and reflinks were when I did this). agcount/sunit/swidth on RAID
arrays where mkfs can't see the topology because bcache is in the way is
one of them.
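To double-check that those values actually took, xfs_info can read the
geometry back once the filesystem is mounted (the mount point below is
just an example). mkfs.xfs takes sunit/swidth in 512-byte units, while
xfs_info reports them in filesystem blocks, hence the *8 above with
4KiB blocks:

xfs_info /mnt/point | grep -E 'sunit|swidth'
# expect something like "sunit=128 swidth=384 blks", i.e. the sector
# counts given to mkfs divided by the 8 sectors per 4KiB fs block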
(External journals are another. *Obviously*, that '521728b' is the same
as saying '2G'. But it's certain to be block-accurate, which I sort of
cared about here. I was working in units of 4KiB fs blocks/HDD sectors
the whole time.)
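Coming back to the writeback and bypass discussion above: those knobs
are all runtime-tunable through sysfs. A rough sketch, with the bcache
device name and cache-set UUID as placeholders for whatever a given
setup uses (exact attributes vary a bit between kernel versions):

# how much dirty data is waiting, and what the rate controller is doing
cat /sys/block/bcache0/bcache/dirty_data
cat /sys/block/bcache0/bcache/writeback_rate_debug

# the "bypass figure": IO more sequential than this threshold goes
# straight to the backing device; echo 0 disables the sequential bypass
cat /sys/block/bcache0/bcache/sequential_cutoff
echo $((16*1024*1024)) > /sys/block/bcache0/bcache/sequential_cutoff

# keep roughly this percentage of the cache dirty; 0 lets background
# writeback run flat out instead of being throttled
echo 10 > /sys/block/bcache0/bcache/writeback_percent

# stop bcache bypassing the cache when the cache device looks congested
echo 0 > /sys/fs/bcache/<cache-set-uuid>/congested_read_threshold_us
echo 0 > /sys/fs/bcache/<cache-set-uuid>/congested_write_threshold_us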