Re: bcache on XFS: metadata I/O (dirent I/O?) not getting cached at all?

On Wed, Feb 13, 2019 at 01:22, Nix <nix@xxxxxxxxxxxxx> wrote:
>
> On 13 Feb 2019, Kai Krakow told this:
>
> > On Thu, Feb 7, 2019 at 21:51, Nix <nix@xxxxxxxxxxxxx> wrote:
> >> btw I have ported ewheeler's ioprio-based cache hinting patch to 4.20;
> >> I/O below the ioprio threshold bypasses everything, even metadata and
> >> REQ_PRIO stuff. It was trivial, but I was able to spot and fix a tiny
> >> bypass accounting bug in the patch in the process: see
> >> http://www.esperi.org.uk/~nix/bundles/bcache-ioprio.bundle. (I figured
> >> you didn't want almost exactly the same patch series as before posted to
> >> the list, but I can do that if you prefer.)
> >
> > I compared this to my branch of the patches and cannot spot a
>
> Oh good, someone else is using this; I'm not working on completely
> untrodden snow!
>
> > difference: Where's the tiny bypass accounting bug you fixed?
>
> The original patch I worked from had
>
> @@ -386,6 +388,28 @@ static bool check_should_bypass(struct cached_dev *dc, struct bio *bio)
>              op_is_write(bio_op(bio))))
>                 goto skip;
>
> +       /* If the ioprio already exists on the bio, use that.  We assume that
> +        * the upper layer properly assigned the calling process's ioprio to
> +        * the bio being passed to bcache. Otherwise, use current's ioc. */
> +       ioprio = bio_prio(bio);
> +       if (!ioprio_valid(ioprio)) {
> +               ioc = get_task_io_context(current, GFP_NOIO, NUMA_NO_NODE);
> +               if (ioc) {
> +                       if (ioprio_valid(ioc->ioprio))
> +                               ioprio = ioc->ioprio;
> +                       put_io_context(ioc);
> +                       ioc = NULL;
> +               }
> +       }
> +
> +       /* If process ioprio is lower-or-equal to dc->ioprio_bypass, then
> +        * hint for bypass. Note that a lower-priority IO class+value
> +        * has a greater numeric value. */
> +       if (ioprio_valid(ioprio) && ioprio_valid(dc->ioprio_writeback)
> +               && ioprio >= dc->ioprio_bypass) {
> +               return true;
> +       }
> +
>
> This is erroneous: bypassing should 'goto skip' in order to call
> bch_mark_sectors_bypassed(), not just return true.
>
> > Here's my branch:
> > https://github.com/kakra/linux/compare/master...kakra:rebase-4.20/bcache-updates
>
> Looks to be fixed there. Maybe you found a later version of the patches
> than I did :) I derived mine from ewheelerinc's
> for-4.10-block-bcache-updates, but even
> bcache-updates-linux-block-for-4.13 seems to have the same bug, as does
> bcache-updates-linux-block-for-next.
>
> Which branch did you rebase from? Maybe I should respin from the same
> one (or probably just use your branch :) ).

I used the same base but have been carrying those patches around since
then, rebasing them through several kernel versions. I think Eric also
jumped in once and commented on some corrections that should be made; I
just followed what I was reading.

Feel free to use that branch; it also has some fixes that are queued for 5.1.
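
For reference, the functional difference is just the tail of that hunk:
the bypass hint needs to fall through to the existing skip label instead
of returning true, so that bch_mark_sectors_bypassed() still counts those
sectors. Roughly, as a sketch against check_should_bypass() rather than a
verbatim copy of the branch:

        /* If process ioprio is lower-or-equal to dc->ioprio_bypass, then
         * hint for bypass. Note that a lower-priority IO class+value
         * has a greater numeric value. */
        if (ioprio_valid(ioprio) && ioprio_valid(dc->ioprio_writeback)
                && ioprio >= dc->ioprio_bypass)
                goto skip;

with the unchanged label at the end of the function doing the accounting:

skip:
        bch_mark_sectors_bypassed(c, dc, bio_sectors(bio));
        return true;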


> > There's still a problem with bcache doing writebacks very very slowly,
> > at only 4k/s. My system generates more than 4k/s writes thus it will
> > eventually never finish writing back dirty data.
>
> That seems... very bad.

It can be, and it has its downsides. But the idea is that on a busy
system, writeback should kick in only when the device is idle, so it
doesn't delay read IO.


> (Thankfully this doesn't affect me since I turned writeback off on the
> grounds that since this is all atop md/RAID-6, if I want writeback
> caching I'm going to do it by turning on the RAID journal anyway -- and
> so many of my writes are never read again except sequentially that
> storing them all would be a complete waste... and also because bcache
> writeback has always struck me as far buggier than the rest of it.)
>
> > This makes the writeback worker write at least 4 MB/s which should be
>
> 4MiB/s is... also exceedingly slow for spinning rust. I can manage at
> least 200MiB/s to this array, and this is RAID-6, which is notably slow
> at writing. Hell, my 1996-era sym53c875 could manage 10MiB/s!

The note above about not delaying read IO is the key here: I didn't want
to delay foreground IO too much, and 4 MB/s is at least a somewhat sane
writeback floor because it is well above the average write activity on
my FS. It's just there to guarantee that writeback eventually finishes
in the reasonably near future. It works well so far.
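
In code terms it is just a floor under whatever the PI controller
computes, roughly like this (a sketch of the tail of
__update_writeback_rate(); the names follow the upstream controller, and
writeback_rate_minimum stands for however the floor gets configured -
4 MB/s is 8192 sectors/s with 512-byte sectors):

        /* Take the PI controller's result, but never let the writeback
         * rate (sectors per second) drop below the configured minimum;
         * the upper bound is just a large cap. */
        new_rate = clamp_t(int32_t, proportional_scaled + integral_scaled,
                           dc->writeback_rate_minimum, NSEC_PER_SEC);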

For optimally ordered IOs I've seen 800 MB/s here, but it usually peaks
at around 60-80 MB/s for writes when doing Steam downloads (though I'm
not sure whether the Steam servers or my disks are the limiting factor;
my internet line tops out at 125 MB/s, and Steam downloads sound quite
random, with lots of head movement), and 200-300 MB/s for reads.
Setup: a 400 GB SSD as bcache in front of a 4x HDD btrfs RAID-0. General
IO latency seems better with vm.watermark_scale_factor=200 and
vm.vfs_cache_pressure=50. I also have one smaller XFS partition backed
by bcache in write-around mode for some constantly changing data -
that's mainly why I stumbled across this thread.
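
(In case it helps anyone else reading along: the way I understand the
ioprio hinting is meant to be driven is simply from user space, e.g. by
running the churny process under ionice -c3, or programmatically like
the sketch below - my assumption of typical usage, with the ioprio bits
defined by hand because glibc has no ioprio_set() wrapper.)

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_CLASS_SHIFT      13
#define IOPRIO_CLASS_IDLE       3
#define IOPRIO_WHO_PROCESS      1
#define IOPRIO_PRIO_VALUE(cls, data)  (((cls) << IOPRIO_CLASS_SHIFT) | (data))

int main(void)
{
        /* who == 0 means the calling process */
        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)) < 0) {
                perror("ioprio_set");
                return 1;
        }

        /* ... now do the bulk I/O that should bypass the cache ... */
        return 0;
}

With the bypass threshold set to the idle class, everything such a
process submits should then skip the cache entirely.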


> It is fairly common for me to emit 5GiB of writes at once. I wouldn't
> want to wait hours for them to hit permanent storage!
>
> > Before this change, I've seen bcache size not fully used and a lot
>
> Mine is still only 8GiB used out of 340. I think I might boost the
> bypass figures -- perhaps setting it identical to the RAID stripe size
> was a bad idea? (Though I thought there was a preference for full-stripe
> *writes*, not reads, even if XFS does know about the RAID topology.)

I'm not sure whether XFS can really discover the lower layers' topology
through bcache...


Regards,
Kai


