Re: bcache on XFS: metadata I/O (dirent I/O?) not getting cached at all?

On 13 Feb 2019, Kai Krakow told this:

> On Thu, 7 Feb 2019 at 21:51, Nix <nix@xxxxxxxxxxxxx> wrote:
>> btw I have ported ewheeler's ioprio-based cache hinting patch to 4.20;
>> I/O below the ioprio threshold bypasses everything, even metadata and
>> REQ_PRIO stuff. It was trivial, but I was able to spot and fix a tiny
>> bypass accounting bug in the patch in the process; see
>> http://www.esperi.org.uk/~nix/bundles/bcache-ioprio.bundle. (I figured
>> you didn't want almost exactly the same patch series as before posted to
>> the list, but I can do that if you prefer.)
>
> I compared this to my branch of the patches and cannot spot a

Oh good, someone else is using this: I'm not walking on completely
untrodden snow!

> difference: Where's the tiny bypass accounting bug you fixed?

The original patch I worked from had

@@ -386,6 +388,28 @@ static bool check_should_bypass(struct cached_dev *dc, struct bio *bio)
 	     op_is_write(bio_op(bio))))
 		goto skip;
 
+	/* If the ioprio already exists on the bio, use that.  We assume that
+	 * the upper layer properly assigned the calling process's ioprio to
+	 * the bio being passed to bcache. Otherwise, use current's ioc. */
+	ioprio = bio_prio(bio);
+	if (!ioprio_valid(ioprio)) {
+		ioc = get_task_io_context(current, GFP_NOIO, NUMA_NO_NODE);
+		if (ioc) {
+			if (ioprio_valid(ioc->ioprio))
+				ioprio = ioc->ioprio;
+			put_io_context(ioc);
+			ioc = NULL;
+		}
+	}
+
+	/* If process ioprio is lower-or-equal to dc->ioprio_bypass, then
+	 * hint for bypass. Note that a lower-priority IO class+value
+	 * has a greater numeric value. */
+	if (ioprio_valid(ioprio) && ioprio_valid(dc->ioprio_writeback)
+		&& ioprio >= dc->ioprio_bypass) {
+		return true;
+	}
+

This is erroneous: bypassing should 'goto skip' in order to call
bch_mark_sectors_bypassed(), not just return true.

> Here's my branch:
> https://github.com/kakra/linux/compare/master...kakra:rebase-4.20/bcache-updates

Looks to be fixed there. Maybe you found a later version of the patches
than I did :) I derived mine from ewheelerinc's
for-4.10-block-bcache-updates, but even
bcache-updates-linux-block-for-4.13 seems to have the same bug, as does
bcache-updates-linux-block-for-next.

Which branch did you rebase from? Maybe I should respin from the same
one (or probably just use your branch :) ).

>> Semi-unrelated side note: after my most recent reboot, which involved a
>> bcache journal replay even though my shutdown was clean, the stats_total
>> reset; the cache device's bcache/written and
>> bcache/set/cache_available_percent also flipped to 0 and 100%. I
>> suspect this is merely a stats bug of some sort, because the boot was
>> notably faster than before and cache_hits was about 6000 by the time it
>> was done. bcache/priority_stats *does* say that the cache is "only" 98%
>> unused, like it did before. Maybe cache_available_percent doesn't mean
>> what I thought it did.
>
> There's still a problem with bcache doing writebacks very very slowly,
> at only 4k/s. My system generates more than 4k/s writes thus it will
> eventually never finish writing back dirty data.

That seems... very bad.

(Thankfully this doesn't affect me since I turned writeback off on the
grounds that since this is all atop md/RAID-6, if I want writeback
caching I'm going to do it by turning on the RAID journal anyway -- and
so many of my writes are never read again except sequentially that
storing them all would be a complete waste... and also because bcache
writeback has always struck me as far buggier than the rest of it.)

> This makes the writeback worker write at least 4 MB/s which should be

4MiB/s is... also exceedingly slow for spinning rust. I can manage at
least 200MiB/s to this array, and this is RAID-6, which is notably slow
at writing. Hell, my 1996-era sym53c875 could manage 10MiB/s!

It is fairly common for me to emit 5GiB of writes at once. I wouldn't
want to wait hours for them to hit permanent storage!

> Before this change, I've seen bcache size not fully used and a lot

Mine is still only 8GiB used out of 340. I think I might boost the
bypass figures -- perhaps setting it identical to the RAID stripe size
was a bad idea? (Though I thought there was a preference for full-stripe
*writes*, not reads, even if XFS does know about the RAID topology.)

(My storage hierarchy:

xfs -> [optional dm-crypt] -> LVM -> writearound bcache on an LVM PV
-> md/raid6.

XFS is journalling to the same SSD I'm using for bcaching. In theory
this is dangerous: in practice my SSD still shows nearly a thousand
years expected lifespan! So I'm not really considering SSD failure to be
likely, and will replace it with another similarly costly-but-functional
Intel DC SSD before failure in any case.

Figuring out the data offsets etc for the various layers of this
hierarchy, almost none of which are aware of the offsets introduced by
the higher layers and all of which use totally different ways to
communicate or adjust the offset, was fairly horrible, but blktrace
shows that I got it right in the end.)


-- 
NULL && (void)


