Re: Sysfs-Configurable readahead and background bypasses

Hello!

On Sun, Feb 17, 2019 at 06:41, Coly Li <colyli@xxxxxxx> wrote:
>
> On 2019/2/16 9:28 PM, Andreas wrote:
> > Thank you, I understand the situation a little better now.
> >
> > Saving cache space makes sense for cache drives that, as you said, are
> > small. But for users like me, who go the extra mile and install a
> > generously large cache drive, the behaviour is punishing.
> > After upgrading my kernel and swapping out the cache drive, I was having
> > trouble getting my new 128GB cache filled from three 8TB hard drives,
> > which set me on my journey to figure out why and ended up writing my
> > patch. I also know of people using SSDs as large as 512GB exclusively
> > for bcache.
> >
> > The symptom that made me curious about there being an odd change in
> > bcache behaviour was my MP3 music library, where my file browser reads
> > the ID3-tag information from these files. No matter how often I scrolled
> > through my library, most of the traffic kept going to the hard drive and
> > bcache wasn't adding any new data to the cache drive despite there being
> > upwards of 100GB of unused cache space.
> > As it turned out, my file explorer first issues a small read to each
> > file to determine the size and position of the ID3-tag section. The
> > readahead operation attached to this small read would then fetch the
> > actual ID3-tag, and the subsequent read for the tag data would not issue
> > a separate operation to be considered by bcache. This is then done for
> > several files simultaneously - a workload an SSD can happily deal with
> > but an HDD gets overwhelmed by.
> > Bcache only cached that first small read for each file and ignored the
> > actual ID3-tag data as it was fetched from a readahead. This behaviour
> > was consistent in that even in subsequent iterations of the scenario
> > only that first small read was served from the cache and then the HDD
> > had to slowly seek to the actual ID3-tag data without bcache ever
> > picking up on it as it was still being fetched by a readahead.
> > So while in theory it might sound fine to rely on readaheads going to the HDD,
> > in practice it is noticeably faster to have everything coming from the
> > SSD cache.
> >
>
> Hi Andreas,
>
> Thanks for your patience and explanation. I now understand your use
> case; it is reasonable to have such readahead data on the cache device.
>
> > I believe that one of the core problems with this behaviour is that
> > bcache simply doesn't know if data fetched in a readahead is actually
> > being used or not. Caching readaheads leads to false positives (data
> > cached that isn't being used) and bypassing readaheads leads to false
> > negatives (data not cached that is being used) - in my eyes it should be
> > up to the user to decide which way works better for them if they want to.
> >
> > To me, bypassing readahead and background IO only seems like a good idea
> > for relatively small caches (I'd say <= 16GB). But users with bigger
> > caches get punished by this behaviour, as they could get better
> > performance out of their cache (and did until late 2017).
> >
> > Besides this anecdotal evidence and these thoughts, I cannot provide
> > any hard numbers on the issue.
> >
>
> Let me explain why a performance number is desired. Normally most
> readahead pages are only accessed once, so it is sufficient to keep
> them in memory just for that one access. It is only worth keeping
> readahead data on the cache device when the data will be accessed
> multiple times (hot). Otherwise bcache just introduces more I/Os on
> the SSD and does not help much with performance.

Here's a suggested solution that could work and improve the hit rate,
especially for cache devices which are too small for the applied workload:

In the LRU algorithm, never insert new cache entries at the tip of the
list but only at a random location in the LRU list, and only move an
entry to the tip of the list when it is actually accessed. This way,
data that is only accessed once has, on average, a better chance of
being flushed from the cache early, and big caches shouldn't be
affected. A good performance test would therefore be to create a
workload which exceeds the size of a small cache, and then run
reproducer tests to collect some performance numbers (hit rate,
throughput, latency, time to run, etc.).
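
To make the idea more concrete, here is a minimal user-space sketch of
such a random-insert LRU. This is plain C and not bcache code; all the
names (struct cache_entry, lru_insert_random(), lru_touch(), ...) are
made up purely for illustration:

/*
 * Minimal user-space sketch of the proposed policy - NOT bcache code.
 * New entries enter the LRU at a random position instead of the head;
 * only an actual re-access promotes an entry to the head. Eviction
 * still takes the tail, as in a plain LRU. All names are made up for
 * illustration.
 */
#include <stdio.h>
#include <stdlib.h>

struct cache_entry {
	int key;
	struct cache_entry *prev, *next;
};

static struct cache_entry *head, *tail;
static unsigned int nr_entries;

/* Unlink an entry from the list. */
static void lru_unlink(struct cache_entry *e)
{
	if (e->prev) e->prev->next = e->next; else head = e->next;
	if (e->next) e->next->prev = e->prev; else tail = e->prev;
	e->prev = e->next = NULL;
	nr_entries--;
}

/* Classic head insertion, used only when promoting on a hit. */
static void lru_insert_head(struct cache_entry *e)
{
	e->prev = NULL;
	e->next = head;
	if (head) head->prev = e; else tail = e;
	head = e;
	nr_entries++;
}

/* New entries go to a random position instead of the head. */
static void lru_insert_random(struct cache_entry *e)
{
	unsigned int pos = rand() % (nr_entries + 1);
	struct cache_entry *cur = head;

	while (pos-- && cur)
		cur = cur->next;

	if (!cur) {			/* append at the tail */
		e->prev = tail;
		e->next = NULL;
		if (tail) tail->next = e; else head = e;
		tail = e;
	} else {			/* insert in front of cur */
		e->next = cur;
		e->prev = cur->prev;
		if (cur->prev) cur->prev->next = e; else head = e;
		cur->prev = e;
	}
	nr_entries++;
}

/* A cache hit promotes the entry: ordinary move-to-front. */
static void lru_touch(struct cache_entry *e)
{
	lru_unlink(e);
	lru_insert_head(e);
}

/* Eviction takes the tail, exactly as a plain LRU would. */
static struct cache_entry *lru_evict(void)
{
	struct cache_entry *victim = tail;

	if (victim)
		lru_unlink(victim);
	return victim;
}

int main(void)
{
	struct cache_entry e[4] = { {1}, {2}, {3}, {4} };
	int i;

	for (i = 0; i < 4; i++)
		lru_insert_random(&e[i]);	/* four misses */
	lru_touch(&e[2]);			/* one hit promotes key 3 */
	printf("evicted key %d\n", lru_evict()->key);
	return 0;
}

In other words, on a cache miss the new entry would go through
lru_insert_random() instead of the usual head insertion; hits and
eviction stay exactly as in a plain LRU, so only the insertion point
changes.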

The obvious downside is that we may push some IO out of the cache too
early when it happens to be accessed a second time just a few requests
later. But the random factor should filter this problem out, so we may
hit a false negative only once or twice before the data stays in the
cache.

I was already thinking about creating such a patch but I really do not
understand the LRU functions very well... It seems they do not easily
support an "insert_at" operation, only a "random discard" - but the
latter is not the same as a random insert. In fact, I think random
discard is quite useless, especially once a random insert is in the
code: random discard doesn't know the history of the entry it discards,
while random insert does.

What do you think?


Thanks,
Kai



