Correcting...

On Sun, Feb 17, 2019 at 2:12 PM Kai Krakow <kai@xxxxxxxxxxx> wrote:
>
> Hello!
>
> On Sun, Feb 17, 2019 at 6:41 AM Coly Li <colyli@xxxxxxx> wrote:
> >
> > On 2019/2/16 9:28 PM, Andreas wrote:
> > > Thank you, I understand the situation a little better now.
> > >
> > > Saving cache space makes sense for cache drives that, as you said, are
> > > small. But for users like me, who go the extra mile and install a
> > > generously large cache drive, the behaviour is punishing.
> > > After upgrading my kernel and swapping out the cache drive, I had
> > > trouble getting my new 128GB cache filled from three 8TB hard drives,
> > > which set me on a journey to figure out why, and I ended up writing my
> > > patch. I also know of people using SSDs as large as 512GB exclusively
> > > for bcache.
> > >
> > > The symptom that made me curious about an odd change in bcache
> > > behaviour was my MP3 music library, where my file browser reads the
> > > ID3-tag information from these files. No matter how often I scrolled
> > > through my library, most of the traffic kept going to the hard drive,
> > > and bcache wasn't adding any new data to the cache drive despite there
> > > being upwards of 100GB of unused cache space.
> > > As it turned out, my file explorer first issues a small read to each
> > > file to determine the size and position of the ID3-tag section. The
> > > readahead operation attached to this small read then fetches the
> > > actual ID3 tag, and the subsequent read for the tag data does not
> > > issue a separate operation to be considered by bcache. This is done
> > > for several files simultaneously - a workload an SSD can happily deal
> > > with but an HDD gets overwhelmed by.
> > > Bcache only cached that first small read for each file and ignored the
> > > actual ID3-tag data, as it was fetched by a readahead. This behaviour
> > > was consistent: even in subsequent iterations of the scenario, only
> > > that first small read was served from the cache, and the HDD then had
> > > to slowly seek to the actual ID3-tag data without bcache ever picking
> > > up on it, because it was still being fetched by a readahead.
> > > So while in theory it might sound fine to leave readaheads to the HDD,
> > > in practice it is noticeably faster to have everything coming from the
> > > SSD cache.
> > >
> >
> > Hi Andreas,
> >
> > Thanks for your patience and explanation. I now understand your use
> > case; it is reasonable to have such readahead data on the cache device.
> >
> > > I believe that one of the core problems with this behaviour is that
> > > bcache simply doesn't know whether data fetched in a readahead is
> > > actually being used or not. Caching readaheads leads to false
> > > positives (data cached that isn't being used) and bypassing readaheads
> > > leads to false negatives (data not cached that is being used) - in my
> > > eyes it should be up to the user to decide which way works better for
> > > them, if they want to.
> > >
> > > To me, bypassing readahead and background IO only seems like a good
> > > idea for relatively small caches (I'd say <= 16GB). But users with
> > > bigger caches are punished by this behaviour, as they could get better
> > > performance out of it (and did until late 2017).
> > >
> > > Besides this anecdotal evidence and reasoning, I cannot provide any
> > > hard numbers on the issue.
> > >
> >
> > Let me explain why a performance number is desired.
> > Normally most readahead pages are only accessed once, so it is
> > sufficient to keep them in memory just that once. It is only worth
> > keeping readahead data on the cache device when the data will be
> > accessed multiple times (hot). Otherwise bcache just introduces more
> > I/O on the SSD and does not help much with performance.
>
> Here's a suggested solution that could work and improve the hit rate,
> especially for devices which are too small for the applied workload:
>
> In the LRU algorithm, never insert new cache entries at the tip of the
> list but only at a random location in the LRU list. Only move an entry
> to the tip of the LRU list when it is accessed. This way, data accessed
> only once has a better chance of being flushed from the cache early, on
> average. It shouldn't impact big caches. So a good performance test
> would be to create a workload which exceeds the size of a small cache,
> and then run reproducer tests to gather some performance numbers (hit
> rate, throughput, latency, time to run, etc.).
>
> The obvious downside is that we may push some IO out of the cache too
> early when it happens to be accessed a second time just a few requests
> after this event. But the random factor should filter this problem out,
> so we may hit a false negative only once or twice before it stays in
> the cache.
>
> I was already thinking about creating such a patch, but I really do not
> understand the LRU functions very well... It seems they do not easily
> support an "insert_at" operation, only a "random discard" - but the latter
> is not the same as random discard. In fact, I think random discard is

...is not the same as random INSERT...

> quite useless, especially once random insert is in the code. Random
> discard doesn't know the history of the entry it discards, while random
> insert does.
>
> What do you think?
>
>
> Thanks,
> Kai
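
To make the random-insert idea above more concrete, here is a minimal
userspace sketch in C. It is not bcache code and not a proposed patch:
the struct names, the fixed-capacity list and the access trace are
invented purely for illustration, and an actual implementation would
have to hook into bcache's own LRU handling instead.

/*
 * Minimal userspace sketch of the "random insert" LRU idea above.
 * NOT bcache code: all names here are invented for the example.  New
 * entries land at a random list position instead of the MRU end; only
 * a real hit promotes an entry to the MRU end.
 */
#include <stdio.h>
#include <stdlib.h>

struct entry { int key; struct entry *prev, *next; };

struct lru {
	struct entry *mru, *lru;	/* MRU end / LRU (eviction) end */
	int count, capacity;
};

static void unlink_entry(struct lru *l, struct entry *e)
{
	if (e->prev) e->prev->next = e->next; else l->mru = e->next;
	if (e->next) e->next->prev = e->prev; else l->lru = e->prev;
	e->prev = e->next = NULL;
	l->count--;
}

static void insert_at_mru(struct lru *l, struct entry *e)
{
	e->prev = NULL;
	e->next = l->mru;
	if (l->mru) l->mru->prev = e; else l->lru = e;
	l->mru = e;
	l->count++;
}

/* The core of the idea: insert at a uniformly random list position. */
static void insert_at_random(struct lru *l, struct entry *e)
{
	int pos = l->count ? rand() % (l->count + 1) : 0;
	struct entry *after = l->mru;

	if (pos == 0) {			/* position 0 is the MRU end */
		insert_at_mru(l, e);
		return;
	}
	while (--pos > 0)		/* walk to the entry we insert after */
		after = after->next;
	e->prev = after;
	e->next = after->next;
	if (after->next) after->next->prev = e; else l->lru = e;
	after->next = e;
	l->count++;
}

static struct entry *lookup(struct lru *l, int key)
{
	for (struct entry *e = l->mru; e; e = e->next)
		if (e->key == key)
			return e;
	return NULL;
}

/* A hit promotes; a miss evicts from the LRU end and inserts at random. */
static void access_key(struct lru *l, int key)
{
	struct entry *e = lookup(l, key);

	if (e) {
		unlink_entry(l, e);
		insert_at_mru(l, e);
		return;
	}
	if (l->count == l->capacity) {
		struct entry *victim = l->lru;
		unlink_entry(l, victim);
		free(victim);
	}
	e = malloc(sizeof(*e));
	e->key = key;
	insert_at_random(l, e);
}

int main(void)
{
	struct lru cache = { .capacity = 4 };
	int trace[] = { 1, 2, 3, 1, 4, 5, 1, 6 };	/* key 1 is "hot" */

	for (size_t i = 0; i < sizeof(trace) / sizeof(trace[0]); i++)
		access_key(&cache, trace[i]);

	printf("cache, MRU to LRU:");
	for (struct entry *e = cache.mru; e; e = e->next)
		printf(" %d", e->key);
	printf("\n");
	return 0;
}

The intended effect is that an entry only reaches the MRU end after a
genuine second access, so data touched exactly once tends to drift
toward the eviction end sooner, on average, than data that is re-read.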