Re: [GIT] Bcache version 12

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Tue, 20 Sep 2011 20:44:16 -0700

On Tue, Sep 20, 2011 at 05:37:05PM +0200, Arnd Bergmann wrote:
> On Saturday 10 September 2011, Kent Overstreet wrote:
> > Short overview:
> > Bcache does both writethrough and writeback caching. It presents itself
> > as a new block device, a bit like say md. You can cache an arbitrary
> > number of block devices with a single cache device, and attach and
> > detach things at runtime - it's quite flexible.
> > 
> > It's very fast. It uses a b+ tree for the index, along with a journal to
> > coalesce index updates, and a bunch of other cool tricks like auxiliary
> > binary search trees with software floating point keys to avoid a bunch
> > of random memory accesses when doing binary searches in the btree. It
> > does over 50k iops doing 4k random writes without breaking a sweat,
> > and would do many times that if I had faster hardware.
> > 
> > It (configurably) tracks and skips sequential IO, so as to efficiently
> > cache random IO. It's got more cool features than I can remember at this
> > point. It's resilient, handling IO errors from the SSD when possible up
> > to a configurable threshold, then detaching the cache from the backing
> > device even while you're still using it.
> 
> Hi Kent,
> 
> What kind of SSD hardware do you target here? I roughly categorize them
> into two classes, the low-end (USB, SDHC, CF, cheap ATA SSD) and the
> high-end (SAS, PCIe, NAS, expensive ATA SSD), which have extremely
> different characteristics. 

All of the above.

> I'm mainly interested in the first category, and a brief look at your
> code suggests that this is what you are indeed targeting. If that is
> true, can you name the specific hardware characteristics you require
> as a minimum? I.e. what erase block (bucket) sizes do you support
> (maximum size, non-power-of-two), how many buckets do you have
> open at the same time, and do you guarantee that each bucket is written
> in consecutive order?

Bucket size is set when you format your cache device. It is restricted
to powers of two (though the only reason for that restriction is to
avoid dividing by bucket size all over the place; if there was a
legitimate need we could easily see what the performance hit would be).
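
To make the power-of-two point concrete, here is a minimal sketch in
Python (for illustration only; the helper names are hypothetical and not
taken from the bcache source). With a power-of-two bucket size, mapping a
sector to its bucket is a shift and a mask rather than a divide:

```python
# Hypothetical helpers, not from the bcache source.

def bucket_shift(bucket_sectors: int) -> int:
    """Return log2 of the bucket size, rejecting non-powers of two."""
    if bucket_sectors <= 0 or bucket_sectors & (bucket_sectors - 1):
        raise ValueError("bucket size must be a power of two")
    return bucket_sectors.bit_length() - 1


def sector_to_bucket(sector: int, shift: int) -> int:
    """Bucket number containing this sector: a shift instead of a divide."""
    return sector >> shift


def offset_in_bucket(sector: int, shift: int) -> int:
    """Sector offset inside its bucket: a mask instead of a modulo."""
    return sector & ((1 << shift) - 1)
```

Supporting arbitrary bucket sizes would mean replacing the shift and
mask with a division and a modulo at every such call site.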

And it has to be >= PAGE_SIZE; come to think of it, I don't think there's
a hard upper bound. Performance should be reasonable for bucket sizes
anywhere between 64k and around 2 MB. Somewhere around 64k your btree
will have a depth of 2, and that plus the increased operations on
non-leaf nodes is going to hurt performance. Above around 2 MB,
performance will start to drop as btree nodes get bigger, but the hit
won't be enormous.

For data buckets, we currently keep 16 open, 8 for clean data and 8 for
dirty data. That's hard coded, but there's no reason it has to be. Btree
nodes are in normal operation mostly not full and thus could be
considered open buckets - it's always one btree node per bucket. IO to
the btree is typically < 1% of total IO, though.

Most metadata IO is to the journal; the journal uses a list of buckets
and writes to them all sequentially, so one open bucket for the journal.

The one exception is the superblock, but that doesn't get written to in
normal operation. I am eventually going to switch to using another
journal for the superblock, as part of bcache FTL.

We do guarantee that buckets are always written to sequentially (save
the superblock). If discards are on, bcache will always issue a discard
before it starts writing to a bucket again (except for the journal; that
part's unfinished).
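
As a rough model of that guarantee (a sketch in Python under my own
assumptions; the class and its fields are hypothetical, not bcache's data
structures): each bucket has a write pointer that only moves forward, and
reusing a bucket resets the pointer after an optional discard.

```python
class Bucket:
    """Toy model of an append-only bucket (hypothetical, for illustration)."""

    def __init__(self, size_sectors: int, discard_enabled: bool = True):
        self.size = size_sectors
        self.write_ptr = 0          # next sector offset to be written
        self.discards = 0           # number of discards issued so far
        self.discard_enabled = discard_enabled

    def append(self, nr_sectors: int) -> int:
        """Write at the current pointer only; return the starting offset."""
        if nr_sectors <= 0:
            raise ValueError("nothing to write")
        if self.write_ptr + nr_sectors > self.size:
            raise ValueError("bucket full")
        start = self.write_ptr
        self.write_ptr += nr_sectors
        return start

    def reuse(self) -> None:
        """Discard first (if enabled), then start writing from offset 0."""
        if self.discard_enabled:
            self.discards += 1      # stands in for issuing a real discard
        self.write_ptr = 0
```

There is no way to write at an arbitrary offset in this model, which is
the whole point: the device only ever sees sequential writes per bucket.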

> On a different note, we had discussed at the last storage/fs summit about
> using an SSD cache either without a backing store or having the backing
> store on the same drive as the cache in order to optimize traditional
> file system on low-end flash media. Have you considered these scenarios?
> How hard would it be to support this in a meaningful way? My hope is that
> by sacrificing some 10% of the drive size, you would get significantly
> improved performance because you can avoid many internal GC cycles within
> the drive.

Yeah, so... what you really want there is to move the FTL into the
kernel, so you can have an FTL that doesn't suck. Bcache is already
about 90% of the way to being a full blown high performance FTL...

Besides the metadata stuff that I sort of covered above, the other thing
that'd have to be done to use bcache as an FTL and not a cache is we'd
just need a moving garbage collector - so when a bucket is mostly
empty but has data we need to keep we can move it somewhere else. But
this is pretty similar to what background writeback does now, so it'll
be easy and straightforward.
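
A minimal sketch of what such a moving collector might do, in Python and
under stated assumptions: the function name, the 25% threshold, and the
dict-of-lists representation are all hypothetical, and a real
implementation would also have to update the index for every moved
extent.

```python
def moving_gc(buckets, bucket_size, free_ids, threshold=0.25):
    """Compact mostly-empty buckets.

    buckets:  dict of bucket id -> list of live items in that bucket
    free_ids: ids of empty buckets available as destinations
    Returns (new bucket map, list of bucket ids that were freed).
    """
    limit = int(bucket_size * threshold)
    # Victims hold some live data, but no more than the threshold.
    victims = [b for b, live in buckets.items() if 0 < len(live) <= limit]
    moved = [item for b in victims for item in buckets[b]]

    # Only worth doing if we free more buckets than we consume.
    needed = -(-len(moved) // bucket_size)   # ceiling division
    if needed >= len(victims):
        return {b: list(live) for b, live in buckets.items()}, []

    result = {b: list(live) for b, live in buckets.items() if b not in victims}
    free = list(free_ids)
    for i in range(0, len(moved), bucket_size):
        dest = free.pop(0)                   # IndexError if none are free
        result[dest] = moved[i:i + bucket_size]
    return result, victims
```

For example, two buckets holding one and two live items can be packed
into a single fresh bucket, freeing both of the originals.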

So yeah, it can be done :)

Further off, what I really want to do is extend bcache somewhat to turn
it into the bottom half of a filesystem...

It sounds kind of crazy at first, but - well, bcache already has an
index, allocation and garbage collection. Right now the index is
device:sector -> cache device:phys sector. If we just switch to
inode:sector -> ... we can map files in the filesystem with the exact
same index we're using for the cache. Not just the same code, the same
index.
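
To illustrate why the same index works for both cases, here is a
deliberately simplified sketch in Python. Everything in it is an
assumption made for illustration: the class is hypothetical, it maps
single sectors through a dict rather than extents through a btree, and
it says nothing about the on-disk format. The only point is that the
lookup code does not care whether the first half of the key is a device
number or an inode number.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[int, int]   # (id, sector): id is a device today, an inode later
Ptr = Tuple[int, int]   # (cache device, physical sector)


class Index:
    """Toy stand-in for the index (hypothetical, dict instead of a btree)."""

    def __init__(self) -> None:
        self._map: Dict[Key, Ptr] = {}

    def insert(self, ident: int, sector: int,
               cache_dev: int, phys_sector: int) -> None:
        self._map[(ident, sector)] = (cache_dev, phys_sector)

    def lookup(self, ident: int, sector: int) -> Optional[Ptr]:
        """Return where the data lives, or None if it is not mapped."""
        return self._map.get((ident, sector))
```

Used as a cache, ident is the backing device; used as the bottom half of
a filesystem, ident is the inode number and nothing else changes.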

Then the rough plan is that the layer above bcache - the filesystem
proper - will store all the inodes in a file (say file 0); then when
bcache is doing garbage collection it has to be able to ask the fs "How
big is file n supposed to be?". It gives us a very nice separation
between layers.

There's a ton of other details - bcache then needs to handle allocation
for rotating disks and not just SSDs, and you want to do that somewhat
differently as fragmentation matters. But the idea seems to have legs.

Also, bcache is gaining some nifty functionality that'd be nice to have
available in the filesystem proper - we're working on full data
checksumming right now, in particular. We might be able to pull off all
the features of ZFS and then some, and beat ZFS on performance (maybe
even with a smaller codebase!).

If you don't think I'm completely insane and want to hear more, let me
know :)