First, I am really happy to see this project appearing here.

2010/11/21 Kent Overstreet <kent.overstreet@xxxxxxxxx>:
> Bcache is a patch to use SSDs to transparently cache arbitrary block
> devices. Its main claim to fame is that it's designed for the
> performance characteristics of SSDs - it avoids random writes and
> extraneous IO at all costs, instead allocating buckets sized to your
> erase blocks and filling them up sequentially. It uses a hybrid
> btree/log instead of a hash table, as some other caches do.

Is this the main difference from flashcache?
https://github.com/facebook/flashcache/blob/master/doc/flashcache-doc.txt

> It does both writethrough and writeback caching - it can use most of
> your SSD for buffering random writes, which are then flushed
> sequentially to the backing device. It skips sequential IO, too.
>
> Current status:
> Recovering from unclean shutdown has been the main focus, and is now
> working magnificently - I'm having no luck breaking it. This version
> looks to be plenty safe enough for beta testing (still, make backups).
>
> Proper discard support is in and enabled by default; bcache won't ever
> write to the same location twice without issuing a discard to that
> bucket.

Is this related to the possible torn-page issue outlined by the
flashcache developers?

> On my test box with a Corsair Nova, I'm seeing around a 30% hit
> in mysql performance with it on - there might be a bit of room for
> improvement, but I'm also curious whether other drives do better. Even
> with that hit it's well worth it though; the performance degradation
> over time on this drive without TRIM is massive.
>
> The sysfs stuff has all been moved around and should be a little more
> standard now; the few files that aren't specific to a device
> (register_cache, register_dev) could use a better location - any
> suggestions?
>
> The btree cache has been rewritten and simplified, and should exhibit
> less memory pressure than the old code.
>
> The initial implementation of incremental garbage collection is done -
> this version doesn't yet normally gc incrementally, as it was needed to
> handle allocation failure without deadlocking while ordering writes
> correctly. But finishing it is only a bit more work and will give much
> better worst-case latency and slightly better cache utilization.
>
> Bcache is available from
> git://evilpiepirate.org/~kent/linux-bcache.git
> git://evilpiepirate.org/~kent/bcache-tools.git
>
> And the (somewhat outdated) wiki is
> http://bcache.evilpiepirate.org
>
> diff --git a/Documentation/bcache.txt b/Documentation/bcache.txt
> new file mode 100644
> index 0000000..fc0ebac
> --- /dev/null
> +++ b/Documentation/bcache.txt
> @@ -0,0 +1,170 @@
> +Say you've got a big slow raid 6, and an X-25E or three. Wouldn't it be
> +nice if you could use them as cache... Hence bcache.
> +
> +Userspace tools and a wiki are at:
> +  git://evilpiepirate.org/~kent/bcache-tools.git
> +  http://bcache.evilpiepirate.org
> +
> +It's designed around the performance characteristics of SSDs - it only
> +allocates in erase-block-sized buckets, and it uses a hybrid btree/log
> +to track cached extents (which can be anywhere from a single sector to
> +the bucket size). It's designed to avoid random writes at all costs; it
> +fills up an erase block sequentially, then issues a discard before
> +reusing it.
> +
> +Caching can be transparently enabled and disabled on arbitrary block
> +devices while they're in use. A cache stores the UUIDs of the devices
> +it is caching, allowing caches to safely persist across reboots.
> +There's currently a hard limit of 256 backing devices per cache.
> +
> +Both writethrough and writeback caching are supported. Writeback
> +defaults to off, but can be switched on and off arbitrarily at runtime.
> +Bcache goes to great lengths to order all writes to the cache so that
> +the cache is always in a consistent state on disk, and it never returns
> +writes as completed until all necessary data and metadata writes are
> +completed. It's designed to safely tolerate unclean shutdown without
> +loss of data.
> +
> +Writeback caching can use most of the cache for buffering writes -
> +writing dirty data to the backing device is always done sequentially,
> +scanning from the start to the end of the index.
> +
> +Since random IO is what SSDs excel at, there generally won't be much
> +benefit to caching large sequential IO. Bcache detects sequential IO
> +and skips it; it also keeps a rolling average of the IO sizes per task,
> +and as long as the average is above the cutoff it will skip all IO from
> +that task - instead of caching the first 512k after every seek. Backups
> +and large file copies should thus entirely bypass the cache.
> +
> +If an IO error occurs or an inconsistency is detected, caching is
> +automatically disabled; if dirty data was present in the cache, bcache
> +first disables writeback caching and waits for all dirty data to be
> +flushed.
> +
> +All configuration is done via sysfs. To use sde to cache md1, assuming
> +the SSD's erase block size is 128k:
> +
> +  make-bcache -b128k /dev/sde
> +  echo "/dev/sde" > /sys/kernel/bcache/register_cache
> +  echo "<UUID> /dev/md1" > /sys/kernel/bcache/register_dev
> +
> +More suitable for scripting might be:
> +  echo "`blkid /dev/md1 -s UUID -o value` /dev/md1" \
> +    > /sys/kernel/bcache/register_dev
> +
> +Then, to enable writeback:
> +
> +  echo 1 > /sys/block/md1/bcache/writeback
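For anyone wanting to try this, the steps above can be glued together
into one small script. This is just a sketch built from the commands in
the documentation - /dev/sde, /dev/md1 and the 128k erase block size are
placeholders to adjust for your own hardware:

    #!/bin/sh
    # Sketch: format an SSD as a bcache cache, attach a backing device
    # by UUID, and turn on writeback. Device names and the erase block
    # size are example values - substitute your own.
    CACHE=/dev/sde
    BACKING=/dev/md1

    make-bcache -b128k "$CACHE" || exit 1

    echo "$CACHE" > /sys/kernel/bcache/register_cache
    echo "`blkid $BACKING -s UUID -o value` $BACKING" \
      > /sys/kernel/bcache/register_dev

    # Writeback defaults to off; enable it explicitly. The sysfs path
    # uses the bare device name (md1), not the /dev/ path.
    echo 1 > /sys/block/${BACKING##*/}/bcache/writeback

(All of this needs root, of course, since it writes to sysfs.)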
> +
> +Other sysfs files for the backing device:
> +
> +  bypassed
> +    Sum of all IO, reads and writes, that has bypassed the cache
> +
> +  cache_hits
> +  cache_misses
> +  cache_hit_ratio
> +    Hits and misses are counted per individual IO as bcache sees them;
> +    a partial hit is counted as a miss.
> +
> +  clear_stats
> +    Writing to this file resets all the statistics
> +
> +  flush_delay_ms
> +  flush_delay_ms_sync
> +    Optional delay for btree writes to allow for more coalescing of
> +    updates to the index. Defaults to 10 ms for normal writes and 0
> +    for sync writes.
> +
> +  sequential_cutoff
> +    A sequential IO will bypass the cache once it passes this
> +    threshold; the most recent 128 IOs are tracked so sequential IO
> +    can be detected even when it isn't all done at once.
> +
> +  unregister
> +    Writing to this file disables caching on that device
> +
> +  writeback
> +    Boolean; if off, only writethrough caching is done
> +
> +  writeback_delay
> +    When dirty data is written to the cache and the cache previously
> +    contained none, wait this many seconds before initiating
> +    writeback. Defaults to 30.
> +
> +  writeback_percent
> +    To allow for more buffering of random writes, writeback only
> +    proceeds when more than this percentage of the cache is
> +    unavailable. Defaults to 0.
> +
> +  writeback_running
> +    If off, writeback of dirty data will not take place at all. Dirty
> +    data will still be added to the cache until it is mostly full;
> +    only meant for benchmarking. Defaults to on.
> +
> +For the cache:
> +
> +  btree_avg_keys_written
> +    Average number of keys per write to the btree when a node wasn't
> +    being rewritten - indicates how much coalescing is taking place.
> +
> +  btree_cache_size
> +    Number of btree buckets currently cached in memory
> +
> +  btree_written
> +    Sum of all btree writes, in (kilo/mega/giga) bytes
> +
> +  clear_stats
> +    Clears the statistics associated with this cache
> +
> +  discard
> +    Boolean; if on, a discard/TRIM will be issued to each bucket
> +    before it is reused. Defaults to on if supported.
> +
> +  heap_size
> +    Number of buckets that are available for reuse (aren't used by the
> +    btree or dirty data)
> +
> +  nbuckets
> +    Total buckets in this cache
> +
> +  synchronous
> +    Boolean; when on, all writes to the cache are strictly ordered
> +    such that it can recover from unclean shutdown. If off it will not
> +    generally wait for writes to complete, but the entire cache
> +    contents will be invalidated on unclean shutdown. Turning it off
> +    is not recommended when writeback is on.
> +
> +  unregister
> +    Closes the cache device and all devices being cached; if dirty
> +    data is present it will disable writeback caching and wait for it
> +    to be flushed.
> +
> +  written
> +    Sum of all data that has been written to the cache; comparison
> +    with btree_written gives the amount of write inflation in bcache.
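As a quick illustration of that last point, the two counters can be
read back and compared from a shell. A sketch only: the cache's sysfs
directory name below is my guess (the documentation doesn't say where a
registered cache's files live), and the counters may be printed with
k/M/G suffixes rather than raw bytes:

    #!/bin/sh
    # Sketch: gauge write inflation by comparing data writes with btree
    # (metadata) writes. CACHE_SYS is a hypothetical path - point it at
    # wherever your registered cache's sysfs files actually live.
    CACHE_SYS=/sys/kernel/bcache/cache0

    data=`cat $CACHE_SYS/written`
    btree=`cat $CACHE_SYS/btree_written`
    echo "data written:  $data"
    echo "btree written: $btree"
    # If both values were raw byte counts, the inflation factor would
    # be roughly (data + btree) / data.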
> +
> +Caveats:
> +
> +Bcache appears to be quite stable and reliable at this point, but
> +there are a number of potential issues.
> +
> +The ordering requirement of barriers is silently ignored; for ext4
> +(and possibly other filesystems) you must explicitly mount with
> +-o nobarrier or you risk severe filesystem corruption in the event of
> +unclean shutdown.
> +
> +A change to the generic block layer for ad hoc bio splitting can
> +potentially break other things; if a bio is used without calling
> +bio_init(), or bio_endio() is called more than once, the kernel will
> +BUG(). Ext4, raid1, raid10 and lvm work fine for me; raid5/6 and, I'm
> +told, btrfs do not.
> +
> +Caching partitions doesn't do anything (though using partitions as
> +caches works just fine); use the whole device instead.
> +
> +Nothing is done to prevent the use of a backing device without the
> +cache it has been used with while the cache contains dirty data; if
> +you do that, terrible things will happen.
> +
> +Furthermore, if the cache didn't have any dirty data and you mount the
> +backing device without the cache, you've now made the cache contents
> +stale and they need to be manually invalidated. For now the only way
> +to do that is to rerun make-bcache. The solution to both issues will
> +be the introduction of a bcache-specific container format for the
> +backing device, which will come at some point in the future along with
> +thin provisioning and volume management.

--
Cédric Villemain
2ndQuadrant
http://2ndQuadrant.fr/ ; PostgreSQL : Expertise, Formation et Support