Cache Tiering Performance Ideas

Hi All,

Earlier today I had a great conversation with some of the Gluster developers about cache tiering. They want to implement something similar to what we've done and wanted to know what kinds of performance problems we've run into and brainstorm ideas to avoid similar issues for Gluster.

One of the big problems we've had occurs when using RBD, default 4MB block sizes, a cache pool, and 4k reads. A single 4k read miss currently causes a full-object promotion into the cache. When you factor in that journals will also receive a copy of the data and that you will want some level of replication (in our case 3x), that single miss actually results in 24MB of data being written to the cache pool. (With 12MB of it happening over the network!)
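
To make the arithmetic explicit (this is my back-of-the-envelope accounting: each replica is journaled once, and the network figure counts the read from the base pool plus the two replica sends):

    # Rough cost of promoting one 4MB object on a single 4k read miss.
    object_mb = 4                  # default RBD object size
    replicas = 3                   # cache pool replication (our setup)
    journaled = 2                  # each replica hits journal, then data store
    written_mb = object_mb * replicas * journaled       # 24 MB to cache OSDs
    network_mb = object_mb + object_mb * (replicas - 1) # 4 MB base-pool read
                                                        # + 8 MB replica sends
    print(written_mb, network_mb)  # 24 12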

In Gluster they will be caching files rather than objects, and that is both good and bad. A 40GB file promotion is going to be extremely expensive, so they will want to be very careful to account for file size when making promotion decisions. That will make it very tough for them to balance promoting large files when small IO is happening against them. They have an advantage, though: file metadata is stored on the same server that makes the promotion decision. They can use things like the file name (higher/lower promotion thresholds based on file type) and potentially the file size (except for initial writes) to influence when things go to cache.
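
For example, one way to account for file size would be to scale how much "heat" a file must accumulate before promotion. A rough sketch (the function name and log2 scaling are purely illustrative, not anything Gluster has committed to):

    import math

    # Hypothetical: require more accumulated heat (access count) before
    # promoting bigger files, so a 40GB file needs far more evidence
    # than a small one before we pay for its promotion.
    def promotion_heat_threshold(file_size, base_threshold=2,
                                 reference_size=4 * 1024 * 1024):
        if file_size <= reference_size:
            return base_threshold
        # one extra required hit per power of two above the 4MB reference
        return base_threshold + int(math.log2(file_size / reference_size))

Under that scaling a 40GB file would need 13 extra hits beyond the base threshold before being promoted.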

In Ceph, with something like RBD, I don't think we can easily use file information to improve cache tier behaviour, but we may be able to do something else. I wonder if, at the RBD level, we could inspect the writes being made to blocks and detect whether a given write is part of a larger sequential write stream. If so, set a flag that persists with those objects indicating that they may be part of a large file. The idea being that such objects are more likely to be read back sequentially, where we can use read ahead and writing to the cache has more disadvantages than advantages.
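
A minimal sketch of the detection side, just to make the idea concrete (offsets tracked per image; the run length of 4 is an arbitrary placeholder):

    # Hypothetical: track whether writes to an image extend a sequential
    # run.  Once the run is long enough, the objects it touches could be
    # flagged as "probably part of a large file".
    class SequentialWriteDetector:
        def __init__(self, min_run=4):
            self.last_end = None   # end offset of the previous write
            self.run = 0           # consecutive sequential writes seen
            self.min_run = min_run

        def observe(self, offset, length):
            if self.last_end is not None and offset == self.last_end:
                self.run += 1
            else:
                self.run = 0
            self.last_end = offset + length
            return self.run >= self.min_run   # True -> set the flag

Unsetting the flag would be the mirror image: a few random small reads/writes against a flagged object clears it.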

General Assumptions:

1) Large writes and reads should be served by the base pool rather than the cache. Big promotions to the cache tier are expensive (network consumption, write amplification), and spinning disks are already good at large sequential IO.

2) Writes to a full cache tier cause other hot or semi-hot data to be evicted. For new writes, even if they are smallish, it might not be worth writing to the cache tier when it's full.

3) The best thing the cache tier can do for us is cache small objects, or larger objects with small IO being performed against them. The larger the object, the more expensive the promotion.


Questions:

1) If RBD is seeing a stream of large writes to consecutive blocks, should we set a persistent flag on those objects so that their promotion threshold is higher than normal? The assumption being that the reads will also be large until we see random small reads/writes against them (at which point we can unset the flag).

2) If RBD reads/writes are smaller than some threshold and the cache isn't full, should we just promote to cache? If the cache is full, should we be more selective? Should the threshold be different for promotions for reads vs initial writes?

3) Do we have other data available that we can use to guess when a promotion won't provide a lot of benefit?
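
To pull questions 1 and 2 (and assumptions 1 and 2) together, the policy I have in mind looks roughly like this. All the names and thresholds are placeholders, and full_ratio would come from whatever the cache tier already tracks:

    # Hypothetical promotion policy sketch.
    def should_promote(io_size, seq_flag, hits, full_ratio,
                       small_io=64 * 1024,    # "small" IO cutoff
                       full_cutoff=0.8):      # "cache is full" cutoff
        # Large IO is better served directly by the base pool.
        if io_size >= small_io:
            return False
        # Flagged sequential objects need extra evidence of random
        # access before promotion.
        if seq_flag and hits < 3:
            return False
        # When the cache is nearly full, be pickier so we don't evict
        # hotter data for a cold newcomer.
        if full_ratio >= full_cutoff:
            return hits >= 2
        return True

Reads vs. initial writes could simply get different small_io values passed in.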

Mark