Cache Pool writing too much on ssds, poor performance?

Something that is very important to keep in mind with the way the cache
tier implementation currently works in Ceph is that cache misses are
very expensive.  Your workload really needs a big hot/cold data skew,
otherwise it's not going to work well at all.  In your case you are
doing sequential reads, which is about the worst case for this: within
each pass you never re-read the same blocks, and by the time you get to
the end of the test and restart it, the first blocks (apparently) have
already been flushed.  If you increased the size of the cache tier you
might be able to fit the whole data set in cache, which should help
dramatically, but that's not easy to guarantee outside of benchmarks.
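
For what it's worth, how much the cache pool is allowed to hold before it
starts flushing and evicting is controlled by a few settings on the cache
pool itself.  A rough sketch (the pool name "cachepool" and the specific
numbers are only placeholders, adjust them to your SSD capacity):

   # cap the cache pool at roughly 500GB
   ceph osd pool set cachepool target_max_bytes 500000000000
   # start flushing dirty objects at ~40% full, evicting clean ones at ~80%
   ceph osd pool set cachepool cache_target_dirty_ratio 0.4
   ceph osd pool set cachepool cache_target_full_ratio 0.8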

I'm guessing you are using firefly, right?  To improve this behaviour,
Sage implemented a new policy in the recent development releases that
doesn't promote on reads right away.  Instead, we wait to promote until
there have been several reads of the same object within a certain time
period.  That should help dramatically in this case.  You really don't
want big sequential reads being promoted into the cache, since promotion
is expensive and the spinning disks are quite good at that kind of
workload anyway.
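
If you do try one of the development releases, the knob for this lives on
the cache pool and relies on HitSets being tracked.  Something along these
lines should work (the values are purely illustrative, and the recency
option is not available in firefly):

   # track object hits so read recency can be evaluated
   ceph osd pool set cachepool hit_set_type bloom
   ceph osd pool set cachepool hit_set_count 4
   ceph osd pool set cachepool hit_set_period 600
   # only promote on read if the object shows up in recent HitSets
   ceph osd pool set cachepool min_read_recency_for_promote 2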

On the flip side, 4MB read misses are bad, but the situation is even
worse with 4K misses.  Imagine, for instance, that you do a 4K read
from a default 4MB RBD block and that object is not in cache.  In the
implementation we have in firefly, the whole 4MB object will be promoted
to the cache, which in most cases requires transferring that object over
the network to the primary OSD for the associated PG in the cache pool.
Depending on the replication policy, that primary cache pool OSD will
then fan out and write (by default) 2 extra copies of the data to the
other OSDs in the PG, so 3 in total.  And assuming your cache tier is on
SSDs with co-located journals, each one of those writes will actually be
2 writes: one to the journal, and one to the data store.

To recap: *any* read miss, regardless of whether it's 4KB or 4MB, means
at least one 4MB object promotion, times 3 replicas (i.e. 12MB over the
network), times 2 for the journal writes.  So 24MB of data written to
the cache tier disks, no matter if the read was 4KB or 4MB.  Imagine you
have 200 IOPS worth of 4KB read cache misses: that's roughly 4.8GB/s of
writes into the cache tier.  If you are seldom re-reading the same
blocks, performance will be absolutely terrible.  On the other hand, if
you have lots of small random reads from the same set of 4MB objects,
the cache tier really can help.  How much it helps vs just doing the
reads from page cache is debatable though.  There's some band between
page cache and disk where the cache tier fits in, but getting everything
right is going to be tricky.
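
Just to make the arithmetic explicit, a quick back-of-the-envelope check
with the numbers above:

   # 200 misses/s * 4MB promoted * 3 replicas * 2 writes each (journal + data store)
   echo $(( 200 * 4 * 3 * 2 ))MB/s    # prints 4800MB/s, i.e. ~4.8GB/s of SSD writes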

The optimal situation, imho, is for the cache tier to only promote
objects that are getting a lot of small random reads, and to be very
conservative about promotions, especially for new writes.  I don't know
whether or not cache promotion might pollute the page cache in strange
ways, but that's something we may also need to be careful about.

For more info, see the following thread:

http://www.spinics.net/lists/ceph-devel/msg20189.html

Mark

On 09/10/2014 07:51 AM, Andrei Mikhailovsky wrote:
> Hello guys,
>
> I am experimenting with the cache pool and running some tests to see how
> adding the cache pool improves the overall performance of our small cluster.
>
> While doing the testing I've noticed that the cache pool seems to be
> writing too much to the cache pool ssds.  Not sure what the issue is here,
> perhaps someone could help me understand what is going on.
>
> My test cluster is:
> 2 x OSD servers (each server has: 24GB ram, 12 cores, 8 hdd osds, 2 ssd
> journals, 2 ssds for the cache pool, 40gbit/s infiniband network capable of
> 25gbit/s over ipoib).  Cache pool is set to 500GB with a replica of 2.
> 4 x host servers (128GB ram, 24 cores, 40gbit/s infiniband network
> capable of 12gbit/s over ipoib)
>
> So, my test is:
> Simple tests using the following command: "dd if=/dev/vda of=/dev/null
> bs=4M count=2000 iflag=direct". I am concurrently starting this command
> on 10 virtual machines which are running on 4 host servers. The aim is
> to monitor the use of the cache pool when reading the same data over and
> over again.
>
>
> Running the above command for the first time does what I was expecting.
> The osds are doing a lot of reads, the cache pool does a lot of writes
> (around 250-300MB/s per ssd disk) and no reads.  The dd results for the
> guest vms are poor.  The output of "ceph -w" shows consistent
> performance over time.
>
> Running the above for the second and subsequent times produces IO
> patterns which I was not expecting at all.  The hdd osds are not doing
> much (this part I expected), but the cache pool still does a lot of writes
> and very little reads!  The dd results have improved just a little, but
> not much.  The output of "ceph -w" shows breaks in performance over
> time.  For instance, I get a peak of throughput in the first couple of
> seconds (data is probably coming from the osd server's ram at a high
> rate).  After the peak throughput has finished, the ceph reads are done
> in the following way: 2-3 seconds of activity followed by 2 seconds of
> inactivity, and it keeps doing that throughout the length of the test.
> So, to put the numbers in perspective, when running the tests over and
> over again I would get around 2000-3000MB/s for the first two seconds,
> followed by 0MB/s for the next two seconds, followed by around
> 150-250MB/s over 2-3 seconds, followed by 0MB/s for 2 seconds, followed
> by 150-250MB/s over 2-3 seconds, followed by 0MB/s over 2 seconds, and
> the pattern repeats until the test is done.
>
>
> I kept running the dd command for about 15-20 times and observed the
> same behaviour.  The cache pool does mainly writes (around 200MB/s per
> ssd) while the guest vms are reading the same data over and over again.
> There is very little read IO (around 20-40MB/s).  Why am I not getting
> high read IO?  I had expected the 80GB of data that is being read from
> the vms over and over again to be firmly recognised as hot data, kept
> in the cache pool, and read from it when the guest vms request the data.
> Instead, I mainly get writes on the cache pool ssds and I am not really
> sure where these writes are coming from, as my hdd osds are pretty
> idle.
>
>  From the overall tests so far, introducing the cache pool has
> drastically slowed down my cluster (by as much as 50-60%).
>
> Thanks for any help
>
> Andrei
>
>
>


