Hi All,

I have been experimenting with EC pools and cache tiers to make them more useful for more active data sets on RBD volumes, and I thought I would share my findings so far, as they have made quite a significant difference.

My Ceph cluster comprises 4 nodes, each with the following:

- 12x 2.1GHz Xeon cores
- 32GB RAM
- 2x 10GbE networking, ALB bonded
- 10x 3TB WD Red Pro disks (7200rpm) - EC pool, k=3 m=3
- 2x S3700 100GB SSDs (20k write IOPS) for HDD journals
- 1x S3700 400GB SSD (35k write IOPS) for the cache tier (3x replica pool)

One thing I noticed with default settings is that when encountering a cache miss, performance dropped significantly, to a level far below that of the 7200rpm disks. I also noticed that despite having around 500GB of cache tier available, I was getting a lot more misses than I would expect for the amount of data I had on the pool. I had a theory that the default RBD object size of 4MB was the cause of both these anomalies. To put this to the test I created a test EC pool and cache pool and did some testing with different RBD order values (rough commands and fio job files are included further down for anyone who wants to reproduce this).

First though, I performed some basic benchmarking on one of the 7200rpm disks and came up with the following (all values at an IO depth of 1):

  IO size    IOPS
  4MB        25
  1MB        52
  256KB      73
  64KB       81
  4KB        83

As you can see, random IO performance really starts to drop off once you go above about a 256KB IO size, and below that there are diminishing returns, as bandwidth drops off dramatically with smaller IO sizes.

When using an EC pool, each object is split into shards which are stored on separate disks. I'm using a k=3 EC pool, so a 4MB object will be split into roughly 1.33MB shards, which, as can be seen above, is really not the best IO size for random performance.

I then created 4 RBDs with object sizes of 4MB, 2MB, 1MB and 512KB, filled them with data, evicted the cache pool and then used fio to perform random 64KB reads. Results as follows:

  RBD obj size    Shard size    64K IOPS
  4MB             1.33MB        24
  2MB             0.66MB        38
  1MB             0.33MB        51
  512KB           0.17MB        58

As can be seen, there is a doubling of random IO performance between 4MB and 1MB object sizes, and looking at the shard sizes, this correlates quite nicely with the disk benchmarks I did earlier. Going to a 512KB object size does improve performance further, but it is starting to tail off.

The other benefit of using a smaller object size seems to be that the cache tier is much more effective at caching hot blocks, as a single IO promotes/evicts less data. I don't have any numbers on this yet, but I will try and get fio to create a hot-spot access pattern so I can generate some reliable figures (a sketch of such a job is included further down). From just using the RBDs it certainly feels like the cache is doing a much better job with 1MB object sizes.

Incidentally, I looked at some RAID 6 write benchmarks with varying chunk sizes, since when doing a write you have to read the whole stripe back. Most of those benchmarks also show performance dropping off past 256/512KB chunk sizes.

The next thing I tried was to change the read_expire parameter of the deadline scheduler to 30ms, to make sure that reads are prioritised even more than the default. Again I don't have numbers for this yet, but watching iostat seems to show that reads are happening with a much more predictable latency. Delaying the writes should not be such a problem, as the journals buffer them.

To summarize, trying to get your EC shards to be around 256KB seems to improve random IO at the cost of some bandwidth.
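For anyone wanting to reproduce this, below is roughly how the test pools were set up. This is only a sketch; the profile/pool names, PG counts and target_max_bytes here are placeholders rather than my exact values, so adjust to suit your own cluster:

  ceph osd erasure-code-profile set ec-3-3 k=3 m=3
  ceph osd pool create ecpool 512 512 erasure ec-3-3
  ceph osd pool create cachepool 512 512
  # (cachepool also needs a crush ruleset that puts it on the SSDs - not shown here)
  ceph osd tier add ecpool cachepool
  ceph osd tier cache-mode cachepool writeback
  ceph osd tier set-overlay ecpool cachepool
  ceph osd pool set cachepool hit_set_type bloom
  ceph osd pool set cachepool target_max_bytes 400000000000   # cap the cache at ~400GB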
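The raw disk numbers in the first table came from running fio directly against one of the 7200rpm drives at queue depth 1, varying bs between runs. Something along these lines (the device name is a placeholder - make sure the disk isn't in use, and stick to reads if it holds data):

  fio --name=disk-bench --filename=/dev/sdX --ioengine=libaio --direct=1 \
      --rw=randread --bs=64k --iodepth=1 --runtime=60 --time_based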
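The four test RBDs were created with different object sizes via the --order flag (object size = 2^order bytes); image names and sizes here are just examples:

  # images live in the EC pool; IO goes via the cache tier overlay
  rbd create -p ecpool --size 51200 --order 22 test-4m     # 4MB objects
  rbd create -p ecpool --size 51200 --order 21 test-2m     # 2MB objects
  rbd create -p ecpool --size 51200 --order 20 test-1m     # 1MB objects
  rbd create -p ecpool --size 51200 --order 19 test-512k   # 512KB objects

  # after filling the images with data, flush and evict everything from the cache tier
  rados -p cachepool cache-flush-evict-all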
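The 64KB random read figures came from fio's rbd ioengine, with a job along these lines (pool, image and client names are whatever you used above):

  [global]
  ioengine=rbd
  clientname=admin
  pool=ecpool
  rbdname=test-1m
  rw=randread
  bs=64k
  iodepth=1
  runtime=60
  time_based

  [randread-64k]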
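For the hot-spot testing I mentioned, the plan is to lean on fio's random_distribution option so that most IOs land on a small portion of the image; something like the job below, with the zipf theta value just a starting guess to tune later:

  [global]
  ioengine=rbd
  clientname=admin
  pool=ecpool
  rbdname=test-1m
  rw=randread
  bs=64k
  iodepth=1
  runtime=300
  time_based
  random_distribution=zipf:1.2   # skew most IOs towards a small hot set of blocks

  [hotspot-64k]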
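And the deadline scheduler tweak, applied per OSD data disk (the default read_expire is 500ms; sdX is a placeholder, and the setting needs re-applying after a reboot unless you persist it via udev or similar):

  echo deadline > /sys/block/sdX/queue/scheduler
  echo 30 > /sys/block/sdX/queue/iosched/read_expire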
If your EC pool holds data that is rarely accessed, or only does large block IOs, then the default object size probably won't have much of an impact. There is also the fact that you will now have 4 times more objects; I'm not sure what the impact of this is, maybe someone can comment?

If anyone notices that I have made any mistakes in my tests or assumptions, please let me know.

Nick