Hi All,

I have been experimenting with EC pools and cache tiers to make them more useful for more active data sets on RBD volumes, and I thought I would share my findings so far, as they have made quite a significant difference.

My Ceph cluster comprises 4 nodes, each with the following:

- 12x 2.1GHz Xeon cores
- 32GB RAM
- 2x 10GbE networking, ALB bonded
- 10x 3TB WD Red Pro disks (7200rpm) - EC pool, k=3 m=3
- 2x S3700 100GB SSDs (20k write IOPS) for HDD journals
- 1x S3700 400GB SSD (35k write IOPS) for the cache tier (3x replica pool)

One thing I noticed with default settings is that when encountering a cache miss, performance dropped significantly, to a level far below that of the 7200rpm disks. I also noticed that despite having around 500GB of cache tier available, I was getting a lot more misses than I would expect for the amount of data I had on the pool. I had a theory that the default RBD object size of 4MB was the cause of both these anomalies. To put this to the test I created a test EC pool and cache pool and did some testing with different RBD order values (rough commands and fio job files are included further down for anyone who wants to reproduce this).

First though, I performed some basic benchmarking on one of the 7200rpm disks and came up with the following (all values at an IO depth of 1):

  IO size    IOPS
  4MB        25
  1MB        52
  256KB      73
  64KB       81
  4KB        83

As you can see, random IO performance really starts to drop off once you go above about a 256KB IO size, and below that there are diminishing returns, as bandwidth drops off dramatically with smaller IO sizes.

When using an EC pool, each object is split into shards which are stored on separate disks. I'm using a k=3 EC pool, so a 4MB object will be split into roughly 1.33MB shards, which, as can be seen above, is really not the best IO size for random performance.

I then created 4 RBDs with object sizes of 4MB, 2MB, 1MB and 512KB, filled them with data, evicted the cache pool and then used fio to perform random 64KB reads. Results as follows:

  RBD obj size    Shard size    64K IOPS
  4MB             1.33MB        24
  2MB             0.66MB        38
  1MB             0.33MB        51
  512KB           0.17MB        58

As can be seen, there is a doubling of random IO performance between 4MB and 1MB object sizes, and looking at the shard sizes, this correlates quite nicely with the disk benchmarks I did earlier. Going to a 512KB object size does improve performance further, but it is starting to tail off.

The other benefit of using a smaller object size seems to be that the cache tier is much more effective at caching hot blocks, as a single IO promotes/evicts less data. I don't have any numbers on this yet, but I will try and get fio to create a hot-spot access pattern so I can generate some reliable figures (a sketch of such a job is included further down). From just using the RBDs it certainly feels like the cache is doing a much better job with 1MB object sizes.

Incidentally, I looked at some RAID 6 write benchmarks with varying chunk sizes, since when doing a write you have to read the whole stripe back. Most of those benchmarks also show performance dropping off past 256/512KB chunk sizes.

The next thing I tried was to change the read_expire parameter of the deadline scheduler to 30ms, to make sure that reads are prioritised even more than the default. Again I don't have numbers for this yet, but watching iostat seems to show that reads are happening with a much more predictable latency. Delaying the writes should not be such a problem, as the journals buffer them.

To summarize, trying to get your EC shards to be around 256KB seems to improve random IO at the cost of some bandwidth.
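For anyone wanting to reproduce this, below is roughly how the test pools were set up. This is only a sketch; the profile/pool names, PG counts and target_max_bytes here are placeholders rather than my exact values, so adjust to suit your own cluster:

  ceph osd erasure-code-profile set ec-3-3 k=3 m=3
  ceph osd pool create ecpool 512 512 erasure ec-3-3
  ceph osd pool create cachepool 512 512
  # (cachepool also needs a crush ruleset that puts it on the SSDs - not shown here)
  ceph osd tier add ecpool cachepool
  ceph osd tier cache-mode cachepool writeback
  ceph osd tier set-overlay ecpool cachepool
  ceph osd pool set cachepool hit_set_type bloom
  ceph osd pool set cachepool target_max_bytes 400000000000   # cap the cache at ~400GB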
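The raw disk numbers in the first table came from running fio directly against one of the 7200rpm drives at queue depth 1, varying bs between runs. Something along these lines (the device name is a placeholder - make sure the disk isn't in use, and stick to reads if it holds data):

  fio --name=disk-bench --filename=/dev/sdX --ioengine=libaio --direct=1 \
      --rw=randread --bs=64k --iodepth=1 --runtime=60 --time_based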
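The four test RBDs were created with different object sizes via the --order flag (object size = 2^order bytes); image names and sizes here are just examples:

  # images live in the EC pool; IO goes via the cache tier overlay
  rbd create -p ecpool --size 51200 --order 22 test-4m     # 4MB objects
  rbd create -p ecpool --size 51200 --order 21 test-2m     # 2MB objects
  rbd create -p ecpool --size 51200 --order 20 test-1m     # 1MB objects
  rbd create -p ecpool --size 51200 --order 19 test-512k   # 512KB objects

  # after filling the images with data, flush and evict everything from the cache tier
  rados -p cachepool cache-flush-evict-all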
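The 64KB random read figures came from fio's rbd ioengine, with a job along these lines (pool, image and client names are whatever you used above):

  [global]
  ioengine=rbd
  clientname=admin
  pool=ecpool
  rbdname=test-1m
  rw=randread
  bs=64k
  iodepth=1
  runtime=60
  time_based

  [randread-64k]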
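For the hot-spot testing I mentioned, the plan is to lean on fio's random_distribution option so that most IOs land on a small portion of the image; something like the job below, with the zipf theta value just a starting guess to tune later:

  [global]
  ioengine=rbd
  clientname=admin
  pool=ecpool
  rbdname=test-1m
  rw=randread
  bs=64k
  iodepth=1
  runtime=300
  time_based
  random_distribution=zipf:1.2   # skew most IOs towards a small hot set of blocks

  [hotspot-64k]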
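And the deadline scheduler tweak, applied per OSD data disk (the default read_expire is 500ms; sdX is a placeholder, and the setting needs re-applying after a reboot unless you persist it via udev or similar):

  echo deadline > /sys/block/sdX/queue/scheduler
  echo 30 > /sys/block/sdX/queue/iosched/read_expire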
If your EC pool holds data that is rarely accessed, or only does large block IOs, then the default object size probably won't have much of an impact. There is also the fact that you will now have 4 times more objects; I'm not sure what the impact of this is, maybe someone can comment?

If anyone notices that I have made any mistakes in my tests or assumptions, please let me know.

Nick