Hi all,

Apologies for the slow reply. I've been flat out lately, so any cluster work has been relegated to the back-burner. I'm only just starting to get back to it now.

On 06/06/14 01:00, Sage Weil wrote:
> On Thu, 5 Jun 2014, Wido den Hollander wrote:
>> On 06/05/2014 08:59 AM, Stuart Longland wrote:
>>> Hi all,
>>>
>>> I'm looking into other ways I can boost the performance of RBD devices
>>> on the cluster here and I happened to see these settings:
>>>
>>> http://ceph.com/docs/next/rbd/rbd-config-ref/
>>>
>>> A query, is it possible for the cache mentioned there to be paged out to
>>> swap residing on a SSD or is it purely RAM-only?
>
> Right now it is RAM only.
>
>>> I see mention of cache-tiers, but these will be at the wrong end of the
>>> Ethernet cable for my usage: I want the cache on the Ceph clients
>>> themselves not back at the OSDs.
>>
>> So you want this to serve as a read cache as well?

Yes, this is probably more important to my needs than the write cache. The disks in the storage (OSD+MON) nodes are fast enough; the problem seems to be the speed at which data can be shunted across the network. The storage nodes each have one gigabit NIC on the "server" network (exposed to clients) and one on a back-end "storage" network. I want to eventually put another two network cards in those boxes, but 2U-compatible cards aren't that common and the budget is not high. (10GbE can't come down in price fast enough either. AU$600 a card? Ouch!)

>> The librbd cache is mainly used as a write-cache for small writes, it's not
>> intended to be a large read cache.
>
> Right.  There was a blueprint describing a larger (shared) read cache that
> could be stored on a local SSD or file system, but it hasn't moved beyond
> the concept stage.
>
> http://wiki.ceph.com/Planning/Sideboard/rbd%3A_shared_read_cache

Ahh okay, so a future release. That document also answers another question I had, namely: "was the RAM cache shared between all RBDs on a client, or per-RBD?" The answer, of course, is that it's per RBD.

In the interests of science I did some testing over the last couple of days. When I deployed the cluster I used the (then latest) Emperor release. On Monday I updated to the Firefly release, checked everything over, then moved to Giant. So the storage nodes are now all running ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578). The OS there is Ubuntu 12.04 LTS.

I have my laptop plugged into the "client" network (so one router hop away) via its on-board gigabit interface, and decided to do some tests there with a KVM virtual machine. The host OS is Gentoo with ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) and QEMU 2.1.2. The machine itself is a Core i5 with 8GB RAM.

My VM had 256MB RAM, ran Debian Wheezy with two RBD "virtio" disks (an 8GB OS disk and an 80GB data disk), and I used bonnie++ on the data RBD formatted with XFS (default mkfs.xfs options). Each test was conducted by starting the VM, logging in, running bonnie++ (8GB file, specifying 256MB RAM, otherwise defaults), then powering off the VM before altering /etc/ceph/ceph.conf for the next test.
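For reference, the knobs I was varying live in the [client] section of /etc/ceph/ceph.conf on the VM host. I won't swear these are the exact lines verbatim, but they were along these lines (the values shown are for the 4GB write-through run; sizes are in bytes):

    [client]
        rbd cache = true
        rbd cache size = 4294967296     # 4GB (the default is 32MB)
        rbd cache max dirty = 0         # 0 = write-through, i.e. writeback disabled

The bonnie++ invocation inside the guest was roughly the following, where /mnt/data is just an illustrative mount point for the data RBD (add -u root if running it as root):

    bonnie++ -d /mnt/data -s 8192 -r 256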
With the stock Ceph cache settings (i.e. the 32MB RBD cache and default writeback threshold), I get the following from bonnie++:

> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> debian           8G  1631  81 11260   0  4353   0  2913  97 11046   1 112.7   2
> Latency              6564us     539ms     863ms   16660us     433ms     587ms
> Version  1.96       ------Sequential Create------ --------Random Create--------
> debian              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 11833  22 +++++ +++ 25112  44  9567  17 +++++ +++ 15944  28
> Latency               441ms     149us     138us     765ms      39us      97us
> 1.96,1.96,debian,1,1420488810,8G,,1631,81,11260,0,4353,0,2913,97,11046,1,112.7,2,16,,,,,11833,22,+++++,+++,25112,44,9567,17,+++++,+++,15944,28,6564us,539ms,863ms,16660us,433ms,587ms,441ms,149us,138us,765ms,39us,97us

If I disable writeback and up the RBD cache to 2GB, I get:

> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> debian           8G  1506  82 11225   0  4091   0  2096  51  9227   0 117.3   3
> Latency              8966us    2540ms    1554ms     472ms    2190ms     747ms
> Version  1.96       ------Sequential Create------ --------Random Create--------
> debian              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16  7795  21 +++++ +++ 28582  50 11150  26 +++++ +++ 22512  41
> Latency               441ms     102us     132us     441ms      40us     200us
> 1.96,1.96,debian,1,1420516613,8G,,1506,82,11225,0,4091,0,2096,51,9227,0,117.3,3,16,,,,,7795,21,+++++,+++,28582,50,11150,26,+++++,+++,22512,41,8966us,2540ms,1554ms,472ms,2190ms,747ms,441ms,102us,132us,441ms,40us,200us

A 4GB cache gives me:

> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> debian           8G  1631  81 11225   1  4310   0  1647  46 10896   1 115.6   3
> Latency              6839us    2223ms    1594ms     497ms     405ms     731ms
> Version  1.96       ------Sequential Create------ --------Random Create--------
> debian              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 10961  27 +++++ +++ 24383  41  5829  13 +++++ +++ 24075  45
> Latency               441ms     128us     149us     441ms     290us     205us
> 1.96,1.96,debian,1,1420502264,8G,,1631,81,11225,1,4310,0,1647,46,10896,1,115.6,3,16,,,,,10961,27,+++++,+++,24383,41,5829,13,+++++,+++,24075,45,6839us,2223ms,1594ms,497ms,405ms,731ms,441ms,128us,149us,441ms,290us,205us

Beyond that I start to hit diminishing returns; 8GB gives me:

> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> debian           8G  1386  83 11222   1  4295   0  3228  90 10703   1 101.8   2
> Latency              9308us    4024ms    1527ms   50255us     579ms     652ms
> Version  1.96       ------Sequential Create------ --------Random Create--------
> debian              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 10516  20 +++++ +++ 27863  46  7191  16 +++++ +++ 13528  25
> Latency               442ms     339us     220us    1107ms      63us      79us
> 1.96,1.96,debian,1,1420511573,8G,,1386,83,11222,1,4295,0,3228,90,10703,1,101.8,2,16,,,,,10516,20,+++++,+++,27863,46,7191,16,+++++,+++,13528,25,9308us,4024ms,1527ms,50255us,579ms,652ms,442ms,339us,220us,1107ms,63us,79us

I suspect that at 8GB, when librbd tries to allocate an 8GB chunk of RAM on a host that only has 8GB in total, the kernel tells it where to go, which is maybe why the sequential-input latency goes up for per-byte reads. 4GB appears to be a sweet spot with this set-up.

The actual VM hosts have 16GB RAM, so they are much better off there. They run dual-core Core i3s, the idea being that the machines are cheap enough to have a lot of them and we just run one or two VMs per host. They have 60GB SSDs which could theoretically be used for a local cache, but it seems I either put more effort into making the FlashCache+RBD solution work within OpenNebula, or I leave that for now and tune the Ceph client as best I can.

If I set the RBD cache at 4GB, that'd allow two RBDs per host and still leave 4GB over for each VM. It may also be possible to coax OpenNebula into emitting suitable QEMU/libvirt configuration to fine-tune that per RBD; I'll have to investigate.
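For what it's worth, the per-disk knobs I'd be looking at (and I stress this is a sketch of what I think should work, not something I've tried through OpenNebula yet) are the cache mode on the disk's <driver> element in the libvirt domain XML, and passing librbd options on the QEMU drive string. The pool/image name and monitor host below are placeholders:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source protocol='rbd' name='one/datastore-disk-0'>
        <host name='mon0' port='6789'/>
      </source>
      <target dev='vdb' bus='virtio'/>
    </disk>

or, at the QEMU level, something along the lines of:

    -drive format=raw,file=rbd:one/datastore-disk-0:rbd_cache=true:rbd_cache_size=4294967296,if=virtio,cache=writeback

where rbd_cache_size is in bytes (4294967296 = 4GB), so each disk could get its own cache size without touching ceph.conf.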
Regards,
-- 
     _ ___          Stuart Longland - Systems Engineer
\  /|_) |           T: +61 7 3535 9619
 \/ | \ |           38b Douglas Street      F: +61 7 3535 9699
   SYSTEMS          Milton QLD 4064         http://www.vrt.com.au

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com