Hi all,

Apologies for the slow reply. I've been flat out lately, so any cluster work has been relegated to the back-burner. I'm only just starting to get back to it now.

On 06/06/14 01:00, Sage Weil wrote:
> On Thu, 5 Jun 2014, Wido den Hollander wrote:
>> On 06/05/2014 08:59 AM, Stuart Longland wrote:
>>> Hi all,
>>>
>>> I'm looking into other ways I can boost the performance of RBD devices
>>> on the cluster here and I happened to see these settings:
>>>
>>> http://ceph.com/docs/next/rbd/rbd-config-ref/
>>>
>>> A query, is it possible for the cache mentioned there to be paged out to
>>> swap residing on a SSD or is it purely RAM-only?
>
> Right now it is RAM only.
>
>>> I see mention of cache-tiers, but these will be at the wrong end of the
>>> Ethernet cable for my usage: I want the cache on the Ceph clients
>>> themselves not back at the OSDs.
>>
>> So you want this to serve as a read cache as well?

Yes, this is probably more important to my needs than the write cache. The disks in the storage (OSD+MON) nodes are fast enough; the problem seems to be the speed at which data can be shunted across the network. The storage nodes each have one gigabit NIC on the "server" network (exposed to clients) and one on a back-end "storage" network. I want to eventually put another two network cards in those boxes, but 2U-compatible cards aren't that common and the budget is not high. (10GbE can't come down in price fast enough either. AU$600 a card? Ouch!)

>> The librbd cache is mainly used as a write-cache for small writes, it's not
>> intended to be a large read cache.
>
> Right.  There was a blueprint describing a larger (shared) read cache that
> could be stored on a local SSD or file system, but it hasn't moved beyond
> the concept stage.
>
> http://wiki.ceph.com/Planning/Sideboard/rbd%3A_shared_read_cache

Ahh okay, so a future release. That document also answers another question I had, namely: "was the RAM cache shared between all RBDs on a client, or per-RBD?" The answer, of course, is that it's per RBD.

In the interests of science I did some testing over the last couple of days. When I deployed the cluster I used the (then latest) Emperor release. On Monday I updated to the Firefly release, checked everything over, then moved to Giant. So the storage nodes are now all running ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578). The OS there is Ubuntu 12.04 LTS.

I have my laptop plugged into the "client" network (so one router hop away) via its on-board gigabit interface, and decided to do some tests there with a KVM virtual machine. The host OS is Gentoo with ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) and QEMU 2.1.2. The machine itself is a Core i5 with 8GB RAM.

My VM had 256MB RAM, ran Debian Wheezy with two RBD "virtio" disks (an 8GB OS disk and an 80GB data disk), and I used bonnie++ on the data RBD formatted with XFS (default mkfs.xfs options). Each test was conducted by starting the VM, logging in, running bonnie++ (8GB file, specifying 256MB RAM, otherwise defaults), then powering off the VM before altering /etc/ceph/ceph.conf for the next test.
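For reference, the knobs I was varying live in the [client] section of /etc/ceph/ceph.conf on the VM host. I won't swear these are the exact lines verbatim, but they were along these lines (the values shown are for the 4GB write-through run; sizes are in bytes):

    [client]
        rbd cache = true
        rbd cache size = 4294967296     # 4GB (the default is 32MB)
        rbd cache max dirty = 0         # 0 = write-through, i.e. writeback disabled

The bonnie++ invocation inside the guest was roughly the following, where /mnt/data is just an illustrative mount point for the data RBD (add -u root if running it as root):

    bonnie++ -d /mnt/data -s 8192 -r 256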
With the stock Ceph cache settings (i.e. the 32MB RBD cache and default writeback threshold), I get the following from bonnie++:

> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> debian           8G  1631  81 11260   0  4353   0  2913  97 11046   1 112.7   2
> Latency              6564us     539ms     863ms   16660us     433ms     587ms
> Version  1.96       ------Sequential Create------ --------Random Create--------
> debian              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 11833  22 +++++ +++ 25112  44  9567  17 +++++ +++ 15944  28
> Latency               441ms     149us     138us     765ms      39us      97us
> 1.96,1.96,debian,1,1420488810,8G,,1631,81,11260,0,4353,0,2913,97,11046,1,112.7,2,16,,,,,11833,22,+++++,+++,25112,44,9567,17,+++++,+++,15944,28,6564us,539ms,863ms,16660us,433ms,587ms,441ms,149us,138us,765ms,39us,97us

If I disable writeback and up the RBD cache to 2GB, I get:

> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> debian           8G  1506  82 11225   0  4091   0  2096  51  9227   0 117.3   3
> Latency              8966us    2540ms    1554ms     472ms    2190ms     747ms
> Version  1.96       ------Sequential Create------ --------Random Create--------
> debian              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16  7795  21 +++++ +++ 28582  50 11150  26 +++++ +++ 22512  41
> Latency               441ms     102us     132us     441ms      40us     200us
> 1.96,1.96,debian,1,1420516613,8G,,1506,82,11225,0,4091,0,2096,51,9227,0,117.3,3,16,,,,,7795,21,+++++,+++,28582,50,11150,26,+++++,+++,22512,41,8966us,2540ms,1554ms,472ms,2190ms,747ms,441ms,102us,132us,441ms,40us,200us

A 4GB cache gives me:

> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> debian           8G  1631  81 11225   1  4310   0  1647  46 10896   1 115.6   3
> Latency              6839us    2223ms    1594ms     497ms     405ms     731ms
> Version  1.96       ------Sequential Create------ --------Random Create--------
> debian              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 10961  27 +++++ +++ 24383  41  5829  13 +++++ +++ 24075  45
> Latency               441ms     128us     149us     441ms     290us     205us
> 1.96,1.96,debian,1,1420502264,8G,,1631,81,11225,1,4310,0,1647,46,10896,1,115.6,3,16,,,,,10961,27,+++++,+++,24383,41,5829,13,+++++,+++,24075,45,6839us,2223ms,1594ms,497ms,405ms,731ms,441ms,128us,149us,441ms,290us,205us

Beyond that I start to hit diminishing returns; 8GB gives me:

> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> debian           8G  1386  83 11222   1  4295   0  3228  90 10703   1 101.8   2
> Latency              9308us    4024ms    1527ms   50255us     579ms     652ms
> Version  1.96       ------Sequential Create------ --------Random Create--------
> debian              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 10516  20 +++++ +++ 27863  46  7191  16 +++++ +++ 13528  25
> Latency               442ms     339us     220us    1107ms      63us      79us
> 1.96,1.96,debian,1,1420511573,8G,,1386,83,11222,1,4295,0,3228,90,10703,1,101.8,2,16,,,,,10516,20,+++++,+++,27863,46,7191,16,+++++,+++,13528,25,9308us,4024ms,1527ms,50255us,579ms,652ms,442ms,339us,220us,1107ms,63us,79us

I suspect that at 8GB, when librbd tries to allocate an 8GB chunk of RAM on a host that only has 8GB in total, the kernel tells it where to go, which is maybe why the sequential-input latency goes up for per-byte reads. 4GB appears to be a sweet spot with this set-up.

The actual VM hosts have 16GB RAM, so they are much better off there. They run dual-core Core i3s, the idea being that the machines are cheap enough to have a lot of them and we just run one or two VMs per host. They have 60GB SSDs which could theoretically be used for a local cache, but it seems I either put more effort into making the FlashCache+RBD solution work within OpenNebula, or I leave that for now and tune the Ceph client as best I can.

If I set the RBD cache at 4GB, that'd allow two RBDs per host and still leave 4GB over for each VM. It may also be possible to coax OpenNebula into emitting suitable QEMU/libvirt configuration to fine-tune that per RBD; I'll have to investigate.
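For what it's worth, the per-disk knobs I'd be looking at (and I stress this is a sketch of what I think should work, not something I've tried through OpenNebula yet) are the cache mode on the disk's <driver> element in the libvirt domain XML, and passing librbd options on the QEMU drive string. The pool/image name and monitor host below are placeholders:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source protocol='rbd' name='one/datastore-disk-0'>
        <host name='mon0' port='6789'/>
      </source>
      <target dev='vdb' bus='virtio'/>
    </disk>

or, at the QEMU level, something along the lines of:

    -drive format=raw,file=rbd:one/datastore-disk-0:rbd_cache=true:rbd_cache_size=4294967296,if=virtio,cache=writeback

where rbd_cache_size is in bytes (4294967296 = 4GB), so each disk could get its own cache size without touching ceph.conf.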
Regards,
-- 
     _ ___          Stuart Longland - Systems Engineer
\  /|_) |           T: +61 7 3535 9619
 \/ | \ |           38b Douglas Street      F: +61 7 3535 9699
   SYSTEMS          Milton QLD 4064         http://www.vrt.com.au

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com