Hello,

just to throw some hard numbers into the ring, I've (very much STRESS) tested readproxy vs. readforward, with more or less the expected results.

New Jewel cluster, 3 cache-tier nodes (5 SSD OSDs each), 3 HDD nodes, IPoIB network. Notably 2x E5-2623 v3 @ 3.00GHz in the cache-tier nodes.

2 VMs (on different compute nodes, though neither network nor CPU was a bottleneck there), running fio. After creating a 12GB fio file on each and filling it, the cache was flushed/evicted. Then it was filled again for one VM by running the write fio again, while reading on the other VM, resulting in hot pagecaches, slab, etc. on all 6 OSD nodes.

Fio command line:
---
fio --size=12G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32
---
With randread of course on the other VM.

Solo performance (with readproxy or readforward, no difference):
RandRead:  25k IOPS (these all go to the HDD nodes)
RandWrite: 21k IOPS (these all go to the SSD cache-tier nodes)

Note that during randwrites the cache-tier nodes are using about 80-90% of their CPU, the OSD processes eating more than 300% each (of 1600 total). Neither the SSDs nor the network are maxed out, the latter far, far from it.

Concurrent performance with readproxy:
RandRead:   6k IOPS (nearly no idle CPU left on the cache-tier nodes)
RandWrite: 20k IOPS

Concurrent performance with readforward:
RandRead:  14k IOPS
RandWrite: 20k IOPS

So writes seem to be not particularly impacted by the forwarding/proxying going on in parallel. And while readforward still suffers from what I assume is CPU contention on the cache-tier nodes, it is unsurprisingly more than twice as fast as readproxy. Too bad that it will still eat your babies, supposedly.

And again, no, the network is not the bottleneck.
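For anyone wanting to repeat this: the reader VM ran the identical fio job, just with --rw=randread, and the flush/evict and mode switches between runs were done roughly as below ("cache" stands in for my actual cache pool name here):
---
# reader VM: same job as above, reads instead of writes
fio --size=12G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randread --name=fiojob --blocksize=4K --iodepth=32

# flush/evict everything from the cache pool between runs
rados -p cache cache-flush-evict-all

# switch the mode under test; readforward requires the force flag,
# as per the EPERM in the quoted thread below
ceph osd tier cache-mode cache readproxy
ceph osd tier cache-mode cache readforward --yes-i-really-mean-it
---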
Christian

On Fri, 9 Jun 2017 11:45:46 +0900 Christian Balzer wrote:

> On Thu, 8 Jun 2017 07:06:04 -0400 Alfredo Deza wrote:
>
> > On Thu, Jun 8, 2017 at 3:38 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> > > On Thu, 8 Jun 2017 17:03:15 +1000 Brad Hubbard wrote:
> > >
> > >> On Thu, Jun 8, 2017 at 3:47 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> > >> > On Thu, 8 Jun 2017 15:29:05 +1000 Brad Hubbard wrote:
> > >> >
> > >> >> On Thu, Jun 8, 2017 at 3:10 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> > >> >> > On Thu, 8 Jun 2017 14:21:43 +1000 Brad Hubbard wrote:
> > >> >> >
> > >> >> >> On Thu, Jun 8, 2017 at 1:06 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> > >> >> >> >
> > >> >> >> > Hello,
> > >> >> >> >
> > >> >> >> > New cluster, Jewel, setting up cache-tiering:
> > >> >> >> > ---
> > >> >> >> > Error EPERM: 'readforward' is not a well-supported cache mode and may corrupt your data. pass --yes-i-really-mean-it to force.
> > >> >> >> > ---
> > >> >> >> >
> > >> >> >> > That's new and certainly wasn't there in Hammer, nor did it whine
> > >> >> >> > about this when upgrading my test cluster to Jewel.
> > >> >> >> >
> > >> >> >> > And speaking of whining, I did that about this and readproxy, though
> > >> >> >> > not about their stability (readforward has been working flawlessly
> > >> >> >> > for nearly a year in the test cluster) but about their lack of
> > >> >> >> > documentation.
> > >> >> >> >
> > >> >> >> > So while of course there is no warranty for anything with OSS, is
> > >> >> >> > there any real reason for the above scaremongering or is that based
> > >> >> >> > solely on lack of testing/experience?
> > >> >> >>
> > >> >> >> https://github.com/ceph/ceph/pull/8210 and
> > >> >> >> https://github.com/ceph/ceph/pull/8210/commits/90fe8e3d0b1ded6d14a6a43ecbd6c8634f691fbe
> > >> >> >> may offer some insight.
> > >> >> >>
> > >> >> > They do, though of course they immediately raise the following questions:
> > >> >> >
> > >> >> > 1. Where is that mode documented?
> > >> >>
> > >> >> It *was* documented by
> > >> >> https://github.com/ceph/ceph/pull/7023/commits/d821acada39937b9dacf87614c924114adea8a58
> > >> >> in https://github.com/ceph/ceph/pull/7023 but was removed by
> > >> >> https://github.com/ceph/ceph/commit/6b6b38163b7742d97d21457cf38bdcc9bde5ae1a
> > >> >> in https://github.com/ceph/ceph/pull/9070
> > >> >>
> > >> >
> > >> > I was talking about proxy, which isn't AFAICT, nor is there a BIG bold red
> > >>
> > >> That was hard to follow for me, in a thread titled "Cache mode
> > >> readforward mode will eat your babies?".
> > >>
> > > Context: the initial github bits talk about proxy.
> > >
> > > Anyway, the documentation is in utter shambles and wrong, and this
> > > really should have been mentioned more clearly in the release notes, but
> > > then again none of the other cache changes were, never mind the wrong
> > > osd_tier_promote_max* defaults.
> > >
> > > So for the record:
> > >
> > > The readproxy mode does what the old documentation states and proxies
> > > objects through the cache-tier when they are read, w/o promoting them[*],
> > > while written objects will go into the cache-tier as usual and at the
> > > configured rate.
> > >
> > > [*]
> > > Pro tip: It does however do the silent 0-byte object creation for reads,
> > > so your cache-tier storage performance will be somewhat impacted, in
> > > addition to the CPU usage there that readforward would also have avoided.
> > > This is important when considering the value for "target_max_objects", as a
> > > writeback-mode cache will likely evict things based on space used and
> > > reach a natural upper object limit.
> > > For example, an existing cache-tier in writeback mode here has a 2GB size
> > > and 560K objects, with 13.4TB and 3.6M objects on the backing storage.
> > > With readproxy and a similarly sized cluster I'll be setting
> > > "target_max_objects" to something around 2M to avoid needless eviction and
> > > then re-creation of null objects when things are read.
> >
> > Thank you for taking the time to explain this on the mailing list;
> > could you help us by submitting a pull request with this
> > documentation addition?
> >
>
> I'll review that whole page again, it's riddled with stuff.
> Like the eviction settings suddenly talking about flushing,
> which doesn't help when most people are confused by those two things
> initially anyway.
>
> Christian
>
> > I would be happy to review and merge.
> >
> > >
> > > Christian
> > >
> > >> > statement in the release notes (or docs) for everybody to switch from
> > >> > (read)forward to (read)proxy.
> > >> >
> > >> > And the two bits up there have _very_ conflicting statements about what
> > >> > readproxy does: the older one would do what I want (at the cost of
> > >> > shuffling everything through the cache-tier network pipes), the newer one
> > >> > seems to actually describe the proxy functionality (no new objects, i.e.
> > >> > from writes, being added).
> > >> >
> > >> > I'll be ready to play with my new cluster in a bit and shall investigate
> > >> > what actually does what.
> > >> >
> > >> > Christian
> > >> >
> > >> >> HTH.
> > >> >>
> > >> >> >
> > >> >> > 2. The release notes aren't any particular help there either, and the
> > >> >> > issues/PRs talk about forward, not readforward, as the culprit.
> > >> >> >
> > >> >> > 3. What I can glean from the bits I found: proxy just replaces the
> > >> >> > forward functionality. Alas, what I'm after is a mode that will not
> > >> >> > promote reads to the cache, aka readforward. Or another set of
> > >> >> > parameters that will produce the same results.
> > >> >> >
> > >> >> > Christian
> > >> >> >
> > >> >> >> >
> > >> >> >> > Christian
> > >> >> >> > --
> > >> >> >> > Christian Balzer        Network/Systems Engineer
> > >> >> >> > chibi@xxxxxxx           Rakuten Communications
> > >> >> >> > _______________________________________________
> > >> >> >> > ceph-users mailing list
> > >> >> >> > ceph-users@xxxxxxxxxxxxxx
> > >> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >> >> >>
> > >> >> >
> > >> >> > --
> > >> >> > Christian Balzer        Network/Systems Engineer
> > >> >> > chibi@xxxxxxx           Rakuten Communications
> > >> >>
> > >> >
> > >> > --
> > >> > Christian Balzer        Network/Systems Engineer
> > >> > chibi@xxxxxxx           Rakuten Communications
> > >>
> > >
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > chibi@xxxxxxx           Rakuten Communications
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com