Hello Nick,

On Wed, 4 Mar 2015 08:49:22 -0000 Nick Fisk wrote:

> Hi Christian,
>
> Yes, that's correct, it's on the client side. I don't see this as much
> different to a battery-backed RAID controller: if you lose power, the
> data is in the cache until power resumes, when it is flushed.
>
> If you are going to have the same RBD accessed by multiple
> servers/clients then you need to make sure the SSD is accessible to
> both (e.g. DRBD / dual-port SAS). But then something like Pacemaker
> would be responsible for ensuring the RBD and cache device are both
> present before allowing client access.
>
Which is pretty much any and all use cases I can think of. Because it's
not only about concurrent (active/active) access; you really need to
have things consistent across all possible client hosts in case of a
node failure.

I'm no stranger to DRBD and Pacemaker (which incidentally didn't make it
into Debian Jessie, cue massive laughter and ridicule), btw.

> When I wrote this I was thinking more about 2 HA iSCSI servers with
> RBDs; however, I can understand that this feature would prove more of
> a challenge if you are using Qemu and RBD.
>
One of the reasons I'm using Ceph/RBD instead of DRBD (which is vastly
more suited for some use cases) is that it allows me n+1 instead of n+n
redundancy when it comes to consumers (compute nodes in my case).

Now for your iSCSI head (looking forward to your results and any config
recipes) that limitation to a pair may be just as well, but as others
wrote it might be best to go forward with this outside of Ceph.
Especially since you're already dealing with an HA cluster/Pacemaker in
that scenario.

Christian

> Nick
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> Of Christian Balzer
> Sent: 04 March 2015 08:40
> To: ceph-users@xxxxxxxxxxxxxx
> Cc: Nick Fisk
> Subject: Re: Persistent Write Back Cache
>
> Hello,
>
> If I understand you correctly, you're talking about the rbd cache on
> the client side.
>
> So assume that host, or the cache SSD in it, fails terminally.
> The client thinks its sync'ed writes are on the permanent storage (the
> actual Ceph storage cluster), while they are only present locally.
>
> So restarting that service or VM on a different host now has to deal
> with likely crippling data corruption.
>
> Regards,
>
> Christian
>
> On Wed, 4 Mar 2015 08:26:52 -0000 Nick Fisk wrote:
>
> > Hi All,
> >
> > Is there anything in the pipeline to add the ability to write the
> > librbd cache to SSD so that it can safely ignore sync requests? I
> > have seen a thread a few years back where Sage was discussing
> > something similar, but I can't find anything more recent discussing
> > it.
> >
> > I've been running lots of tests on our new cluster; buffered/parallel
> > performance is amazing (40K read / 10K write IOPS), very impressed.
> > However, sync writes are actually quite disappointing.
> >
> > Running fio with a 128k block size and depth=1 normally only gives me
> > about 300 IOPS or 30MB/s. I'm seeing 2-3ms latency writing to SSD
> > OSDs, and from what I hear that's about normal, so I don't think I
> > have a Ceph config problem. For applications which do a lot of syncs,
> > like ESXi over iSCSI or SQL databases, this has a major performance
> > impact.
> >
> > Traditional storage arrays work around this problem by having a
> > battery-backed cache which has latency 10-100 times less than what
> > you can currently achieve with Ceph and an SSD.
> > Whilst librbd does have a writeback cache, from what I understand it
> > will not cache syncs, and so in my usage case it effectively acts
> > like a write-through cache.
> >
> > To illustrate the difference a proper writeback cache can make, I put
> > a 1GB (512MB dirty threshold) flashcache in front of my RBD and
> > tweaked the flush parameters to flush dirty blocks at a large queue
> > depth. The same fio test (128k, iodepth=1) now runs at 120MB/s and is
> > limited by the performance of the SSD used by flashcache, as
> > everything is stored as 4k blocks on the SSD. In fact, since
> > everything is stored as 4k blocks, pretty much all IO sizes are
> > accelerated to the max speed of the SSD. Looking at iostat I can see
> > all the IOs are getting coalesced into nice large 512kb IOs at a high
> > queue depth, which Ceph easily swallows.
> >
> > If librbd could support writing its cache out to SSD, it would
> > hopefully achieve the same level of performance, and having it
> > integrated would be really neat.
> >
> > Nick
> >

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
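
For anyone wanting to reproduce the kind of sync-write test discussed in
this thread, a minimal sketch follows. It assumes a kernel-mapped RBD at
/dev/rbd0 (a placeholder, not a path taken from the thread) and
approximates the 128k, queue-depth-1 sync workload Nick describes; the
job name and runtime are likewise illustrative assumptions.

    # 128k sync writes at queue depth 1 against a mapped RBD.
    # /dev/rbd0 is an assumed device name; point it at your own mapping.
    fio --name=rbd-sync-write --filename=/dev/rbd0 \
        --ioengine=libaio --direct=1 --sync=1 \
        --rw=write --bs=128k --iodepth=1 \
        --runtime=60 --time_based

Because each write is issued with O_SYNC at a queue depth of one, the run
is bounded by per-write latency to the OSDs rather than by bandwidth,
which is why 2-3ms per write caps it at a few hundred IOPS.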
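
The flashcache experiment from the original post could be approximated
along the lines below. Device names are placeholders, and the writeback
tunables (whose names and defaults vary between flashcache releases) are
assumptions chosen to mirror the described setup: a 1GB cache, roughly
half of it allowed to sit dirty, and more aggressive cleaning so dirty
blocks are flushed at a higher queue depth.

    # Create a writeback ("-p back") flashcache device named rbd_wb on an
    # SSD partition in front of the RBD; 1GB cache as described in the
    # post. /dev/sdb1 and /dev/rbd0 are assumed device names.
    flashcache_create -p back -s 1g rbd_wb /dev/sdb1 /dev/rbd0

    # Let roughly half the cache sit dirty (~512MB of 1GB) and clean it
    # with more IOs in flight; exact sysctl names depend on the
    # flashcache version in use.
    sysctl -w dev.flashcache.sdb1+rbd0.dirty_thresh_pct=50
    sysctl -w dev.flashcache.sdb1+rbd0.max_clean_ios_set=32
    sysctl -w dev.flashcache.sdb1+rbd0.max_clean_ios_total=128

    # Re-run the same sync-write fio job against the cached device.
    fio --name=rbd-sync-write-cached --filename=/dev/mapper/rbd_wb \
        --ioengine=libaio --direct=1 --sync=1 \
        --rw=write --bs=128k --iodepth=1 \
        --runtime=60 --time_based

The caveat Christian raises still applies, of course: until the cleaner
writes them back to the cluster, the dirty blocks exist only on that one
client's SSD, so losing the host or the SSD loses acknowledged writes.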