Re: Periodic evicting & flushing

Christian Balzer <chibi@xxxxxxx> · Thu, 24 Mar 2016 12:39:19 +0900

Hello,

On Wed, 23 Mar 2016 04:46:50 -0400 Maran wrote:

> Hey,
> 
> -------- Original Message --------
> Hello,
> 
> On Wed, 23 Mar 2016 03:24:33 -0400 Maran wrote:
> 
> >> Hey,
> >>
> >> Thanks for the prompt response.
> >>
> >> Let me put some inline comments.
> >>
> >Those are much more readable if properly indented/quoted by ye olde ">".
> 
> My apologies, I'm using a new webservice that sadly is forcing me to use
> a WYSIWYG editor that I can't turn off for the moment. I'm doing some
> copy-pasting this time around to remove as much of the HTML as I can.
> Hope this reads better.
> 
> >> Hello,
> >>
> >> On Tue, 22 Mar 2016 12:28:22 -0400 Maran wrote:
> >>
> > >> Hey guys,
> > >>
> > >> I'm trying to wrap my head around the Ceph Cache Tiering to discover
> > >> if what I want is achievable.
> > >>
> > >> My cluster consists of 6 OSD nodes with normal HDDs and one cache tier
> > >> of SSDs.
> > >
> >> One cache tier being what, one node?
> >> That's a SPOF and disaster waiting to happen.
> >> Thanks for the heads-up. It's just a test cluster at this point and a
> >> second cache node should be added soon.
> >>
> > While you certainly can get away with a replication size of 2 with
> > reliable and well monitored (SMART, media wearout) SSDs, keep in mind
> > that all your critical, most updated data will be in the cache and only
> > the cache.
> > I got a dual node cache tier but would feel more comfortable if it was
> > a triple one.
> This is another reason I would love to flush files more often. Once
> files have been written there is usually quite a long time where files
> won't be retrieved again, and if they are to be retrieved it doesn't
> have to be very fast in most cases. In these instances the 'penalty' of
> retrieving it from the slow pool is worth it. The cluster should always
> be optimised for fast writing.
> 
This ("write once") very much depends on your use case really.
Also always keep in mind that your files on the client side are really
just a collection of (by default) 4MB blobs aka objects to Ceph.
And it is these which get promoted into and flushed/evicted from a cache
tier.
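
To make that granularity concrete, here is a minimal Python sketch. It
assumes the default 4MB object size and ignores any custom striping
settings; the function name is purely illustrative and not part of Ceph.

```python
import math

OBJECT_SIZE = 4 * 1024 * 1024  # assumed default object size (4MB)

def objects_for_file(file_bytes, object_size=OBJECT_SIZE):
    """Rough number of data objects a client-side file is split into."""
    if file_bytes <= 0:
        return 0
    # Every started chunk of object_size bytes needs its own object.
    return math.ceil(file_bytes / object_size)
```

A 10MB file would thus span 3 objects, and each of those is promoted,
flushed or evicted on its own.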

At the very least the objects containing busy inodes (client side
directories that are frequently written to) will always be in the cache,
and sufficiently hot to not be flushed or evicted.

This becomes even more pronounced when looking at databases or status/data
files that get updated often.

That being said, there is a proxy mode called "readforward" which does
exactly what the name implies, forwarding ALL reads to objects not already
in the cache pool to the base pool. 
However this mode isn't really documented and thus likely not tested well
either. 
It certainly seems to work fine, but I have hardly run a conclusive set of
tests with it.

> >>
> >> Also the usual (so we're not comparing apples with oranges), as in
> >> what types of SSDs, OS, Ceph versions, network, everything.
> >> SSDs are Samsung MZ7KM120HAFD. Ceph 9.2.1. 10Gbit cluster and a 10Gbit
> >> public network. HDDs are consumer grade 3TB disks.
> >>
> >If that's a test cluster you may as well do an early adoption of Jewel
> >when it's released.
> Since this is my first time playing with Ceph I wanted to get familiar
> with the usual commands first before I took a dive into unstable
> branches. I already made my whole cluster wipe itself out playing around
> with different erasure profiles and this was on a stable branch. Next
> phase I will try out if I can make it crash again and submit a proper
> issue report, but that's a totally different topic ;)
> 
Jewel, when released, will be the next long-term stable branch, so it
would be a good starting point unless you're planning to deploy something in
production next week.

> >Those SSDs should be fine, I just don't have any first hand experience
> >with them.
> Since we are on the topic anyway, what would you recommend for SSD?
> 
Ones that people have tested, there are dozens of threads in this ML,
read a few like the "List of SSDs" one.

Especially when it comes to SYNC writes (journal), where only DC level
SSDs tend to pass muster.

I'm using exclusively Intel DC S3700 or S3610 at this point, but will look
into Samsung in the future as well.

> >
> > > >What I would love is that Ceph flushes and evicts data as soon as a
> > > >file hasn't been requested by a client in a certain timeframe, even
> > > >if there is enough space to keep it there longer. The reason I would
> > > >prefer this is that I have a feeling overall performance suffers if
> > > >new writes are coming into the cache tier while at the same time
> > > >flush and evicts are happening.
> > >
> > >You will want to read my recent thread titled
> > >"Cache tier operation clarifications"
> >
> > >where I asked for something along those lines.
> >
> > >The best thing you could do right now and which I'm planning to do if
> > >flushing (evictions should be very light impact wise) turns out to be
> > >detrimental performance wise is to lower the ratios at low utilization
> > >times and raise them again for peak times.
> > >Again, read the thread above.
> > >I've read your post, quite an interesting read, thanks for sharing
> > >your findings. One thing that wasn't clear to me is why you would
> > >prefer setting the ratio instead of forcing it with `rados -p cache
> > >cache-try-flush-evict-all`. Is there a difference between the two ways
> > >of flushing/evicting?
> >
> > I would definitely NOT want to use that command as it does exactly
> > what its name implies, flush AND evict.
> > Flushes write dirty data back to the base pool, they can cause a lot
> > of I/O and data transfer, so I would like to time them accordingly.
> 
> > Eviction removes CLEAN (already flushed) objects from the cache (to
> > keep enough space free normally with the ratios set).
> > That's just shortening the object to 0 bytes, a very low impact action.
> > But you do NOT want to evict objects unless required, as they may
> > become hot again.
> 
> > That's why I asked for a "cache-try-flush" like command, preferably
> > with a target ratio of dirty objects and an IO priority setting as
> > well.
> Very good point, I didn't fully realise this.
> 
> >
> > > >It also seems that for some reason my cache node is not using the
> > > >cluster network as much as I expected. Where all HDD nodes are using
> > > >the cluster network to the fullest (multiple TBs) my SSD node only
> > > >used 1GB on the cluster network. Is there any way to diagnose this
> > > > problem or is this intended behaviour? I expected the flushes to
> > > > happen over the cluster network.
> > >
> > > That is to be expected, as the cache tier is a client from the Ceph
> > > perspective.
> >
> > > Unfortunate, but AFAIK there are no plans to change this behavior.
> > > > This could explain the drop in performance I'm seeing then; I might
> > > > bond the NICs in that case to get some more bandwidth.
> >
> > Define drop in performance.
> It 'seems', but I need to put more tests into this, that when flushing
> files the overall write speed of files coming into the cluster goes
> down. Which would make sense since fewer IOPS are available. The problem
> is that in theory it seems that this is the expected behaviour. It fills
> the cache until the cache_target_dirty_ratio and then every new write
> also forces a flush to the slower disks. So in theory this is what your
> cluster will be doing most of the time unless you start bringing the
> ratio down during quiet times, like you suggested.
> 

This would seem to be the case in a synthetic test where you're writing
fresh data to the cluster all the time.
As I wrote above, that's not what I'm seeing in my use case, where the
vast majority of writes happen to the same small number of objects over
and over again.

And since we're dealing with 4MB of data each time an object gets promoted
or flushed, that's impacting things more than the potentially tiny amount
of data that was read/written by the client side.
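
As a rough illustration of that effect, a small Python sketch, again
assuming 4MB objects and that a whole object moves between the tiers on
each promotion or flush. The helper name is made up for this example.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed default object size (4MB)

def tier_amplification(client_bytes, objects_touched=1, object_size=OBJECT_SIZE):
    """Bytes moved between tiers divided by bytes the client read/wrote."""
    if client_bytes <= 0:
        raise ValueError("client_bytes must be positive")
    return (objects_touched * object_size) / client_bytes
```

A single 4KB client write into an object that has to be promoted or
flushed gives tier_amplification(4096) == 1024.0, i.e. roughly a
thousand times more data moved than the client actually touched.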

But yes, having better control over flushes is something that's
definitely desirable.
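
Until then, the workaround of lowering the dirty ratio during quiet hours
and raising it for peak times could be scripted along these lines. This
is only a sketch: the quiet window (01:00 to 05:00), the two ratio values
and the pool name "cache" are assumptions to adapt, and the script merely
builds the "ceph osd pool set" command line rather than running it.

```python
from datetime import time

def dirty_ratio_for(now, quiet_start=time(1, 0), quiet_end=time(5, 0),
                    quiet_ratio=0.2, peak_ratio=0.6):
    """Pick a cache_target_dirty_ratio for the given time of day.

    Assumes the quiet window does not wrap around midnight.
    """
    in_quiet_window = quiet_start <= now < quiet_end
    return quiet_ratio if in_quiet_window else peak_ratio

def build_command(pool, ratio):
    """Return the argv for setting the dirty ratio on a cache pool."""
    return ["ceph", "osd", "pool", "set", pool,
            "cache_target_dirty_ratio", str(ratio)]
```

Run from cron, something like
build_command("cache", dirty_ratio_for(datetime.now().time())) could be
handed to subprocess.run() (with datetime imported), so that most of the
flushing happens while the cluster is idle.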

> > Note that with a cache tier in normal (writeback) mode all traffic will
> > have to go through the cache node(s).
> > So instead in your case of writing or reading (should be more visible
> > here) to 6 nodes, you just interact with 1.
> > Of course bandwidth is for most use cases the thing you will need the
> > least, as opposed to many, fast IOPS.
> 
> > I'm happy with a single network for client and cluster traffic, but
> > that all depends on your needs (client traffic) and abilities of your
> > storage nodes.
> 
> 
> >
> > >> I appreciate any pointers you might have for me.
> > >
> > >You will also want to definitely read the recent thread titled
> > >"data corruption with hammer".
> > >Not sure this is relevant for the version I'm running?
> > >
> > Nope, but Infernalis breaks things with EC pools at least as well,
> > AFAIK. Jewel should be fine.
> 
> >> One other question I have is would it make sense to run RAID-0 for
> >> improved write performance if that's something I value over more OSDs
> >> per node?
> >>
> > Aside from RAID0 not providing performance benefits (except maybe
> > for large, sequential writes) over individual OSDs, it also introduces
> > another SPOF or at least a significantly larger impact in case of an
> > SSD failure.
> 
> Understood, I will forgo this plan. Any other tips that could help bring
> up write speeds?
> 
With speed you seem to mean only bandwidth, not IOPS. 
While I don't know what you plan to do with Ceph, typically client traffic
(from VMs) will hit the limits of your hardware when it comes to IOPS long
before it runs out of bandwidth. 
Of course it's nice to have sufficient bandwidth when there are peak
demands, but given the choice (by your budget) to have something that can
write "fast enough" with ample IOPS versus something that can write super
fast but with low IOPS I know what I would choose.

Anyway, what speeds are you seeing (how are you testing)?
And what speeds would you expect given your HW?

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com