Re: Periodic evicting & flushing

Hello,

On Wed, 23 Mar 2016 03:24:33 -0400 Maran wrote:

> Hey,
> 
> Thanks for the prompt response.
> 
> Let me put some inline comments.
>
Those are much more readable if properly indented/quoted by ye olde ">".
 
> > Hello,
> > 
> > On Tue, 22 Mar 2016 12:28:22 -0400 Maran wrote:
> > 
> > > Hey guys,
> > >
> > > I'm trying to wrap my head around Ceph Cache Tiering to discover if
> > > what I want is achievable.
> > >
> > > My cluster consists of 6 OSD nodes with normal HDDs and one cache
> > > tier of SSDs.
> > >
> > One cache tier being what, one node?
> > That's a SPOF and a disaster waiting to happen.
> Thanks for the heads-up. It's just a test cluster at this point and a
> second cache node should be added soon.
>
While you certainly can get away with a replication size of 2 with
reliable and well-monitored (SMART, media wearout) SSDs, keep in mind
that all your critical, most recently updated data will be in the cache
and only in the cache.
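For the monitoring bit, something along these lines is a good start
(device name is an example; exact attribute names vary by SSD model):

  # dump SMART attributes, look for wearout/reallocation indicators
  smartctl -A /dev/sdX | egrep -i 'wear|realloc|media'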
I have a dual-node cache tier, but would feel more comfortable if it
were a triple one.
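For reference, going to 3x replication is a quick change (assuming your
cache pool is actually named "cache"):

  # raise replication of the cache pool to 3, keep serving I/O with 2
  ceph osd pool set cache size 3
  ceph osd pool set cache min_size 2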
 
> 
> > Also the usual (so we're not comparing apples with oranges), as in
> > what types of SSDs, OS, Ceph versions, network, everything.
> SSDs are Samsung MZ7KM120HAFD. Ceph 9.2.1. 10Gbit cluster and a 10Gbit
> public network. HDDs are consumer grade 3TB disks.
>
If that's a test cluster, you may as well adopt Jewel early when it's
released.
Those SSDs should be fine, I just don't have any first-hand experience
with them.
 
> 
> > > What I would love is that Ceph flushes and evicts data as soon as a
> > > file hasn't been requested by a client in a certain timeframe, even
> > > if there is enough space to keep it there longer. The reason I would
> > > prefer this is that I have a feeling overall performance suffers if
> > > new writes are coming into the cache tier while flushes and evicts
> > > are happening at the same time.
> > >
> > You will want to read my recent thread titled
> > "Cache tier operation clarifications"
> > where I asked for something along those lines.
> > 
> > The best thing you could do right now, and what I'm planning to do if
> > flushing turns out to be detrimental to performance (evictions should
> > have a very light impact), is to lower the ratios at low-utilization
> > times and raise them again for peak times.
> > Again, read the thread above.
> I've read your post, quite an interesting read, thanks for sharing your
> findings. One thing that wasn't clear to me is why you would prefer
> setting the ratio instead of forcing it with `rados -p cache
> cache-try-flush-evict-all`. Is there a difference between the two ways
> of flushing/evicting?
> 
I would definitely NOT want to use that command, as it does exactly what
its name implies: flush AND evict.
Flushes write dirty data back to the base pool; they can cause a lot of
I/O and data transfer, so I would like to time them accordingly.
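For example (pool name and ratios are just illustrations; pick values
matching your utilization pattern):

  # off-peak: lower the dirty ratio so flushing happens now
  ceph osd pool set cache cache_target_dirty_ratio 0.4
  # before peak: raise it again so writes aren't flushed mid-peak
  ceph osd pool set cache cache_target_dirty_ratio 0.6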

Eviction removes CLEAN (already flushed) objects from the cache,
normally to keep enough space free per the ratios set.
That's just shortening the object to 0 bytes, a very low impact action.
But you do NOT want to evict objects unless required, as they may become
hot again.
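If premature evictions are a concern, IIRC there is a knob for that as
well (value is just an example):

  # don't consider objects for eviction until they are 30 minutes old
  ceph osd pool set cache cache_min_evict_age 1800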

That's why I asked for a "cache-try-flush"-like command, preferably with
a target ratio of dirty objects and an IO priority setting as well.
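Until something like that exists, the closest approximation I can think
of is flushing object by object with the existing per-object command
(a crude, untested sketch; no rate limiting or IO priority):

  # try to flush every object in the cache pool without evicting it
  rados -p cache ls | while read obj; do
      rados -p cache cache-try-flush "$obj"
  done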

> 
> > > It also seems that for some reason my cache node is not using the
> > > cluster network as much as I expected. Where all HDD nodes are using
> > > the cluster network to the fullest (multiple TBs), my SSD node only
> > > used 1GB on the cluster network. Is there any way to diagnose this
> > > problem, or is this intended behaviour? I expected the flushes to
> > > happen over the cluster network.
> > >
> > That is to be expected, as the cache tier is a client from the Ceph
> > perspective.
> > 
> > Unfortunate, but AFAIK there are no plans to change this behavior.
> This could explain the drop in performance I'm seeing then; I might
> bond the NICs in that case to get some more bandwidth.
> 
Define "drop in performance".
Note that with a cache tier in normal (writeback) mode, all traffic has
to go through the cache node(s).
So instead of writing to or reading from (the latter should be more
visible here) 6 nodes, in your case you just interact with 1.
Of course, for most use cases bandwidth is the thing you will need
least, as opposed to many, fast IOPS.

I'm happy with a single network for client and cluster traffic, but that
all depends on your needs (client traffic) and the abilities of your
storage nodes.


> 
> > > I appreciate any pointers you might have for me.
> > >
> > You will also definitely want to read the recent thread titled
> > "data corruption with hammer".
> Not sure this is relevant for the version I'm running?
> 
Nope, but AFAIK Infernalis breaks things as well, at least with EC
pools.
Jewel should be fine.

> One other question I have: would it make sense to run RAID-0 for
> improved write performance, if that's something I value over more OSDs
> per node?
> 
Aside from RAID0 not having any performance benefits over individual
OSDs (except maybe for large, sequential writes), it also introduces
another SPOF, or at least a significantly larger impact in case of an
SSD failure.

Christian
> Thanks for your reply.
> 
> Maran

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


