Re: Periodic evicting & flushing

Hey,

Thanks for the prompt response.

Let me add some inline comments below.
Hello,

On Tue, 22 Mar 2016 12:28:22 -0400 Maran wrote:

> Hey guys,
>
> I'm trying to wrap my head about the Ceph Cache Tiering to discover if
> what I want is achievable.
>
> My cluster exists of 6 OSD nodes with normal HDD and one cache tier of
> SSDs.
>
One cache tier being what, one node?
That's a SPOF and a disaster waiting to happen.
Thanks for the heads-up. It's just a test cluster at this point and a second cache node should be added soon. 

Also, the usual (so we're not comparing apples with oranges): what types of
SSDs, OS, Ceph version, network, everything.
The SSDs are Samsung MZ7KM120HAFD, Ceph is 9.2.1, and there's a 10Gbit cluster network plus a 10Gbit public network. The HDDs are consumer-grade 3TB disks.

> What I would love is that Ceph flushes and evicts data as soon as a file
> hasn't been requested by a client in a certain timeframe, even if there
> is enough space to keep it there longer. The reason I would prefer this
> is that I have a feeling overall performance suffers if new writes are
> coming into the cache tier while at the same time flush and evicts are
> happening.
>
You will want to read my recent thread titled
"Cache tier operation clarifications", where I asked for something along those lines.

The best thing you could do right now, and what I'm planning to do if flushing
turns out to be detrimental performance-wise (evictions should be very light,
impact-wise), is to lower the ratios at low-utilization times and raise them
again for peak times.
Again, read the thread above.
I've read your post; quite an interesting read, thanks for sharing your findings. One thing that wasn't clear to me is why you would prefer adjusting the ratios instead of forcing it with `rados -p cache cache-try-flush-evict-all`. Is there a difference between the two ways of flushing/evicting?
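
For clarity, the two approaches I'm comparing look roughly like this (the pool name "cache" and the ratio values are just placeholders for whatever a given cluster uses):

    # Approach 1: lower the ratios at quiet times so flushing starts earlier...
    ceph osd pool set cache cache_target_dirty_ratio 0.4
    ceph osd pool set cache cache_target_full_ratio 0.6
    # ...and raise them again before peak hours:
    ceph osd pool set cache cache_target_dirty_ratio 0.6
    ceph osd pool set cache cache_target_full_ratio 0.8

    # Approach 2: force an immediate flush/evict of the whole cache pool
    rados -p cache cache-try-flush-evict-all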

> It also seems that for some reason my cache node is not using the
> cluster network as much as I expected. Where all HDD nodes are using the
> cluster network to the fullest (multiple TBs) my SSD node only used 1GB
> on the cluster network. Is there anyway to diagnose this problem or is
> this intended behaviour? I expected the flushes to happen over the
> cluster network.
>
That is to be expected, as the cache tier is a client from the Ceph
perspective.

Unfortunate, but AFAIK there are no plans to change this behavior.
This could explain the drop in performance I'm seeing, then; I might bond the NICs in that case to get some more bandwidth.
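
To spell out how I read that, the traffic split in ceph.conf terms (the subnets below are just example placeholders, not my actual ranges) would be something like:

    [global]
    # client I/O, which includes the cache tier flushing/promoting to the base pool
    public network  = 10.0.0.0/24
    # OSD-to-OSD replication and recovery traffic
    cluster network = 10.0.1.0/24

so the flushes from the SSD node show up on the public network, which would match the mere 1GB I saw on its cluster-network NIC.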

> I appreciate any pointers you might have for me.
>
You will also definitely want to read the recent thread titled
"data corruption with hammer".
I'm not sure this is relevant to the version I'm running (9.2.1, not Hammer)?

One other question I have: would it make sense to run RAID-0 for improved write performance, if that's something I value over having more OSDs per node?

Thanks for your reply. 

Maran
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
