Re: Periodic evicting & flushing

Maran <maran@xxxxxxxxxxxxxx> · Wed, 23 Mar 2016 04:46:50 -0400

Hey,

-------- Original Message --------
Hello,

On Wed, 23 Mar 2016 03:24:33 -0400 Maran wrote:

>> Hey,
>> 
>> Thanks for the prompt response.
>> 
>> Let me put some inline comments.
>>
>Those are much more readable if properly indented/quoted by ye olde ">".

My apologies, I'm using a new webservice that sadly forces me to use a WYSIWYG editor that I can't turn off for the moment. I'm doing some copy-pasting this time around to remove as much of the HTML as I can. Hope this reads better.

>> Hello,
>> 
>> On Tue, 22 Mar 2016 12:28:22 -0400 Maran wrote:
>> 
> >> Hey guys,
> >>
> >> I'm trying to wrap my head around Ceph Cache Tiering to discover if
> >> what I want is achievable.
> >>
> >> My cluster consists of 6 OSD nodes with normal HDDs and one cache tier of
> >> SSDs.
> >>
>> One cache tier being what, one node?
>> That's a SPOF and disaster waiting to happen.
>> Thanks for the heads-up. It's just a test cluster at this point and a
>> second cache node should be added soon.
>>
> While you certainly can get away with a replication size of 2 with
> reliable and well monitored (SMART, media wearout) SSDs, keep in mind that
> all your critical, most recently updated data will be in the cache and only
> the cache.
> I got a dual node cache tier but would feel more comfortable if it was a
> triple one.
This is another reason I would love to flush files more often. Once files have been written, there is usually quite a long time before they are retrieved again, and when they are, retrieval doesn't have to be very fast in most cases. In these instances the 'penalty' of retrieving them from the slow pool is worth it. The cluster should always be optimised for fast writing.

>> 
>> Also the usual (so we're not comparing apples with oranges), as in what
>> types of SSDs, OS, Ceph versions, network, everything.
>> SSDs are Samsung MZ7KM120HAFD. Ceph 9.2.1. 10Gbit cluster and a 10Gbit
>> public network. HDDs are consumer grade 3TB disks.
>>
>If that's a test cluster you may as well do an early adoption of Jewel when
>it's released.
Since this is my first time playing with Ceph, I wanted to get familiar with the usual commands first before diving into unstable branches. I already made my whole cluster wipe itself out playing around with different erasure profiles, and this was on a stable branch. In the next phase I will see if I can make it crash again and submit a proper issue report, but that's a totally different topic ;)

>Those SSDs should be fine, I just don't have any first hand experience
>with them.
Since we are on the topic anyway, what would you recommend for SSDs?

> 
> > >What I would love is that Ceph flushes and evicts data as soon as a
> > >file hasn't been requested by a client in a certain timeframe, even if
> > >there is enough space to keep it there longer. The reason I would
> > >prefer this is that I have a feeling overall performance suffers if
> > >new writes are coming into the cache tier while at the same time flush
> > >and evicts are happening.
> >
> >You will want to read my recent thread titled
> >"Cache tier operation clarifications"
> 
> >where I asked for something along those lines.
> 
> >The best thing you could do right now and which I'm planning to do if
> >flushing (evictions should be very light impact wise) turns out to be
> >detrimental performance wise is to lower the ratios at low utilization
> >times and raise them again for peak times.
> >Again, read the thread above.
> >I've read your post, quite an interesting read, thanks for sharing your
> >findings. One thing that wasn't clear to me is why you would prefer
> >setting the ratio instead of forcing it with `rados -p cache
> >cache-try-flush-evict-all`. Is there a difference between the two ways
> >of flushing/evicting?
> 
> I would definitely NOT want to use that command as it does exactly what its
> name implies, flush AND evict.
> Flushes write dirty data back to the base pool, they can cause a lot of I/O
> and data transfer, so I would like to time them accordingly.

> Eviction removes CLEAN (already flushed) objects from the cache (to keep
> enough space free normally with the ratios set). 
> That's just shortening the object to 0 bytes, a very low impact action.
> But you do NOT want to evict objects unless required, as they may become
> hot again.

> That's why I asked for a "cache-try-flush" like command, preferably with a
> target ratio of dirty objects and an IO priority setting as well.
Very good point, I didn't fully realise this. 
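
For reference, a minimal sketch of the "lower the ratios at low utilization times and raise them again for peak times" idea quoted above. It only builds the `ceph osd pool set ... cache_target_dirty_ratio` command and prints it. The pool name `cache` is taken from the `rados -p cache` example in this thread; the ratio values and the quiet window are made-up placeholders, not recommendations.

```python
from datetime import datetime, time

# All values below are assumptions for illustration; the thread gives no
# concrete ratios or hours.
POOL = "cache"            # pool name as in `rados -p cache ...` above
QUIET_RATIO = 0.2         # assumed: flush more eagerly during quiet hours
PEAK_RATIO = 0.6          # assumed: let dirty data accumulate during peak hours
QUIET_START = time(1, 0)  # assumed quiet window: 01:00 up to (not incl.) 06:00
QUIET_END = time(6, 0)


def ratio_for(now):
    """Return the dirty ratio to apply at the given time of day."""
    if QUIET_START <= now < QUIET_END:
        return QUIET_RATIO
    return PEAK_RATIO


def build_command(now, pool=POOL):
    """Build (but do not run) the ceph CLI call that sets the dirty ratio."""
    return ["ceph", "osd", "pool", "set", pool,
            "cache_target_dirty_ratio", str(ratio_for(now))]


if __name__ == "__main__":
    # Print the command for the current time; run it from cron or by hand.
    print(" ".join(build_command(datetime.now().time())))
```

Printing instead of executing keeps the sketch safe to try on any machine; something like cron could run the printed command at the start and end of the quiet window.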

> 
> > >It also seems that for some reason my cache node is not using the
> > >cluster network as much as I expected. Where all HDD nodes are using
> > >the cluster network to the fullest (multiple TBs) my SSD node only
> > >used 1GB on the cluster network. Is there any way to diagnose this
> > > problem or is this intended behaviour? I expected the flushes to
> > > happen over the cluster network.
> >
> > That is to be expected, as the cache tier is a client from the Ceph
> > perspective.
> 
> > Unfortunate, but AFAIK there are no plans to change this behavior.
> > This could explain the drop in performance I'm seeing then. I might
> >bond the NICs in that case to get some more bandwidth.
> 
> Define drop in performance. 
It 'seems', but I need to put more tests into this, that when flushing files the overall write speed of files coming into the cluster goes down. That would make sense, since fewer IOPS are available. The problem is that, in theory, this seems to be the expected behaviour: the cache fills until it reaches cache_target_dirty_ratio, and from then on every new write also forces a flush to the slower disks. So in theory this is what your cluster will be doing most of the time unless you start bringing the ratio down during quiet times, like you suggested.
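
To make that reasoning concrete, here is a rough back-of-the-envelope sketch. It is a deliberately simplified model, not how the tiering agent actually accounts for things: it assumes the dirty threshold is just the cache capacity times cache_target_dirty_ratio, and the capacity and ratio numbers are invented for illustration.

```python
def flush_threshold(capacity_gib, dirty_ratio):
    """Dirty data (GiB) at which flushing starts in this simplified model."""
    return capacity_gib * dirty_ratio


def flush_needed(dirty_gib, incoming_gib, capacity_gib, dirty_ratio):
    """GiB that must be flushed to keep dirty data at or below the threshold
    after `incoming_gib` of new writes land in the cache."""
    excess = dirty_gib + incoming_gib - flush_threshold(capacity_gib, dirty_ratio)
    return max(0.0, excess)


if __name__ == "__main__":
    capacity = 100.0  # hypothetical cache capacity in GiB
    ratio = 0.4       # hypothetical cache_target_dirty_ratio

    # Below the threshold: new writes land without triggering any flushing.
    print(flush_needed(10.0, 5.0, capacity, ratio))
    # At the threshold: every GiB written means a GiB flushed at the same time.
    print(flush_needed(40.0, 5.0, capacity, ratio))
    # Lowering the ratio in a quiet period drains dirty data ahead of the peak.
    print(flush_needed(40.0, 0.0, capacity, 0.2))
```

In this model the second case is the steady state described above: once the cache sits at the threshold, incoming writes and flushes compete for the same disks, which is why moving the flushing into quiet hours looks attractive.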

> Note that with a cache tier in normal (writeback) mode all traffic will
> have to go through the cache node(s). 
> So in your case, instead of writing to or reading from (should be more
> visible here) 6 nodes, you just interact with 1.
> Of course bandwidth is for most use cases the thing you will need the
> least, as opposed to many, fast IOPS.

> I'm happy with a single network for client and cluster traffic, but that
> all depends on your needs (client traffic) and abilities of your storage
> nodes.

> 
> >> I appreciate any pointers you might have for me.
> >
> >You will also want to definitely read the recent thread titled
> >"data corruption with hammer".
> >Not sure this is relevant for the version I'm running?
> >
> Nope, but Infernalis breaks things with EC pools at least as well, AFAIK.
> Jewel should be fine.

>> One other question I have is would it make sense to run RAID-0 for
>> improved write performance if that's something I value over more OSDs
>> per node?
>> 
> Aside from RAID0 not offering performance benefits (except maybe for
> large, sequential writes) over individual OSDs, it also introduces another
> SPOF or at least a significantly larger impact in case of an SSD failure.

Understood, I will forgo this plan. Any other tips that could help bring up write speeds?

Maran
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com