Re: Erasure pool performance expectations

Hello,

On Fri, 6 May 2016 09:58:31 +0200 Peter Kerdisle wrote:

> Hey Mark,
> 
> Sorry I missed your message as I'm only subscribed to daily digests.
> 
> 
> > Date: Tue, 3 May 2016 09:05:02 -0500
> > From: Mark Nelson <mnelson@xxxxxxxxxx>
> > To: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re:  Erasure pool performance expectations
> > Message-ID: <df3de049-a7f9-7f86-3ed3-47079e4012b9@xxxxxxxxxx>
> > Content-Type: text/plain; charset=windows-1252; format=flowed
> > In addition to what nick said, it's really valuable to watch your cache
> > tier write behavior during heavy IO.  One thing I noticed is you said
> > you have 2 SSDs for journals and 7 SSDs for data.
> 
> 
> I thought the hardware recommendations were 1 journal disk per 3 or 4
> data disks but I think I might have misunderstood it. 

Firstly, that's the ratio of journal SSDs to OSD HDDs.
Secondly, recommendations are all dandy, but you will want to understand
the reasons behind them and why they may not apply to your use case.

So, as Mark mentioned, the performance of the journals needs to match that
of the data devices.
More below.
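
As a back-of-the-envelope check of that 1:3-4 rule of thumb, here is a tiny
sketch; the throughput numbers are assumptions for illustration, not
measurements of your hardware:

# Rough check of the journal-SSD : OSD-HDD ratio (filestore: every write
# goes through the journal first). Replace the assumed speeds with your
# own measured sustained sync write numbers.
journal_ssd_mb_s = 400.0  # assumed journal SSD sync write speed
osd_hdd_mb_s = 120.0      # assumed HDD streaming write speed

hdds_per_journal = journal_ssd_mb_s / osd_hdd_mb_s
print("one journal SSD can keep up with ~%.1f HDDs" % hdds_per_journal)
# -> ~3.3, which is where the usual 3-4 HDDs per journal SSD comes from.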

> Looking at my
> journal read/writes they seem to be ok though:
> https://www.dropbox.com/s/er7bei4idd56g4d/Screenshot%202016-05-06%2009.55.30.png?dl=0
>
That's a bit ambiguous; which Ceph performance counter (or other source) is
that based on?
And journal SSDs will never see any reads, except during crash recovery. 

That said, if these are indeed MB/s writes to your journal SSDs, it
shouldn't be a problem.
Of course you want to verify that with atop, iostat, etc.
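
For example, something along these lines (a quick sketch reading
/proc/diskstats on Linux; the device names are placeholders) will show the
actual write rate per device, so you can tell whether the journal SSDs or
the data SSDs are the ones absorbing the writes:

#!/usr/bin/env python
# Minimal per-device write-rate sampler based on /proc/diskstats.
# Device names are placeholders; put your journal and data devices here.
import time

DEVICES = ["sda", "sdb"]

def sectors_written(dev):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return int(fields[9])  # 10th field: sectors written (512 bytes each)
    raise ValueError("device %s not found" % dev)

before = dict((d, sectors_written(d)) for d in DEVICES)
time.sleep(10)
after = dict((d, sectors_written(d)) for d in DEVICES)

for d in DEVICES:
    mb_s = (after[d] - before[d]) * 512 / 10.0 / (1024 * 1024)
    print("%s: %.1f MB/s written" % (d, mb_s))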

> However I started running into a lot of slow requests (made a separate
> thread for those: Diagnosing slow requests) and now I'm hoping these
> could be related to my journaling setup.
>
Doubt it, unless you're running out of other resources like CPU.
The most likely suspect would be what you mentioned in that thread: the
network.

That, or your EC pool really is so overloaded that it drops the ball.
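
If you want to see where the time is actually going, the admin socket can
dump the slowest recent ops per OSD. A sketch (the JSON key names differ
slightly between releases, hence the fallback):

#!/usr/bin/env python
# Sketch: print the slowest recent ops on one OSD via the admin socket.
# Run this on the OSD host; OSD_ID is a placeholder.
import json
import subprocess

OSD_ID = 0

out = subprocess.check_output(
    ["ceph", "daemon", "osd.%d" % OSD_ID, "dump_historic_ops"])
data = json.loads(out.decode("utf-8"))
ops = data.get("ops") or data.get("Ops") or []

for op in sorted(ops, key=lambda o: o.get("duration", 0), reverse=True)[:5]:
    print("%8.3fs  %s" % (op.get("duration", 0), op.get("description", "?")))

The per-op event timestamps in that output usually make it obvious whether
the time is spent waiting on other OSDs (network) or on the local
journal/disk.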
 
> 
> > If they are all of
> > the same type, you're likely bottlenecked by the journal SSDs for
> > writes, which compounded with the heavy promotions is going to really
> > hold you back.
> > What you really want:
> > 1) (assuming filestore) equal large write throughput between the
> > journals and data disks.
> 
> How would one achieve that?
> 
You never mentioned what actual SSDs you're using.
If they're all the same, forget about dedicated journals (the performance
advantage isn't that great, especially when considering your backing pool)
and make them all cache OSDs.

If the journal ones are significantly faster than the data ones (I doubt
it unless they are large NVMes), make sure that their write speed 
roughly matches the combined write speed of the data SSDs.

Of course, there is little point in speccing things faster than your
network speed.
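
If you want to check how they compare, a sketch along these lines (it
shells out to fio, which is destructive when pointed at raw devices, so
only use spare disks or test files; the device paths are placeholders):

#!/usr/bin/env python
# Sketch: compare sustained sync write throughput of the journal device
# with the combined throughput of the data devices, using fio.
# WARNING: fio in write mode destroys data on the target devices.
import json
import subprocess

JOURNAL_DEV = "/dev/nvme0n1"          # placeholder
DATA_DEVS = ["/dev/sdb", "/dev/sdc"]  # placeholders

def seq_write_mb_s(dev):
    out = subprocess.check_output([
        "fio", "--name=seqwrite", "--filename=%s" % dev,
        "--rw=write", "--bs=1M", "--direct=1", "--sync=1",
        "--iodepth=1", "--numjobs=1",
        "--runtime=30", "--time_based", "--output-format=json"])
    job = json.loads(out.decode("utf-8"))["jobs"][0]
    return job["write"]["bw"] / 1024.0  # fio reports bandwidth in KiB/s

journal = seq_write_mb_s(JOURNAL_DEV)
data_total = sum(seq_write_mb_s(d) for d in DATA_DEVS)
print("journal: %.0f MB/s, data combined: %.0f MB/s" % (journal, data_total))

If the journal number isn't at least in the neighbourhood of the combined
data number (or of your network speed, whichever is lower), the dedicated
journals are the bottleneck.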

> >
> > 2) promotions to be limited by some reasonable fraction of the cache
> > tier and/or network throughput (say 70%).  This is why the
> > user-configurable promotion throttles were added in jewel.
> 
> Are these already in the docs somewhere?
> 
Doubt it; the potentially quite useful readforward and readproxy modes
weren't documented either, the last time I looked.

But Nick mentioned them (and the confusion around their default values).
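
For the record, the knobs in question as of jewel, poked from the CLI (the
option names are from the jewel release; the values below are examples,
not recommendations, and the defaults are worth double-checking on your
cluster):

#!/usr/bin/env python
# Sketch: inspect and adjust the jewel promotion throttles and the cache
# mode. Pool name and values are placeholders/examples. The "ceph daemon"
# call needs to run on the OSD host; the rest can run from any admin node.
import subprocess

CACHE_POOL = "cache"  # placeholder: your cache tier pool

# Dump the running config of one OSD; grep for osd_tier_promote to see the
# current throttle values.
subprocess.call(["ceph", "daemon", "osd.0", "config", "show"])

# Adjust the throttles at runtime (persist them in ceph.conf [osd] as well).
subprocess.call(["ceph", "tell", "osd.*", "injectargs",
                 "--osd_tier_promote_max_bytes_sec 5242880 "
                 "--osd_tier_promote_max_objects_sec 25"])

# Switch the cache tier to readproxy mode.
subprocess.call(["ceph", "osd", "tier", "cache-mode", CACHE_POOL, "readproxy"])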


Christian

> >
> > 3) The cache tier to fill up quickly when empty but change slowly once
> > it's full (ie limiting promotions and evictions).  No real way to do
> > this yet.
> > Mark
> 
> 
> Thanks for your thoughts.
> 
> Peter


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


