Re: Erasure pool performance expectations

Nick Fisk <nick@xxxxxxxxxx> · Tue, 3 May 2016 15:24:51 +0100

Mark,

Thanks for pointing out the throttles, they completely slipped my
mind. But then it got me thinking: why weren't they kicking in and stopping
too many promotions from happening in the OP's case?

I had a quick look at my current OSD settings:

sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep promote
    "osd_tier_promote_max_objects_sec": "5242880",
    "osd_tier_promote_max_bytes_sec": "25",

Uh oh....they look the wrong way round to me?
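
A quick way to check other OSDs for the same thing, as a minimal sketch only.
The helper names are made up for illustration, and the "byte limit smaller
than object limit" test is a heuristic assumption, not anything Ceph itself
enforces. It parses the two lines shown above and flags the pair when the
byte limit is the smaller number (5242880 is 5 * 1024 * 1024, which reads
like a byte count rather than an object count).

```python
import json


def parse_config_lines(text):
    """Parse '"name": "value",' lines as printed by `config show | grep`."""
    settings = {}
    for line in text.splitlines():
        line = line.strip().rstrip(",")
        if not line:
            continue
        # Each line is one JSON member; wrap it in braces to parse it.
        name, value = json.loads("{" + line + "}").popitem()
        settings[name] = int(value)
    return settings


def throttles_look_swapped(settings):
    """Heuristic (an assumption): a byte limit below the object limit is suspicious."""
    max_bytes = settings["osd_tier_promote_max_bytes_sec"]
    max_objects = settings["osd_tier_promote_max_objects_sec"]
    return max_bytes < max_objects


sample = """
    "osd_tier_promote_max_objects_sec": "5242880",
    "osd_tier_promote_max_bytes_sec": "25",
"""
settings = parse_config_lines(sample)
print(throttles_look_swapped(settings))  # prints True for the values above
```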

GitHub shows the same:

https://github.com/ceph/ceph/search?utf8=%E2%9C%93&q=osd_tier_promote_max_bytes_sec
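
On the hit set settings discussed in the quoted messages below, here is a
rough model of the recency check, as a sketch only. It assumes IO is ongoing
so that a hit set exists for every hit_set_period, it reduces each hit set to
a yes/no answer for a single object, and it assumes an object qualifies only
if it appears in each of the most recent `recency` hit sets. The function
names are hypothetical and this is not how Ceph stores or evaluates hit sets
internally.

```python
def hit_set_history(read_times, now, period, count):
    """Newest-first list of booleans: was the object read in each window?"""
    newest = int(now // period)
    windows = {int(t // period) for t in read_times if t <= now}
    return [(newest - age) in windows for age in range(count)]


def should_promote(read_times, now, period=60, count=10, recency=2):
    """True if the object was read in each of the `recency` newest windows."""
    if recency > count:
        raise ValueError("recency cannot exceed the number of hit sets kept")
    return all(hit_set_history(read_times, now, period, count)[:recency])


# 1000 reads packed into the first minute, then checked 5 minutes later.
burst = [i * 0.05 for i in range(1000)]
print(should_promote(burst, now=359.0))           # False
# One read in each of the two newest one-minute windows.
print(should_promote([250.0, 310.0], now=315.0))  # True
```

The burst case matches the "1000 hits in 1 minute, then nothing" example:
all the reads land in a single hit set, so the recency requirement of 2 is
never met.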

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Mark Nelson
> Sent: 03 May 2016 15:05
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  Erasure pool performance expectations
> 
> In addition to what Nick said, it's really valuable to watch your cache
> tier write behavior during heavy IO.  One thing I noticed is you said you
> have 2 SSDs for journals and 7 SSDs for data.  If they are all of the same
> type, you're likely bottlenecked by the journal SSDs for writes, which
> compounded with the heavy promotions is going to really hold you back.
> 
> What you really want:
> 
> 1) (assuming filestore) equal large write throughput between the journals
> and data disks.
> 
> 2) promotions to be limited by some reasonable fraction of the cache tier
> and/or network throughput (say 70%).  This is why the user-configurable
> promotion throttles were added in jewel.
> 
> 3) The cache tier to fill up quickly when empty but change slowly once
> it's full (i.e. limiting promotions and evictions).  No real way to do
> this yet.
> 
> Mark
> 
> On 05/03/2016 08:40 AM, Peter Kerdisle wrote:
> > Thank you, I will attempt to play around with these settings and see
> > if I can achieve better read performance.
> >
> > Appreciate your insights.
> >
> > Peter
> >
> > On Tue, May 3, 2016 at 3:00 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >
> >
> >
> >     > -----Original Message-----
> >     > From: Peter Kerdisle [mailto:peter.kerdisle@xxxxxxxxx]
> >     > Sent: 03 May 2016 12:15
> >     > To: nick@xxxxxxxxxx
> >     > Cc: ceph-users@xxxxxxxxxxxxxx
> >     > Subject: Re:  Erasure pool performance expectations
> >     >
> >     > Hey Nick,
> >     >
> >     > Thanks for taking the time to answer my questions. Some in-line
> >     > comments.
> >     >
> >     > On Tue, May 3, 2016 at 10:51 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >     > Hi Peter,
> >     >
> >     >
> >     > > -----Original Message-----
> >     > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> >     > > Behalf Of Peter Kerdisle
> >     > > Sent: 02 May 2016 08:17
> >     > > To: ceph-users@xxxxxxxxxxxxxx
> >     > > Subject:  Erasure pool performance expectations
> >     > >
> >     > > Hi guys,
> >     > >
> >     > > I am currently testing the performance of RBD using a cache pool
> >     > > and a 4/2 erasure profile pool.
> >     > >
> >     > > I have two SSD cache servers (2 SSDs for journals, 7 SSDs for
> >     > > data) with 2x10Gbit bonded each, and six OSD nodes with a 10Gbit
> >     > > public and 10Gbit cluster network for the erasure pool (10x3TB
> >     > > without separate journal). This is all on Jewel.
> >     > >
> >     > > What I would like to know is whether the performance I'm seeing
> >     > > is to be expected, and whether there is some way to test this in
> >     > > a more quantifiable way.
> >     > >
> >     > > Everything works as expected if the files are present on the
> >     > > cache pool; however, when things need to be retrieved from the
> >     > > erasure pool I see performance degradation. I'm trying to
> >     > > simulate real usage as much as possible by retrieving files from
> >     > > the RBD volume over FTP from a client server. What I'm seeing is
> >     > > that the FTP transfer will stall for seconds at a time and then
> >     > > get some more data, which results in an average speed of
> >     > > 200KB/s. From the cache this is closer to 10MB/s. Is this the
> >     > > expected behaviour from an erasure-coded tier with a cache in
> >     > > front?
> >     >
> >     > Unfortunately yes. The whole Erasure/Cache thing only really works
> >     > well if the data in the EC tier is only accessed infrequently,
> >     > otherwise the overheads in cache promotion/flushing quickly bring
> >     > the cluster to its knees. However it looks as though you are
> >     > mainly doing reads, which means you can probably alter your cache
> >     > settings to not promote so aggressively on reads, as reads can be
> >     > proxied through to the EC tier instead of promoting. This should
> >     > reduce the number of cache promotions required.
> >     >
> >     > You are correct that reads have a lower priority for being cached;
> >     > in an ideal situation this should only be done when they are used
> >     > very frequently.
> >     >
> >     >
> >     > Can you try setting min_read_recency_for_promote to something
> >     > higher?
> >     >
> >     > I looked into the setting before but I must admit its exact
> >     > purpose still eludes me. Would it be correct to simplify it as
> >     > 'min_read_recency_for_promote determines the number of times a
> >     > piece would have to be read in a certain interval (set by
> >     > hit_set_period) in order to promote it to the caching tier'?
> >
> >     Yes that's correct. Every hit_set_period (assuming there is IO going
> >     on) a new hitset is created up until the hit_set_count limit. The
> >     recency defines how many of the last x hitsets an object must have
> >     been accessed in.
> >
> >     Tuning it is a bit of a dark art at the moment as you have to try
> >     and get all the values to match your workload. For starters try
> >     something like
> >
> >     min_read_recency_for_promote = 2 or 3
> >     hit_set_count = 10
> >     hit_set_period = 60
> >
> >     Which will mean that if an object is read in each of the last 2 or 3
> >     hitsets (i.e. within the last few minutes) it will be promoted.
> >     There is no granularity below a single hitset, so if an object gets
> >     hit 1000 times in 1 minute but then nothing for 5 minutes it will
> >     not cause a promotion.
> >
> >     >
> >     >
> >     > Also can you check what your hit_set_period and hit_set_count are
> >     > currently set to?
> >     >
> >     > hit_set_count is set to 1 and hit_set_period to 1800.
> >     >
> >     > What would increasing the hit_set_count do exactly?
> >     >
> >     >
> >     >
> >     > > Right now I'm unsure how to scientifically test the performance
> >     > > of retrieving files when there is a cache miss. If somebody could
> >     > > point me towards a better way of doing that I would appreciate
> >     > > the help.
> >     > >
> >     > > Another thing is that I'm seeing a lot of messages popping up in
> >     > > dmesg on my client server on which the RBD volumes are mounted.
> >     > > (IPs removed)
> >     > >
> >     > > [685881.477383] libceph: osd50 :6800 socket closed (con state OPEN)
> >     > > [685895.597733] libceph: osd54 :6808 socket closed (con state OPEN)
> >     > > [685895.663971] libceph: osd54 :6808 socket closed (con state OPEN)
> >     > > [685895.710424] libceph: osd54 :6808 socket closed (con state OPEN)
> >     > > [685895.749417] libceph: osd54 :6808 socket closed (con state OPEN)
> >     > > [685896.517778] libceph: osd54 :6808 socket closed (con state OPEN)
> >     > > [685906.690445] libceph: osd74 :6824 socket closed (con state OPEN)
> >     > >
> >     > > Is this a symptom of something?
> >     >
> >     > This is just stale connections to the OSDs timing out after the
> >     > idle period and is nothing to worry about.
> >     >
> >     > Glad to hear that, I was fearing something might be wrong.
> >     >
> >     > Thanks again.
> >     >
> >     > Peter
> >
> >
> >
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >