Re: Cache Tier Flush = immediate base tier journal sync?

Hi Greg,

Thanks for your input. I completely agree that we cannot expect developers
to fully document what impact each setting has on a cluster, particularly in
a performance-related way.

That said, if you or others could spare some time for a few pointers, it
would be much appreciated, and I will endeavour to create some useful
results/documents that are more relevant to end users.

I have taken on board what you said about the WB throttle and have been
experimenting with it by switching it on and off. I know it's a bit of a
blunt configuration change, but it was useful for understanding its effect.
With it off, I do initially see quite a large performance increase, but over
time it actually starts to slow the average throughput down. Like you said, I
am guessing this is to do with the throttle making sure the journal doesn't
get too far ahead; with it disabled, the backing store is left with massive
syncs to carry out.
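
For reference, this is roughly how I have been toggling it (just a sketch -
the runtime injectargs route is how I understand it, and some filestore
settings may only fully take effect after an OSD restart):

    # flip the writeback throttle off on all OSDs at runtime
    ceph tell osd.* injectargs '--filestore_wbthrottle_enable=false'

    # or persistently in ceph.conf under [osd], followed by an OSD restart
    [osd]
    filestore wbthrottle enable = false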

One thing I do see with the WBT enabled, and to some extent with it disabled,
is that there are long periods of small block writes at the maximum speed of
the underlying SATA disk (70-80 IOPS). Here are two blktrace/seekwatcher
traces of an OSD bench (64KB IOs for 500MB) where this behaviour can be seen.

http://www.sys-pro.co.uk/misc/wbt_on.png

http://www.sys-pro.co.uk/misc/wbt_off.png
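
In case anyone wants to reproduce these, the rough procedure is below (a
sketch from memory - the device name, trace prefix and OSD id are
placeholders, and the bench argument order may differ between releases):

    # trace the OSD's data disk while the bench runs (500MB total, 64KB writes)
    blktrace -d /dev/sdb -o osdtrace &
    ceph tell osd.0 bench 524288000 65536
    kill %1

    # render the seekwatcher graph from the collected trace
    seekwatcher -t osdtrace -o wbt_on.png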

I would really appreciate it if someone could comment on why this type of
behaviour happens. As can be seen in the traces, if the blocks are submitted
to the disk as larger IOs and with higher concurrency, hundreds of MB of
data can be flushed in seconds. Is this something specific to the filesystem
behaviour which Ceph cannot influence, like dirty filesystem metadata/inodes
which can't be merged into larger IOs?

For sequential writes, I would have thought that in an optimum scenario a
spinning disk should be able to almost maintain its large block write speed
(100MB/s) no matter the size of the individual writes. That said, from what I
understand, when a sync is called it will try to flush all of the dirty data,
so the end result is probably somewhat different to a traditional
battery-backed write-back cache.
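
To put rough numbers on it (back of the envelope, assuming ~75 IOPS and 64KB
writes as in the traces above): 64KB x 75 IOPS is roughly 4.7MB/s if every
small write hits the disk unmerged, versus ~100MB/s if the same data reaches
the disk as large sequential writes - about a 20x difference for the flush
phase.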

Chris, would you be interested in forming a ceph-users-based performance
team? There's a developer performance meeting, but it's mainly concerned with
improving the internals of Ceph. There is also a raft of information in the
mailing list archives where people have said "hey, look at my SSD speed at
x,y,z settings", but making comparisons or recommendations from those posts
is not that easy. It might also cut down on the repetitive "why is X so
slow?" style of posts.

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Gregory Farnum
> Sent: 16 March 2015 23:57
> To: Christian Balzer
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  Cache Tier Flush = immediate base tier journal
> sync?
> 
> On Mon, Mar 16, 2015 at 4:46 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> > On Mon, 16 Mar 2015 16:09:12 -0700 Gregory Farnum wrote:
> >
> >> Nothing here particularly surprises me. I don't remember all the
> >> details of the filestore's rate limiting off the top of my head, but
> >> it goes to great lengths to try and avoid letting the journal get too
> >> far ahead of the backing store. Disabling the filestore flusher and
> >> increasing the sync intervals without also increasing the
> >> filestore_wbthrottle_* limits is not going to work well for you.
> >> -Greg
> >>
> > While very true and what I recalled (backing store being kicked off
> > early) from earlier mails, I think having every last configuration
> > parameter documented in a way that doesn't reduce people to guesswork
> > would be very helpful.
> 
> PRs welcome! ;)
> 
> More seriously, we create a lot of config options and it's not always clear
> when doing so which ones should be changed by users or not. And a lot of
> them (case in point: anything to do with changing journal and FS interactions)
> should only be changed by people who really understand them, because it's
> possible (as evidenced) to really bust up your cluster's performance enough
> that it's basically broken.
> Historically that's meant "people who can read the code and understand it",
> although we might now have enough people at a mid-line that it's worth
> going back and documenting. There's not a lot of pressure coming from
> anybody to do that work in comparison to other stuff like "make CephFS
> supported" and "make RADOS faster" though, for understandable reasons.
> So while we can try and document these things some in future, the names of
> things here are really pretty self-explanatory and the sort of configuration
> reference guide I think you're asking for (ie, "here are all the settings to
> change if you are running on SSDs, and here's how they're related") is not
> the kind of thing that developers produce. That comes out of the community
> or is produced by support contracts.
> 
> ...so I guess I've circled back around to "PRs welcome!"
> 
> > For example "filestore_wbthrottle_xfs_inodes_start_flusher" which
> > defaults to 500.
> > Assuming that this means to start flushing once 500 inodes have
> > accumulated, how would Ceph even know how many inodes are needed
> for
> > the data present?
> 
> Number of dirtied objects, of course.
> 
> >
> > Lastly with these parameters, there is xfs and btrfs incarnations, no
> > ext4.
> > Do the xfs parameters also apply to ext4?
> 
> Uh, looks like it does, but I'm just skimming source right now so you should
> check if you change these params. :) -Greg
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



