Re: Cache Tier Flush = immediate base tier journal sync?

Hello,

On Wed, 18 Mar 2015 11:05:47 -0700 Gregory Farnum wrote:

> On Wed, Mar 18, 2015 at 8:04 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > Hi Greg,
> >
> > Thanks for your input and completely agree that we cannot expect
> > developers to fully document what impact each setting has on a
> > cluster, particularly in a performance-related way.
> >
> > That said, if you or others could spare some time for a few pointers it
> > would be much appreciated and I will endeavour to create some useful
> > results/documents that are more relevant to end users.
> >
> > I have taken on board what you said about the WB throttle and have been
> > experimenting with it by switching it on and off. I know it's a bit of
> > a blunt configuration change, but it was useful to understand its
> > effect. With it off, I do see quite a large initial performance
> > increase, but over time it actually starts to slow the average
> > throughput down. Like you said, I am guessing this is to do with it
> > making sure the journal doesn't get too far ahead, leaving it with
> > massive syncs to carry out.
> >
> > One thing I do see with the WBT enabled and to some extent with it
> > disabled, is that there are large periods of small block writes at the
> > max speed of the underlying SATA disk (70-80 IOPS). Here are 2 blktrace
> > seekwatcher traces of performing an OSD bench (64KB IOs for 500MB)
> > where this behaviour can be seen.
> 
> If you're doing 64k IOs then I believe it's creating a new on-disk
> file for each of those writes. How that's laid out on-disk will depend
> on your filesystem and the specific config options that we're using to
> try to avoid running too far ahead of the journal.
> 
Could you elaborate on that a bit?
I would have expected those 64KB writes to go to the same object (file)
until it is full (4MB).

Because this behavior would explain some (if not all) of the write
amplification I've seen in the past with small writes (see the "SSD
Hardware recommendation" thread).
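
A rough sketch of the arithmetic behind the two readings of that OSD
bench run (64KB IOs for 500MB). It assumes KB/MB mean KiB/MiB and that
the object size is the 4MB mentioned above; the function name is
illustrative only and is not part of Ceph.

```python
KIB = 1024
MIB = 1024 * KIB

def write_counts(total_bytes, io_size, object_size):
    """Return (number of writes, objects needed if writes fill objects in order)."""
    writes = -(-total_bytes // io_size)       # ceiling division
    objects = -(-total_bytes // object_size)  # ceiling division
    return writes, objects

# 500MB bench with 64KB IOs against 4MB objects.
writes, objects = write_counts(500 * MIB, 64 * KIB, 4 * MIB)
print("writes issued:           ", writes)
print("files if one per write:  ", writes)
print("files if objects filled: ", objects)
print("writes per 4MB object:   ", (4 * MIB) // (64 * KIB))
```

Under these assumptions the gap is 8000 files versus 125, which is the
kind of difference that would show up as write amplification.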

Christian

> I think you're just using these config options in conflict with
> each other. You've set the min sync time to 20 seconds for some reason,
> presumably to try and batch stuff up? So in that case you probably
> want to let your journal run for twenty seconds worth of backing disk
> IO before you start throttling it, and probably 10-20 seconds worth of
> IO before forcing file flushes. That means increasing the throttle
> limits while still leaving the flusher enabled.
> -Greg
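
A minimal sketch of the sizing Greg describes, i.e. how much work
"twenty seconds worth of backing disk IO" is. The 70-80 IOPS and 100MB/s
figures come from Nick's mail; the midpoint of 75 IOPS, decimal
megabytes, and the function name are assumptions made for illustration.
Mapping the results onto actual settings (presumably the FileStore
filestore_wbthrottle_* start_flusher and hard_limit options, which this
thread does not name) is left open.

```python
def throttle_budget(seconds, disk_iops, disk_bytes_per_sec):
    """Work the backing disk can absorb in `seconds`, as IO and byte counts."""
    return {
        "ios": int(seconds * disk_iops),
        "bytes": int(seconds * disk_bytes_per_sec),
    }

# Assumed figures: 20 s min sync time, ~75 IOPS for small writes,
# ~100 MB/s for large sequential writes (decimal megabytes).
budget = throttle_budget(20, 75, 100 * 10**6)
print(budget)  # {'ios': 1500, 'bytes': 2000000000}

# The same window when the disk is limited to 75 IOPS of 64KB writes.
small_write_bytes = 20 * 75 * 64 * 1024
print(small_write_bytes)  # 98304000
```

The spread between the two byte figures (about 98MB versus 2GB in 20
seconds) shows why the IO-count limit and the byte limit need to be
considered together for a small-block workload.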
> 
> >
> > http://www.sys-pro.co.uk/misc/wbt_on.png
> >
> > http://www.sys-pro.co.uk/misc/wbt_off.png
> >
> > I would really appreciate it if someone could comment on why this type
> > of behaviour happens. As can be seen in the trace, if the blocks are
> > submitted to the disk as larger IOs and with higher concurrency,
> > hundreds of MB of data can be flushed in seconds. Is this something
> > specific to the filesystem behaviour which Ceph cannot influence, like
> > dirty filesystem metadata/inodes which can't be merged into larger
> > IOs?
> >
> > For sequential writes, I would have thought that in an optimum
> > scenario, a spinning disk should be able to almost maintain its large
> > block write speed (100MB/s) no matter the underlying block size. That
> > being said, from what I understand when a sync is called it will try
> > and flush all dirty data so the end result is probably slightly
> > different to a traditional battery backed write back cache.
> >
> > Chris, would you be interested in forming a ceph-users based
> > performance team? There's a developer performance meeting which is
> > mainly concerned with improving the internals of Ceph. There is also a
> > raft of information on the mailing list archives where people have
> > said "hey look at my SSD speed at x,y,z settings", but making
> > comparisons or recommendations is not that easy. It may also reduce a
> > lot of the repetitive posts of "why is X so slow....etc"
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/