On Wed, Mar 18, 2015 at 8:04 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> Hi Greg,
>
> Thanks for your input and completely agree that we cannot expect developers
> to fully document what impact each setting has on a cluster, particularly in
> a performance related way.
>
> That said, if you or others could spare some time for a few pointers it
> would be much appreciated and I will endeavour to create some useful
> results/documents that are more relevant to end users.
>
> I have taken on board what you said about the WB throttle and have been
> experimenting with it by switching it on and off. I know it's a bit of a
> blunt configuration change, but it was useful to understand its effect. With
> it off, I do see initially quite a large performance increase but over time
> it actually starts to slow the average throughput down. Like you said, I am
> guessing this is to do with it making sure the journal doesn't get too far
> ahead, leaving it with massive syncs to carry out.
>
> One thing I do see with the WBT enabled and to some extent with it disabled,
> is that there are large periods of small block writes at the max speed of
> the underlying SATA disk (70-80 IOPS). Here are 2 blktrace seekwatcher traces
> of performing an OSD bench (64 KB IOs for 500 MB) where this behaviour can be
> seen.

If you're doing 64k IOs then I believe it's creating a new on-disk
file for each of those writes. How that's laid out on-disk will depend
on your filesystem and the specific config options that we're using to
try to avoid running too far ahead of the journal.

I think you're just using these config options in conflict with each
other. You've set the min sync time to 20 seconds for some reason,
presumably to try and batch stuff up? So in that case you probably
want to let your journal run for twenty seconds worth of backing disk
IO before you start throttling it, and probably 10-20 seconds worth of
IO before forcing file flushes.
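To put rough numbers on "twenty seconds worth of backing disk IO", here is a minimal sketch of the arithmetic. It assumes the figures quoted in this thread (64 KB writes, a disk that manages about 100 MB/s on large sequential writes but only about 75 small writes per second in the slow periods). The helper name `io_budget` is made up for illustration, and the thread does not say which Ceph options these budgets would map onto, so treat the output as a sizing estimate only.

```python
def io_budget(throughput_bytes_per_s, io_size_bytes, seconds):
    """Return (bytes, ops) the backing disk can absorb in `seconds`.

    Hypothetical helper, not part of Ceph: it only multiplies a
    sustained throughput by a time window.
    """
    total_bytes = throughput_bytes_per_s * seconds
    total_ops = total_bytes // io_size_bytes
    return total_bytes, total_ops


MB = 1000 * 1000
IO_SIZE = 64 * 1024            # 64 KB writes, as in the OSD bench above

# Assumption: best case, the disk sustains its large-block rate.
BEST_RATE = 100 * MB
# Assumption: worst case, ~75 separate 64 KB writes per second.
WORST_RATE = 75 * IO_SIZE

for label, rate in (("best", BEST_RATE), ("worst", WORST_RATE)):
    throttle_bytes, throttle_ops = io_budget(rate, IO_SIZE, 20)
    flush_bytes, flush_ops = io_budget(rate, IO_SIZE, 10)
    print(f"{label}: throttle after {throttle_bytes} bytes / {throttle_ops} ops, "
          f"flush after {flush_bytes} bytes / {flush_ops} ops")
```

Even in the worst case the 20 second window is well over a thousand 64 KB writes, so limits sized for a much shorter window would start throttling long before the 20 second sync interval is reached.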
That means increasing the throttle limits while still leaving the
flusher enabled.
-Greg

>
> http://www.sys-pro.co.uk/misc/wbt_on.png
>
> http://www.sys-pro.co.uk/misc/wbt_off.png
>
> I would really appreciate it if someone could comment on why this type of
> behaviour happens. As can be seen in the trace, if the blocks are submitted
> to the disk as larger IOs and with higher concurrency, hundreds of MB of
> data can be flushed in seconds. Is this something specific to the filesystem
> behaviour which Ceph cannot influence, like dirty filesystem metadata/inodes
> which can't be merged into larger IOs?
>
> For sequential writes, I would have thought that in an optimum scenario, a
> spinning disk should be able to almost maintain its large block write speed
> (100MB/s) no matter the underlying block size. That being said, from what I
> understand when a sync is called it will try and flush all dirty data so the
> end result is probably slightly different to a traditional battery backed
> write back cache.
>
> Chris, would you be interested in forming a ceph-users based performance
> team? There's a developer performance meeting which is mainly concerned with
> improving the internals of Ceph. There is also a raft of information on the
> mailing list archives where people have said "hey look at my SSD speed at
> x,y,z settings", but making comparisons or recommendations is not that easy.
> It may also reduce a lot of the repetitive posts of "why is X so
> slow....etc"
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com