On Wed, Mar 18, 2015 at 11:10 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> Hello,
>
> On Wed, 18 Mar 2015 11:05:47 -0700 Gregory Farnum wrote:
>
>> On Wed, Mar 18, 2015 at 8:04 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> > Hi Greg,
>> >
>> > Thanks for your input and completely agree that we cannot expect
>> > developers to fully document what impact each setting has on a
>> > cluster, particularly in a performance related way
>> >
>> > That said, if you or others could spare some time for a few pointers it
>> > would be much appreciated and I will endeavour to create some useful
>> > results/documents that are more relevant to end users.
>> >
>> > I have taken on board what you said about the WB throttle and have been
>> > experimenting with it by switching it on and off. I know it's a bit of
>> > a blunt configuration change, but it was useful to understand its
>> > effect. With it off, I do see initially quite a large performance
>> > increase but over time it actually starts to slow the average
>> > throughput down. Like you said, I am guessing this is to do with it
>> > making sure the journal doesn't get too far ahead, leaving it with
>> > massive syncs to carry out.
>> >
>> > One thing I do see with the WBT enabled and to some extent with it
>> > disabled, is that there are large periods of small block writes at the
>> > max speed of the underlying SATA disk (70-80 IOPS). Here are 2 blktrace
>> > seekwatcher traces of performing an OSD bench (64 KB I/Os for 500 MB)
>> > where this behaviour can be seen.
>>
>> If you're doing 64k IOs then I believe it's creating a new on-disk
>> file for each of those writes. How that's laid out on-disk will depend
>> on your filesystem and the specific config options that we're using to
>> try to avoid running too far ahead of the journal.
>>
> Could you elaborate on that a bit?
> I would have expected those 64KB writes to go to the same object (file)
> until it is full (4MB).
>
> Because this behavior would explain some (if not all) of the write
> amplification I've seen in the past with small writes (see the "SSD
> Hardware recommendation" thread).

Ah, no, you're right. With the bench command it all goes into one
object; it's just a separate transaction for each 64k write. But again,
depending on the flusher and throttler settings in the OSD and on the
backing FS's configuration, it can be a lot of individual updates. In
particular, every time there's a sync it has to update the inode.
Certainly that'll be the case in the described configuration, with
relatively low writeahead limits on the journal but high sync
intervals: once you hit the limits, every write will get an immediate
flush request.

But none of that should have much impact on your write amplification
tests unless you're actually using "osd bench" to test it. You're more
likely to be seeing the overhead of the pg log entry, pg info change,
etc. that's associated with each write.
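
To put rough numbers on the "separate transaction for each 64k write" point, here is a back-of-envelope sketch. It is not anything Ceph itself computes, and the function names are made up for illustration. It assumes that "500MB" means 500 MiB, that each 64 KiB write is exactly one transaction, and that in the worst case described above (every write getting an immediate flush) each transaction costs one disk operation at the 70-80 IOPS Nick measured.

```python
KIB = 1024
MIB = 1024 * KIB


def bench_transactions(total_bytes, block_bytes):
    """Number of write transactions if each block is its own transaction.

    Uses ceiling division so a trailing partial block still counts.
    """
    return -(-total_bytes // block_bytes)


def seconds_if_every_write_flushes(transactions, disk_iops):
    """Pessimistic run time, assuming one disk operation per transaction."""
    return transactions / disk_iops


if __name__ == "__main__":
    # Assumed bench parameters: 500 MiB total in 64 KiB writes.
    n = bench_transactions(500 * MIB, 64 * KIB)
    print(f"transactions: {n}")
    for iops in (70, 80):
        secs = seconds_if_every_write_flushes(n, iops)
        print(f"at {iops} IOPS: about {secs:.0f} s")
```

The real cost per write will differ, since each transaction also carries the pg log and pg info updates mentioned above, so treat this only as an order-of-magnitude estimate.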