On Tue, 14 Oct 2014, Mark Nelson wrote:
> On 10/14/2014 12:15 AM, Nicheal wrote:
> > Yes, Greg.
> > But Unix-based systems always have a parameter, dirty_ratio, to
> > prevent the system memory from being exhausted. If the journal is so
> > fast that the backing store cannot catch up with it, then
> > backing-store writes will be blocked by the hard limit on system
> > dirty pages. The problem here may be that the sync() system call
> > cannot return, since the system always has lots of dirty pages.
> > Consequently, 1) FileStore::sync_entry() will time out and the
> > ceph-osd daemon will abort; 2) even if the thread does not time out,
> > the journal committed point cannot be updated, so the journal will
> > be blocked waiting for sync() to return and update it.
> > So the Throttle is added to solve the above problems, right?
>
> Greg or Sam can correct me if I'm wrong, but I always thought of the
> wbthrottle code as being more an attempt to smooth out spikes in write
> throughput to prevent the journal from getting too far ahead of the
> backing store, i.e. have more frequent, shorter flush periods rather
> than less frequent, longer ones. For Ceph that's probably a reasonable
> idea, since you want all of the OSDs behaving as consistently as
> possible to prevent hitting the max outstanding client IOs/bytes on
> the client and starving other ready OSDs. I'm not sure it's worked out
> in practice as well as it might have in theory, though I'm not sure
> we've really investigated what's going on enough to be sure.

Right. The fdatasync strategy means that the overall throughput is
lower, but the latencies are much more consistent. Without the
throttling we had huge spikes, which is even more problematic.

> > However, in my test ARM Ceph cluster (3 nodes, 9 OSDs, 3 OSDs/node;
> > SSD as journal and HDD as data disk; fio 4k random write, iodepth
> > 64), it causes a problem:
> > With WritebackThrottle enabled: based on blktrace, we traced the
> > back-end HDD IO behaviour. Because WritebackThrottle calls
> > fdatasync() frequently, every back-end HDD spends more time
> > finishing each IO, which makes the total sync time longer. For
> > example, the default sync_max_interval is 5 seconds, and the total
> > dirty data over 5 seconds is 10M. If I disable WritebackThrottle,
> > the 10M of dirty data is synced to disk within 4 seconds, so in
> > /proc/meminfo the dirty data of my system is always near zero.
> > However, if I enable WritebackThrottle, fdatasync() slows down the
> > sync process, so only 8-9M of random IO is synced to disk within 5s.
> > Thus the dirty data keeps growing toward the critical point (the
> > system limit), and then sync_entry() keeps timing out. So I mean, in
> > my case, with WritebackThrottle disabled I always get around 600
> > IOPS; with it enabled, IOPS drop to 200, since fdatasync() overloads
> > the back-end HDD.

It is true. One could probably disable wbthrottle and carefully tune
the kernel dirty_ratio and dirty_bytes (a rough sketch of those knobs
is below). As I recall, though, the problem was that inode writeback
was what was expensive, and there were no good kernel knobs for
limiting the dirty items in that cache. I would be very interested in
hearing about successes in this area.

Another promising direction is the batched fsync experiment that Dave
Chinner did a few months back. I'm not sure what the status is of
getting that into mainline, though, so it's not helpful anytime soon.
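For anyone who wants to try that first experiment (wbthrottle off,
dirty data capped by absolute bytes), the knobs involved would look
roughly like the sketch below. The values are purely illustrative, and
the wbthrottle option name should be double-checked against the Ceph
release in use.

    # ceph.conf -- turn off the FileStore writeback throttle
    [osd]
        filestore wbthrottle enable = false   ; verify this option exists in your release
        filestore max sync interval = 5       ; default flush interval, for reference

    # /etc/sysctl.conf -- cap dirty data by bytes instead of a ratio so the
    # limit does not scale with RAM; the numbers are only an example
    vm.dirty_background_bytes = 67108864      # start background writeback at 64M
    vm.dirty_bytes = 268435456                # throttle writers beyond 256M of dirty data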
> > So I would like us to be able to dynamically throttle the IOPS in
> > FileStore. We cannot know the average sync() speed of the backing
> > store, since different disks have different IO performance. However,
> > we can trace the average write speed in FileStore and the journal,
> > and we can also know whether start_sync() has returned and finished.
> > Thus, if in this interval the journal is writing so fast that the
> > back end cannot catch up with it (e.g. 1000 IOPS), we can throttle
> > the journal speed (e.g. to 800 IOPS) in the next operation interval
> > (the interval may be 1 to 5 seconds; in the third second the
> > throttle becomes 1000*e^-x, where x is the tick interval). If
> > journal writes reach the limit within this interval, the following
> > submitted writes should wait in the OSD waiting queue. So in this
> > way the journal may provide a burst of IO, but eventually the
> > back-end sync() will return and catch up with the journal, because
> > we always slow down the journal after several seconds.

Autotuning these parameters based on observed performance definitely
sounds promising! (A rough sketch of what such a feedback loop might
look like is below.)

sage
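As an illustration only, here is a minimal sketch (in C++, with
entirely hypothetical names; nothing below corresponds to actual code
in the Ceph tree) of the feedback loop described above: the journal
IOPS budget decays as max * e^-x while the backing store is behind,
and recovers once sync() catches up.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Hypothetical adaptive throttle: the journal submission path asks
    // budget() how many IOs it may issue this tick; everything beyond
    // that waits in the OSD queue.
    class AdaptiveJournalThrottle {
      double max_iops;          // ceiling measured on the journal device
      double allowed_iops;      // current budget handed to the journal
      uint64_t journal_seq = 0; // last op submitted to the journal
      uint64_t applied_seq = 0; // last op known synced to the backing store
      int behind_ticks = 0;     // consecutive ticks the backing store lagged

    public:
      explicit AdaptiveJournalThrottle(double max)
        : max_iops(max), allowed_iops(max) {}

      void note_journal_write(uint64_t seq) { journal_seq = seq; }
      void note_sync_complete(uint64_t seq) { applied_seq = seq; }

      // Called once per tick (say, every second).  While the journal is
      // far ahead of the last completed sync, shrink the budget
      // exponentially (max * e^-x); once sync catches up, recover
      // gradually toward the ceiling.
      void tick(uint64_t backlog_threshold) {
        uint64_t backlog = journal_seq - applied_seq;
        if (backlog > backlog_threshold) {
          ++behind_ticks;
          allowed_iops = max_iops * std::exp(-double(behind_ticks));
        } else {
          behind_ticks = 0;
          allowed_iops = std::min(max_iops, allowed_iops * 1.5);
        }
      }

      double budget() const { return allowed_iops; }
    };

The point of this shape is that the budget only shrinks while the sync
thread is observed to be lagging, so the journal can still absorb
bursts, yet the backing store is guaranteed to catch up because the
journal is slowed down within a few ticks.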