On 10/14/2014 02:19 PM, Mark Nelson wrote:
> On 10/14/2014 12:15 AM, Nicheal wrote:
>> Yes, Greg.
>> But Unix-based systems always have a parameter, dirty_ratio, to
>> prevent system memory from being exhausted. If the journal is so
>> fast that the backing store cannot catch up with it, backing store
>> writes will be blocked by the hard limit on system dirty pages. The
>> problem here may be that the system call sync() cannot return, since
>> the system always has lots of dirty pages. Consequently, 1)
>> FileStore::sync_entry() will time out and the ceph-osd daemon will
>> abort; 2) even if the thread is not timed out, the journal's
>> committed point cannot be updated, so the journal will be blocked,
>> waiting for sync() to return and update the committed point.
>> So the Throttle is added to solve the above problems, right?
>
> Greg or Sam can correct me if I'm wrong, but I always thought of the
> wbthrottle code as being more an attempt to smooth out spikes in
> write throughput to prevent the journal from getting too far ahead of
> the backing store, i.e. have more frequent, shorter flush periods
> rather than less frequent, longer ones. For Ceph that's probably a
> reasonable idea, since you want all of the OSDs behaving as
> consistently as possible to prevent hitting the max outstanding
> client IOs/bytes on the client and starving other ready OSDs. I'm
> not sure it's worked out in practice as well as it might have in
> theory, though I'm not sure we've really investigated what's going
> on enough to be sure.
>

I thought that as well. So in the case of an SSD-based OSD, where the
journal is on partition #1 and the data on partition #2, you would
disable wbthrottle, correct? Since the journal is just as fast as the
data partition.

>> However, in my test ARM Ceph cluster (3 nodes, 9 OSDs, 3 OSDs/node),
>> it causes problems (SSD as journal, HDD as data disk, fio 4k random
>> write, iodepth 64):
>> With WritebackThrottle enabled, we traced the back-end HDD I/O
>> behaviour with blktrace. Because WritebackThrottle frequently calls
>> fdatasync(), each back-end HDD I/O takes longer to finish, which
>> makes the total sync time longer. For example, the default
>> sync_max_interval is 5 seconds, and the total dirty data generated
>> in 5 seconds is 10 MB. If I disable WritebackThrottle, the 10 MB of
>> dirty data is synced to disk within 4 seconds, so in /proc/meminfo
>> the dirty data of my system stays near zero. However, if I enable
>> WritebackThrottle, fdatasync() slows down the sync process, and it
>> seems only 8-9 MB of random I/O is synced to disk within the 5 s.
>> The dirty data therefore keeps growing towards the critical point
>> (the system's upper limit), and then sync_entry() is always timed
>> out. So I mean, in my case, with WritebackThrottle disabled I may
>> always have 600 IOPS; with it enabled, IOPS always drop to 200,
>> since fdatasync() causes the back-end HDD to be overloaded.
>
> We never did a blktrace investigation, but we did see pretty bad
> performance with the default wbthrottle code when it was first
> implemented. We ended up raising the throttles pretty considerably
> in dumpling RC2. It would be interesting to repeat this test on an
> Intel system.
>
>> So I would like us to be able to dynamically throttle the IOPS in
>> FileStore. We cannot know the average sync() speed of the backing
>> store, since different disks have different I/O performance.
>> However, we can trace the average write speed in FileStore and the
>> journal, and we can also know whether start_sync() has returned and
>> finished. Thus, if the journal is currently writing so fast that
>> the back end cannot catch up with it (e.g. 1000 IOPS), we can then
>> throttle the journal speed (e.g. to 800 IOPS) in the next operation
>> interval (the interval may be 1 to 5 seconds; e.g. in the third
>> second the throttle becomes 1000*e^-x, where x is the tick
>> interval). If, within this interval, journal writes reach the
>> limit, subsequent submitted writes should wait in the OSD waiting
>> queue. In this way the journal may provide a boost of I/O, but
>> eventually the back-end sync() will return and catch up with the
>> journal, because we always slow down the journal speed after
>> several seconds.
>>
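The fdatasync() observation further up should be easy to reproduce
outside of Ceph, by the way. Below is a rough, untested sketch (not
Ceph code; the scratch file path and block counts are made up) that
writes the same 10 MB of random 4k blocks either with one fdatasync()
per write, loosely mimicking an aggressive wbthrottle, or with a
single flush at the end:

// fdatasync_test.cc -- standalone reproduction of the effect above.
// Build: g++ -O2 fdatasync_test.cc -o fdatasync_test
#include <fcntl.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main(int argc, char** argv) {
  bool per_write = argc > 1 && strcmp(argv[1], "--per-write") == 0;
  const int kBlocks = 2560;         // 2560 * 4 KB = 10 MB of dirty data
  const off_t kFileSize = 1 << 30;  // 1 GB scratch file
  char buf[4096];
  memset(buf, 'x', sizeof(buf));

  // Put the scratch file on the HDD under test.
  int fd = open("/tmp/scratch.bin", O_RDWR | O_CREAT, 0644);
  if (fd < 0) { perror("open"); return 1; }
  if (ftruncate(fd, kFileSize) < 0) { perror("ftruncate"); return 1; }

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kBlocks; ++i) {
    // Random 4 KB-aligned offset, as in the fio random-write test.
    off_t off = (off_t)(rand() % (kFileSize / 4096)) * 4096;
    if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf)) {
      perror("pwrite"); return 1;
    }
    if (per_write) fdatasync(fd);  // flush per write: throttled case
  }
  if (!per_write) fdatasync(fd);   // one big flush: unthrottled case
  double secs = std::chrono::duration<double>(
      std::chrono::steady_clock::now() - start).count();
  printf("%s: %.1f s, %.0f IOPS\n",
         per_write ? "per-write" : "batched", secs, kBlocks / secs);
  close(fd);
  return 0;
}

If the per-write variant shows the same 2-3x collapse in IOPS as your
blktrace numbers, that would confirm the bottleneck is the flush
frequency itself rather than the throttle bookkeeping.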
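And to make the dynamic throttle idea itself concrete, here is a
minimal sketch of the decay logic as I read the proposal. All names
(DynamicJournalThrottle, on_tick, try_submit) are hypothetical, not
existing FileStore/Journal interfaces:

// Sketch of the proposed dynamic journal throttle: decay the journal
// IOPS cap while the backing store lags, lift it once sync() catches up.
#include <algorithm>
#include <cmath>
#include <cstdint>

class DynamicJournalThrottle {
public:
  DynamicJournalThrottle(double floor_iops, double tick_secs)
      : floor_iops_(floor_iops), tick_secs_(tick_secs) {}

  // Called once per tick (1-5 s). journal_iops/backend_iops are the
  // rates measured over the last tick; sync_in_progress says whether
  // start_sync() has been issued but has not yet finished.
  void on_tick(double journal_iops, double backend_iops,
               bool sync_in_progress) {
    submitted_ = 0;  // new accounting window
    if (sync_in_progress && journal_iops > backend_iops) {
      ++pressure_ticks_;
      // E.g. 1000 IOPS decays to 1000 * e^-3 ~= 50 after three ticks
      // of sustained pressure, matching the 1000*e^-x in the proposal.
      limit_iops_ = std::max(
          floor_iops_,
          journal_iops * std::exp(-static_cast<double>(pressure_ticks_)));
    } else {
      pressure_ticks_ = 0;
      limit_iops_ = 0;  // 0 = no cap; the journal may burst again
    }
  }

  // Called for each op submitted to the journal; false means the op
  // should wait in the OSD queue until the next tick.
  bool try_submit() {
    if (limit_iops_ > 0 && submitted_ >= limit_iops_ * tick_secs_)
      return false;
    ++submitted_;
    return true;
  }

private:
  double floor_iops_;           // never throttle below this rate
  double tick_secs_;
  double limit_iops_ = 0;       // current cap; 0 = unlimited
  uint64_t submitted_ = 0;      // ops submitted in the current tick
  int pressure_ticks_ = 0;      // consecutive ticks the backend lagged
};

One open question would be how quickly to relax the cap once sync()
catches up; the sketch above simply removes it in one step, which may
just reproduce the burst/stall cycle at a longer period.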
> I will wait for Sam's input, but it seems reasonable to me. Perhaps
> you might write it up as a blueprint for CDS?
>
> Mark

-- 
Wido den Hollander
42on B.V.

Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on