On 2017/9/9 12:42 AM, Michael Lyle wrote:
> [sorry for resend, I am apparently not good at reply-all in gmail :P ]
>
> On Thu, Sep 7, 2017 at 10:52 PM, Coly Li <colyli@xxxxxxx> wrote:
> [snip history]
>> writeback_rate_minimum & writeback_rate are both readable/writable, and
>> writeback_rate_minimum should be less than or equal to writeback_rate
>> if I understand correctly.
>
> No, this is not true. writeback_rate is writable, but the control
> system replaces it at 5 second intervals. This is the same as current
> code. If you want writeback_rate to do something as a tunable, you
> should set writeback_percent to 0, which disables the control system
> and lets you set your own value-- otherwise whatever change you make
> is replaced in 5 seconds.
>
> writeback_rate_minimum is for use cases when you want to force
> writeback_rate to occur faster than the control system would choose on
> its own. That is, imagine you have an intermittent, write-heavy
> workload, and when the system is idle you want to clear out the dirty
> blocks. The default rate of 1 sector per second would do this very
> slowly-- instead you could pick a value that is a small percentage of
> disk bandwidth (preserving latency characteristics) but still fast
> enough to leave dirty space available.
>
>> Here I feel a check should be added here to make sure
>> writeback_rate_minimum <= writeback_rate when setting them into sysfs
>> entry.
>
> You usually (not always) will actually want to set
> writeback_rate_minimum to faster than writeback_rate, to speed up the
> current writeback rate.

This assumption is not always correct. If heavy front-end I/Os arrive
every "writeback_rate_update_seconds" seconds, the writeback rate is
raised to a high number just as the front-end I/Os come in, which may
hurt the I/O latency of those front-end I/Os. The period may not be
exactly "writeback_rate_update_seconds" seconds; this is just an example
of the kind of "interesting" I/O pattern where a higher
writeback_rate_minimum may not always be helpful.

>>> +	if ((error < 0 && dc->writeback_rate_integral > 0) ||
>>> +	    (error > 0 && time_before64(local_clock(),
>>> +			 dc->writeback_rate.next + NSEC_PER_MSEC))) {
>>> +		/* Only decrease the integral term if it's more than
>>> +		 * zero. Only increase the integral term if the device
>>> +		 * is keeping up. (Don't wind up the integral
>>> +		 * ineffectively in either case).
>>> +		 *
>>> +		 * It's necessary to scale this by
>>> +		 * writeback_rate_update_seconds to keep the integral
>>> +		 * term dimensioned properly.
>>> +		 */
>>> +		dc->writeback_rate_integral += error *
>>> +			dc->writeback_rate_update_seconds;
>>
>> I am not sure whether it is correct to calculate an integral value
>> here. error here is not a per-second value, it is already an
>> accumulated result over the past "writeback_rate_update_seconds"
>> seconds, so what does "error * dc->writeback_rate_update_seconds" mean?
>>
>> I know you are calculating an integral value of the error here, but
>> before I understand why you use
>> "error * dc->writeback_rate_update_seconds", I am not able to say
>> whether it is good or not.
>
> The calculation occurs every writeback_rate_update_seconds. An
> integral is the area under a curve.
>
> If the error is currently 1, and has been 1 for the past 5 seconds,
> the integral increases by 1 * 5 seconds. There are two approaches
> used in numerical integration, a "rectangular integration" (which this
> is, assuming the value has held for the last 5 seconds), and a
> "triangular integration", where the old value and the new value are
> averaged and multiplied by the measurement interval. It doesn't really
> make a difference-- the triangular integration tends to come up with a
> slightly more accurate value but adds some delay. (In this case, the
> integral has a time constant of thousands of seconds...)
>
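If I read the quoted hunk correctly, the accumulation is the textbook
rectangular rule: the newest error sample is assumed to have held for
the whole 5-second interval. Just to make sure we mean the same thing,
here is a throw-away user-space sketch of the two rules you describe --
the sample numbers are made up and none of this is the real bcache code:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const int64_t dt = 5;               /* update interval, seconds */
	int64_t error[] = { 40, 44, 60 };   /* made-up error samples    */
	int64_t rect = 0, trap = 0;

	for (size_t i = 1; i < sizeof(error) / sizeof(error[0]); i++) {
		/* rectangular: newest sample held for the whole interval */
		rect += error[i] * dt;
		/* "triangular" (trapezoidal): average of old and new samples */
		trap += (error[i - 1] + error[i]) * dt / 2;
	}
	printf("rectangular = %lld, trapezoidal = %lld\n",
	       (long long)rect, (long long)trap);
	return 0;
}

Either way, the integration only sees the value at each 5-second
update, which is where my concern below comes from.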
Hmm, imagine we have per-second sampling, and the data is:

time point	dirty data (MB)
1		1
2		1
3		1
4		1
5		10

Then a more accurate integral result should be: 1+1+1+1+10 = 14. But
with your "rectangular integration" the result will be 10*5 = 50.
Correct me if I am wrong, but IMHO 14 vs. 50 is a big difference.

>> In my current understanding, the effect of the above calculation is to
>> make the derivative value writeback_rate_update_seconds times bigger.
>> So it is expected to be faster than the current PD controller.
>
> The purpose of the proportional term is to respond immediately to how
> full the buffer is (this isn't a derivative value).
>
> If we consider just the proportional term alone, with its default
> value of 40, and the user starts writing 1000 sectors/second...
> eventually error will reach 40,000, which will cause us to write 1000
> blocks per second and be in equilibrium-- but the amount filled with
> dirty data will be off by 40,000 blocks from the user's calibrated
> value. The integral term works to take a long term average of the
> error and adjust the write rate, to bring the value back precisely to
> its setpoint-- and to allow a good writeback rate to be chosen for
> intermittent loads faster than its time constant.
>
>> I see 5 sectors/second is faster than 1 sector/second, is there any
>> other benefit to changing 1 to 5?
>
> We can set this back to 1 if you want. It is still almost nothing,
> and in practice more will be written in most cases (the scheduling
> targeting writing 1/second usually has to write more).
>

1 is the minimum writeback rate: even when there is heavy front-end
I/O, bcache still tries to write back at 1 sector/second. Let's keep it
at 1, so the maximum bandwidth is left to front-end I/Os for better
latency and throughput.

>>> +	dc->writeback_rate_p_term_inverse = 40;
>>> +	dc->writeback_rate_i_term_inverse = 10000;
>>
>> How are the above values selected? Could you explain the calculation
>> behind them?
>
> Sure. 40 is to try and write at a rate that retires the current blocks
> in 40 seconds. It's the "fast" part of the control system, and needs
> to not be too fast so it does not overreact to single writes. (e.g. if
> the system is quiet, and at the setpoint, and the user writes 4000
> blocks once, the P controller will try and write at an initial rate of
> 100 blocks/second). The i term is more complicated-- I made it very
> slow. It should usually be more than the p term squared * the
> calculation interval for stability, but there may be some
> circumstances when you want its control to be more effective than
> this. The lower the i term is, the quicker the system will come back
> to the setpoint, but the more potential there is for overshoot (moving
> past the setpoint) and oscillation.
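Just to confirm I follow how the two constants enter the calculation,
here is a throw-away user-space model. Only the names and defaults
(p_term_inverse = 40, i_term_inverse = 10000, 5 second update interval)
come from your patch; the rest is my own simplification, not the real
bcache code:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const int64_t p_inverse = 40;    /* writeback_rate_p_term_inverse default */
	const int64_t i_inverse = 10000; /* writeback_rate_i_term_inverse default */
	const int64_t interval  = 5;     /* writeback_rate_update_seconds         */

	int64_t error = 4000;            /* the one-off 4000-block write you mention */
	int64_t integral = 0;

	integral += error * interval;    /* rectangular accumulation, one update */

	int64_t p_term = error / p_inverse;      /* 4000 / 40     = 100/s */
	int64_t i_term = integral / i_inverse;   /* 20000 / 10000 = 2/s   */

	printf("P = %lld, I = %lld, rate = %lld sectors/s\n",
	       (long long)p_term, (long long)i_term, (long long)(p_term + i_term));
	return 0;
}

So with the defaults the P term does nearly all of the immediate work
and the I term only adds a couple of sectors/second per update, which
matches the "fast" vs. "slow" split you describe.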
> To take a numerical example with the case above, where the P term
> would end up off by 40,000 blocks, each 5 second update the I
> controller would be increasing the rate by 20 blocks/second initially
> to bring that 40,000 block offset under control.

Oh, I see. It seems what we need is just benchmark numbers for the
latency distribution. Since there is no existing data, I will collect a
data set myself. I can arrange to start the test by the end of this
month; right now I don't have continuous access to powerful hardware.

Thanks for the above information :-)

-- 
Coly Li