On Tue, 12 Nov 2019 17:02:36 -0800 Andrew Morton wrote:
>
> On Tue, 12 Nov 2019 11:42:27 +0800 Hillf Danton <hdanton@xxxxxxxx> wrote:
> >
> > The elastic bdi (ebdi), which mirrors the bdi of the spinning disks,
> > SSDs and USB keys on the market, is introduced for balancing dirty
> > pages (bdp).
> >
> > The risk arises that the system runs out of free memory when too
> > many dirty pages are produced too soon, so bdp is needed in the
> > field.
> >
> > Ebdi facilitates bdp in elastic time intervals, e.g. from one jiffy
> > to one HZ, depending on the time it would take to increase dirty
> > pages by the amount defined by the variable ratelimit_pages.
> >
> > During cgroup writeback (cgwb) bdp, ebdi helps observe the changes
> > both in cgwb's dirty pages (dirty speed) and in its written-out
> > pages (laundry speed) in elastic time intervals, until a balance is
> > established between the two parties, i.e. the two speeds are
> > statistically equal.
> >
> > The above mechanism of elastic equilibrium effectively prevents
> > dirty page hogs, as no chance is left for dirty pages to pile up,
> > and thus cuts the risk that the system's free memory falls to an
> > unsafe level.
> >
> > Thanks to Rong Chen for testing.
>
> That sounds like a Tested-by:
>
Yes, Sir, will add Tested-by: Rong Chen <rong.a.chen@xxxxxxxxx>

> The changelog has no testing results.  Please prepare results which
> show, amongst other things, the change in performance when the kernel
> isn't tight on memory.  As well as the alteration in behaviour when
> memory is short.
>
Will do.

> Generally, please work on making this code much more understandable?
>
Will do.

> > ...
> >
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -811,6 +811,8 @@ static long wb_split_bdi_pages(struct bd
> >  	if (nr_pages == LONG_MAX)
> >  		return LONG_MAX;
> >  
> > +	return nr_pages;
> > +
> >  	/*
> >  	 * This may be called on clean wb's and proportional distribution
> >  	 * may not make sense, just use the original @nr_pages in those
> > @@ -1604,6 +1606,7 @@ static long writeback_chunk_size(struct
> >  		pages = min(pages, work->nr_pages);
> >  		pages = round_down(pages + MIN_WRITEBACK_PAGES,
> >  				   MIN_WRITEBACK_PAGES);
> > +		pages = work->nr_pages;
>
> It's unclear what this is doing, but it makes the three preceding
> statements non-operative.
>
This change, and the above one as well, is trying to bypass the
current write-bandwidth-based logic; a couple of rounds of cleanup
will be needed once it survives the LTP.

> >  	}
> >  
> >  	return pages;
> > @@ -2092,6 +2095,9 @@ void wb_workfn(struct work_struct *work)
> >  		wb_wakeup_delayed(wb);
> >  
> >  	current->flags &= ~PF_SWAPWRITE;
> > +
> > +	if (waitqueue_active(&wb->bdp_waitq))
> > +		wake_up_all(&wb->bdp_waitq);
>
> Please add a comment explaining why this is being done here.
>
After writing out some dirty pages, this is a checkpoint for seeing
whether a balance has been established between the dirty speed and the
laundry speed.  Tasks under throttling are unthrottled once a balance
is in place.  A comment will be added.
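Something along the following lines, say (the wording is only a sketch
for now):

	current->flags &= ~PF_SWAPWRITE;

	/*
	 * Some dirty pages were just written out, so the dirty speed
	 * and the laundry speed may have come back into balance; let
	 * the dirtiers throttled on bdp_waitq recheck and leave
	 * throttling once a balance is observed.
	 */
	if (waitqueue_active(&wb->bdp_waitq))
		wake_up_all(&wb->bdp_waitq);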
> >  }
> >  
> >  /*
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -1830,6 +1830,67 @@ pause:
> >  		wb_start_background_writeback(wb);
> >  }
> >  
> > +/**
> > + * cgwb_bdp_should_throttle()	tell if a wb should be throttled
> > + * @wb	bdi_writeback to throttle
> > + *
> > + * To avoid the risk of exhausting the system free memory, we check
> > + * and try much to prevent too many dirty pages from being produced
> > + * too soon.
> > + *
> > + * For cgroup writeback, it is essencially to keep an equilibrium
>
> "it is essential"?
>
Yes Sir.

> > + * between its dirty speed and laundry speed i.e. dirty pages are
> > + * written out as fast as they are produced in an ideal state.
> > + */
> > +static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb)
> > +{
> > +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> > +
> > +	if (fatal_signal_pending(current))
> > +		return false;
> > +
> > +	gdtc.avail = global_dirtyable_memory();
> > +
> > +	domain_dirty_limits(&gdtc);
> > +
> > +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
> > +		     global_node_page_state(NR_UNSTABLE_NFS) +
> > +		     global_node_page_state(NR_WRITEBACK);
> > +
> > +	if (gdtc.dirty < gdtc.bg_thresh)
> > +		return false;
> > +
> > +	if (!writeback_in_progress(wb))
> > +		wb_start_background_writeback(wb);
>
> This is a bit ugly.  Something called "bool cgwb_bdp_should_throttle()"
> should just check whether we should throttle.  But here it is, also
> initiating writeback.  That's an inappropriate thing for this function
> to do?
>
It is the current bdp behavior: dirty pages are kept below the
user-configurable background threshold by waking up the flushers,
because no dirty page is sent to disk without the flushers' efforts;
please see 143dfe8611a6 ("writeback: IO-less balance_dirty_pages()").
Will try to find some chance to pinch it out.

> Also, we don't know *why* this is being done here, because there's no
> code comment explaining the reasoning to us.
>
Will add a comment.

> > +	if (gdtc.dirty < gdtc.thresh)
> > +		return false;
> > +
> > +	/*
> > +	 * throttle wb if there is the risk that wb's dirty speed is
> > +	 * running away from its laundry speed, better with statistic
> > +	 * error taken into account.
> > +	 */
> > +	return wb_stat(wb, WB_DIRTIED) >
> > +	       wb_stat(wb, WB_WRITTEN) + wb_stat_error();
> > +}
> > +
> >
> > ...
> >
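For the dirtier side (the cgwb_bdp() body falls in the elided hunk
above), the intent is to sleep on bdp_waitq until the flusher observes
a balance; a minimal sketch, assuming cgwb_bdp() carries no logic
beyond the wait itself:

	/* throttle the dirtier until dirty and laundry speeds balance */
	static void cgwb_bdp(struct bdi_writeback *wb)
	{
		wait_event_interruptible(wb->bdp_waitq,
					 !cgwb_bdp_should_throttle(wb));
	}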
> > @@ -1888,29 +1945,38 @@ void balance_dirty_pages_ratelimited(str
> >  	 * 1000+ tasks, all of them start dirtying pages at exactly the same
> >  	 * time, hence all honoured too large initial task->nr_dirtied_pause.
> >  	 */
> > -	p = this_cpu_ptr(&bdp_ratelimits);
> > -	if (unlikely(current->nr_dirtied >= ratelimit))
> > -		*p = 0;
> > -	else if (unlikely(*p >= ratelimit_pages)) {
> > -		*p = 0;
> > -		ratelimit = 0;
> > -	}
> > +	dirty = this_cpu_ptr(&bdp_ratelimits);
> > +
> >  	/*
> >  	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
> >  	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
> >  	 * the dirty throttling and livelock other long-run dirtiers.
> >  	 */
> > -	p = this_cpu_ptr(&dirty_throttle_leaks);
> > -	if (*p > 0 && current->nr_dirtied < ratelimit) {
> > -		unsigned long nr_pages_dirtied;
> > -		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
> > -		*p -= nr_pages_dirtied;
> > -		current->nr_dirtied += nr_pages_dirtied;
> > +	leak = this_cpu_ptr(&dirty_throttle_leaks);
> > +
> > +	if (*dirty + *leak < ratelimit_pages) {
> > +		/*
> > +		 * nothing to do as it would take some more time to
> > +		 * eat out ratelimit_pages
> > +		 */
> > +		try_bdp = false;
> > +	} else {
> > +		try_bdp = true;
> > +
> > +		/*
> > +		 * bdp in flight helps detect dirty page hogs soon
> > +		 */
>
> How?  Please expand on this comment a lot.
>
We should be cautious here in the red zone, after paying the
ratelimit_pages price; we might soon have to tackle a deluge of dirty
page hogs.  Will cut the comment.

> > +		flights = this_cpu_ptr(&bdp_in_flight);
> > +
> > +		if ((*flights)++ & 1) {
>
> What is that "& 1" doing?
>
It helps to tell if a bdp is already in flight.  It would have been
something like

	if (*flights == 0) {
		(*flights)++;
	} else {
		*flights = 0;

> > +			*dirty = *dirty + *leak - ratelimit_pages;
> > +			*leak = 0;
> > +		}

	}

but I was curious to see the flights in the long run.

Thanks
Hillf

> >  	}
> >  	preempt_enable();
> >  
> > -	if (unlikely(current->nr_dirtied >= ratelimit))
> > -		balance_dirty_pages(wb, current->nr_dirtied);
> > +	if (try_bdp)
> > +		cgwb_bdp(wb);
> >  
> >  	wb_put(wb);
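P.S. To make the "& 1" semantics concrete, here is a standalone
userspace snippet (illustration only, not part of the patch) showing
that the patch-style toggle and the explicit one produce the same
drain pattern, with the former also counting the flights:

	#include <stdio.h>

	/* patch style: drain on every second call, counter keeps growing */
	static int toggle_and(int *flights)
	{
		return (*flights)++ & 1;
	}

	/* explicit style: flip between 0 and 1 */
	static int toggle_explicit(int *flights)
	{
		if (*flights == 0) {
			(*flights)++;
			return 0;
		}
		*flights = 0;
		return 1;
	}

	int main(void)
	{
		int a = 0, b = 0, i;

		for (i = 0; i < 6; i++)
			printf("call %d: & 1 -> %d, explicit -> %d\n",
			       i, toggle_and(&a), toggle_explicit(&b));
		return 0;
	}

Both print drain decisions 0, 1, 0, 1, ...; the "& 1" form keeps the
running count of bdp flights as a side effect.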