On Wed, Sep 23, 2020 at 09:33:18PM +0800, Yafang Shao wrote:
> Our users reported that there are some random latency spikes when their
> RT process is running. We finally found that the latency spike is caused
> by FADV_DONTNEED, which may call lru_add_drain_all() to drain the LRU
> cache on remote CPUs and then wait for the per-cpu work to complete.
> The wait time is uncertain and may be tens of milliseconds.
>
> That behavior is unreasonable, because this process is bound to a
> specific CPU and the file is only accessed by itself; IOW, there should
> be no pagecache pages on a per-cpu pagevec of a remote CPU. That
> unreasonable behavior is partially caused by an incorrect comparison
> between the number of invalidated pages and the number of pages in the
> target range. For example:
>
> 	if (count < (end_index - start_index + 1))
>
> The count above is how many pages were invalidated on the local CPU, and
> (end_index - start_index + 1) is how many pages should be invalidated.
> Using (end_index - start_index + 1) is incorrect, because these values
> are page offsets within the file, which may not all be backed by pages
> in the page cache. Besides that, there may be holes between start and
> end. So we'd better check whether there are still pages on a per-cpu
> pagevec after draining the local CPU, and only then decide whether or
> not to call lru_add_drain_all().
>
> After I applied it as a hotfix to our production environment, most of
> the lru_add_drain_all() calls could be avoided.
>
> Suggested-by: Mel Gorman <mgorman@xxxxxxx>
> Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>

I think that's ok. Does it succeed with the original test case from the
commit that introduced the behaviour, and with one modified to truncate
part of the mapping?

-- 
Mel Gorman
SUSE Labs
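For reference, a rough sketch of how the POSIX_FADV_DONTNEED path in
mm/fadvise.c could look with the check described above. The helper name
invalidate_mapping_pagevec() and its exact signature are assumptions for
illustration, not quoted from the patch; the idea is that it reports how
many pages it had to skip because they were still sitting on a pagevec:

	if (end_index >= start_index) {
		unsigned long nr_pagevec = 0;

		/*
		 * Invalidate the range after draining only the local
		 * CPU. The (assumed) helper counts pages that could not
		 * be invalidated because they are still on a per-cpu
		 * pagevec somewhere.
		 */
		lru_add_drain();
		invalidate_mapping_pagevec(mapping, start_index,
					   end_index, &nr_pagevec);

		/*
		 * Only pay for the expensive remote drain when pages
		 * were actually found on a pagevec, instead of comparing
		 * the invalidated count against the raw index range,
		 * which overcounts holes and non-resident offsets.
		 */
		if (nr_pagevec) {
			lru_add_drain_all();
			invalidate_mapping_pages(mapping, start_index,
						 end_index);
		}
	}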
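As for the test case, something along these lines would exercise both
scenarios; the file path and sizes are arbitrary choices for
illustration, and passing "truncate" cuts the file in half before the
advice so the advised range contains a hole:

/* Rough reproducer: time POSIX_FADV_DONTNEED on a freshly written file,
 * optionally truncating part of it first so the range contains a hole.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define FILESZ	(64UL << 20)	/* 64MB test file, arbitrary size */

int main(int argc, char **argv)
{
	static char buf[1 << 20];
	struct timespec t0, t1;
	int truncate_half = (argc > 1 && !strcmp(argv[1], "truncate"));
	int fd = open("/tmp/fadvise-test", O_CREAT | O_TRUNC | O_RDWR, 0600);
	unsigned long i;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Populate the page cache with FILESZ worth of dirty pages. */
	memset(buf, 'x', sizeof(buf));
	for (i = 0; i < FILESZ / sizeof(buf); i++) {
		if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
			perror("write");
			return 1;
		}
	}

	/* Optionally drop the tail so the advised range has a hole. */
	if (truncate_half && ftruncate(fd, FILESZ / 2)) {
		perror("ftruncate");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (posix_fadvise(fd, 0, FILESZ, POSIX_FADV_DONTNEED)) {
		perror("posix_fadvise");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("FADV_DONTNEED took %ld us\n",
	       (t1.tv_sec - t0.tv_sec) * 1000000L +
	       (t1.tv_nsec - t0.tv_nsec) / 1000L);

	close(fd);
	unlink("/tmp/fadvise-test");
	return 0;
}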