On Fri, Aug 27, 2010 at 10:24:16AM +0100, Mel Gorman wrote: > On Fri, Aug 27, 2010 at 10:16:48AM +0200, Johannes Weiner wrote: > > On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote: > > > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote: > > > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote: > > > > > If congestion_wait() is called when there is no congestion, the caller > > > > > will wait for the full timeout. This can cause unreasonable and > > > > > unnecessary stalls. There are a number of potential modifications that > > > > > could be made to wake sleepers but this patch measures how serious the > > > > > problem is. It keeps count of how many congested BDIs there are. If > > > > > congestion_wait() is called with no BDIs congested, the tracepoint will > > > > > record that the wait was unnecessary. > > > > > > > > I am not convinced that unnecessary is the right word. On a workload > > > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed > > > > the VM regressing both in time and in reclaiming the right pages when > > > > simply removing congestion_wait() from the direct reclaim paths (the > > > > one in __alloc_pages_slowpath and the other one in > > > > do_try_to_free_pages). > > > > > > > > So just being stupid and waiting for the timeout in direct reclaim > > > > while kswapd can make progress seemed to do a better job for that > > > > load. > > > > > > > > I can not exactly pinpoint the reason for that behaviour, it would be > > > > nice if somebody had an idea. > > > > > > > > > > There is a possibility that the behaviour in that case was due to flusher > > > threads doing the writes rather than direct reclaim queueing pages for IO > > > in an inefficient manner. So the stall is stupid but happens to work out > > > well because flusher threads get the chance to do work. > > > > The workload was accessing a large sparse-file through mmap, so there > > wasn't much IO in the first place. > > > > Then waiting on congestion was the totally wrong thing to do. We were > effectively calling sleep(HZ/10) and magically this was helping in some > undefined manner. Do you know *which* called of congestion_wait() was > the most important to you? Removing congestion_wait() in do_try_to_free_pages() definitely worsens reclaim behaviour for this workload: 1. wallclock time of the testrun increases by 11% 2. the scanners do a worse job and go for the wrong zone: -pgalloc_dma 79597 -pgalloc_dma32 134465902 +pgalloc_dma 297089 +pgalloc_dma32 134247237 -pgsteal_dma 77501 -pgsteal_dma32 133939446 +pgsteal_dma 294998 +pgsteal_dma32 133722312 -pgscan_kswapd_dma 145897 -pgscan_kswapd_dma32 266141381 +pgscan_kswapd_dma 287981 +pgscan_kswapd_dma32 186647637 -pgscan_direct_dma 9666 -pgscan_direct_dma32 1758655 +pgscan_direct_dma 302495 +pgscan_direct_dma32 80947179 -pageoutrun 1768531 -allocstall 614 +pageoutrun 1927451 +allocstall 8566 I attached the full vmstat contents below. Also the test program, which I ran in this case as: ./mapped-file-stream 1 $((512 << 30)) > > > > So personally I think it's a good idea to get an insight on the use of > > > > congestion_wait() [patch 1] but I don't agree with changing its > > > > behaviour just yet, or judging its usefulness solely on whether it > > > > correctly waits for bdi congestion. > > > > > > > > > > Unfortunately, I strongly suspect that some of the desktop stalls seen during > > > IO (one of which involved no writes) were due to calling congestion_wait > > > and waiting the full timeout where no writes are going on. > > > > Oh, I am in full agreement here! Removing those congestion_wait() as > > described above showed a reduction in peak latency. The dilemma is > > only that it increased the overall walltime of the load. > > > > Do you know why because leaving in random sleeps() hardly seems to be > the right approach? I am still trying to find out what's going wrong. > > And the scanning behaviour deteriorated, as in having increased > > scanning pressure on other zones than the unpatched kernel did. > > > > Probably because it was scanning more but not finding what it needed. > There is a condition other than congestion it is having trouble with. In > some respects, I think if we change congestion_wait() as I propose, > we may see a case where CPU usage is higher because it's now > encountering the unspecified reclaim problem we have. Exactly. > > So I think very much that we need a fix. congestion_wait() causes > > stalls and relying on random sleeps for the current reclaim behaviour > > can not be the solution, at all. > > > > I just don't think we can remove it based on the argument that it > > doesn't do what it is supposed to do, when it does other things right > > that it is not supposed to do ;-) > > > > We are not removing it, we are just stopping it going to sleep for > stupid reasons. If we find that wall time is increasing as a result, we > have a path to figuring out what the real underlying problem is instead > of sweeping it under the rug. Well, for that testcase it is in effect the same as a removal as there's never congestion. But again: I agree with your changes per-se, I just don't think they should get merged as long as they knowingly catalyze a problem that has yet to be identified.
#include <sys/types.h> #include <sys/mman.h> #include <sys/wait.h> #include <limits.h> #include <signal.h> #include <stdlib.h> #include <unistd.h> #include <errno.h> #include <stdio.h> static int start_process(unsigned long nr_bytes) { char filename[] = "/tmp/clog-XXXXXX"; unsigned long i; char *map; int fd; fd = mkstemp(filename); unlink(filename); if (fd == -1) { perror("mkstemp()"); return -1; } if (ftruncate(fd, nr_bytes)) { perror("ftruncate()"); return -1; } map = mmap(NULL, nr_bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); if (map == MAP_FAILED) { perror("mmap()"); return -1; } if (madvise(map, nr_bytes, MADV_RANDOM)) { perror("madvise()"); return -1; } kill(getpid(), SIGSTOP); for (i = 0; i < nr_bytes; i += 4096) ((volatile char *)map)[i]; close(fd); return 0; } static int do_test(unsigned long nr_procs, unsigned long nr_bytes) { pid_t procs[nr_procs]; unsigned long i; int dummy; for (i = 0; i < nr_procs; i++) { switch ((procs[i] = fork())) { case -1: kill(0, SIGKILL); perror("fork()"); return -1; case 0: return start_process(nr_bytes); default: waitpid(procs[i], &dummy, WUNTRACED); break; } } kill(0, SIGCONT); for (i = 0; i < nr_procs; i++) waitpid(procs[i], &dummy, 0); return 0; } static int xstrtoul(const char *str, unsigned long *valuep) { unsigned long value; char *endp; value = strtoul(str, &endp, 0); if (*endp || (value == ULONG_MAX && errno == ERANGE)) return -1; *valuep = value; return 0; } int main(int ac, char **av) { unsigned long nr_procs, nr_bytes; if (ac != 3) goto usage; if (xstrtoul(av[1], &nr_procs)) goto usage; if (xstrtoul(av[2], &nr_bytes)) goto usage; setbuf(stdout, NULL); setbuf(stderr, NULL); return !!do_test(nr_procs, nr_bytes); usage: fprintf(stderr, "usage: %s nr_procs nr_bytes\n", av[0]); return 1; }
Attachment:
vmstat.a.2
Description: Unix manual page
Attachment:
vmstat.b.2
Description: Unix manual page