On Fri, 2010-07-30 at 15:06 -0700, Andrew Morton wrote: > On Fri, 30 Jul 2010 14:37:00 +0100 > Mel Gorman <mel@xxxxxxxxx> wrote: > > > There are a number of cases where pages get cleaned but two of concern > > to this patch are; > > o When dirtying pages, processes may be throttled to clean pages if > > dirty_ratio is not met. > > Ambiguous. I assume you meant "if dirty_ratio is exceeded". > > > o Pages belonging to inodes dirtied longer than > > dirty_writeback_centisecs get cleaned. > > > > The problem for reclaim is that dirty pages can reach the end of the LRU if > > pages are being dirtied slowly so that neither the throttling or a flusher > > thread waking periodically cleans them. > > > > Background flush is already cleaning old or expired inodes first but the > > expire time is too far in the future at the time of page reclaim. To mitigate > > future problems, this patch wakes flusher threads to clean 4M of data - > > an amount that should be manageable without causing congestion in many cases. > > > > Ideally, the background flushers would only be cleaning pages belonging > > to the zone being scanned but it's not clear if this would be of benefit > > (less IO) or not (potentially less efficient IO if an inode is scattered > > across multiple zones). > > > > Sigh. We have sooo many problems with writeback and latency. Read > https://bugzilla.kernel.org/show_bug.cgi?id=12309 and weep. Everyone's > running away from the issue and here we are adding code to solve some > alleged stack-overflow problem which seems to be largely a non-problem, > by making changes which may worsen our real problems. > > direct-reclaim wants to write a dirty page because that page is in the > zone which the caller wants to allcoate from! Telling the flusher > threads to perform generic writeback will sometimes cause them to just > gum the disk up with pages from different zones, making it even > harder/slower to allocate a page from the zones we're interested in, > no? > > If/when that happens, the problem will be rare, subtle, will take a > long time to get reported and will take years to understand and fix and > will probably be reported in the monster bug report which everyone's > hiding from anyway. There is that, and then there are issues with the VM simply lying to the filesystems. See https://bugzilla.kernel.org/show_bug.cgi?id=16056 Which basically boils down to the following: kswapd tells the filesystem that it is quite safe to do GFP_KERNEL allocations in pageouts and as part of try_to_release_page(). In the case of pageouts, it does set the 'WB_SYNC_NONE', 'nonblocking' and 'for_reclaim' flags in the writeback_control struct, and so the filesystem has at least some hint that it should do non-blocking i/o. However if you trust the GFP_KERNEL flag in try_to_release_page() then the kernel can and will deadlock, and so I had to add in a hack specifically to tell the NFS client not to trust that flag if it comes from kswapd. Trond -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html