Re: Page Cache writeback too slow, SSD/noop scheduler/ext2

Wu Fengguang <fengguang.wu@xxxxxxxxx> · Wed, 25 Mar 2009 13:26:37 +0800

On Wed, Mar 25, 2009 at 01:48:53AM +1100, Nick Piggin wrote:
> On Monday 23 March 2009 03:53:29 Jos Houtman wrote:
> > On 3/21/09 11:53 AM, "Andrew Morton" <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > On Fri, 20 Mar 2009 19:26:06 +0100 Jos Houtman <jos@xxxxxxxx> wrote:
> > >> Hi,
> > >>
> > >> We have hit a problem where the page-cache writeback algorithm is not
> > >> keeping up.
> > >> When memory gets low this will result in very irregular performance
> > >> drops.
> > >>
> > >> Our setup is as follows:
> > >> 30 x Quad core machine with 64GB ram.
> > >> These are single purpose machines running MySQL.
> > >> Kernel version: 2.6.28.7
> > >> A dedicated SSD drive for the ext2 database partition
> > >> Noop scheduler for the ssd drive.
> > >>
> > >>
> > >> The current hypothesis is as follows:
> > >> The wk_update function does not write enough dirty pages, which allows
> > >> the number of dirty pages to grow to the dirty_background limit.
> > >> When memory is low,  __background_writeout() comes around and
> > >> __forcefully__ writes dirty pages to disk.
> > >> This forced write fills the disk queue and starves read calls that MySQL
> > >> is trying to do: basically killing performance  for a few seconds. This
> > >> pattern repeats as soon as the cleared memory is filled again.
> > >>
> > >> Decreasing the dirty_writeback_centisecs to 100 doesn__t help
> > >>
> > >> I don__t know why this is, but I did some preliminary tracing using
> > >> systemtap and it seems that the majority of times wk_update calls
> > >> decides to do nothing.
> > >>
> > >> Doubling /sys/block/sdb/queue/nr_requests  to 256, seems to help abit: 
> > >> the nr_dirty pages is increasing more slowly.
> > >> But I am unsure of side-effects and am afraid of increasing the
> > >> starvation problem for mysql.
> > >>
> > >>
> > >> I__am very much willing to work on this issue and see it fixed, but
> > >> would like to tap into the knowledge of people here.
> > >> So:
> > >> * Have more people seen this or simular issues?
> > >> * Is the hypothesis above a viable one?
> > >> * Suggestions/pointers for further research and statistics I should
> > >> measure to improve the understanding of this problem.
> > >
> > > I don't think that noop-iosched tries to do anything to prevent
> > > writes-starve-reads.  Do you get better behaviour from any of the other
> > > IO schedulers?
> >
> > I did a quick stress test and cfq does not immediately seem to hurt
> > performance, although some of my colleague's have tested this in the past
> > with the opposite results (which is why we use noop).
> >
> > But despite the scheduler, the real problem is in the writeback algorithm
> > not keeping up.
> > We can grow 600K dirty pages during the day, and only ~300k is flushed to
> > disk during the night hours.
> >
> > While a quick look at the writeback algorithm let me to expect
> > __wk_update()__ to flush ~1024 pages every 5 seconds, which is almost 3GB
> > per hour.  It obviously does not manage to do this in our setup.
> >
> > I don¹t believe the speed of the ssd to be the problem, running sync
> > manually only takes a few minutes to flush 800K dirty pages to disk.
> 
> kupdate surely should just continue to keep trying to write back pages
> so long as there are more old pages to clean, and the queue isn't
> congested. That seems to be the intention anyway: MAX_WRITEBACK_PAGES
> is just the number to write back in a single call, but you see
> nr_to_write is set to the number of dirty pages in the system.
> 
> On your system, what must be happening is more_io is not being set.
> The logic in fs/fs-writeback.c might be busted.

Hi Jos,

I prepared a debugging patch for 2.6.28. (I cannot observe writeback
problems on my local ext2 mount.)

You can view the states of all dirty inodes by doing

        modprobe filecache
        echo ls dirty > /proc/filecache
        cat /proc/filecache

The 'age' field shows (jiffies - inode->dirtied_when), which may also be useful
for debugging Jeff and Ian's case(if it keeps growing, then dirtied_when is stuck).

The detailed dirty writeback traces can be retrieved by doing

        echo 1 > /proc/sys/fs/dirty_debug        
        sleep 6s
        echo 0 > /proc/sys/fs/dirty_debug        
        dmesg

The dmesg trace should help identify the bug in periodic writeback.

Thanks,
Fengguang

--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -27,6 +27,7 @@ extern unsigned long max_mapnr;
 extern unsigned long num_physpages;
 extern void * high_memory;
 extern int page_cluster;
+extern char * const zone_names[];
 
 #ifdef CONFIG_SYSCTL
 extern int sysctl_legacy_va_layout;
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -104,7 +104,7 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z
 
 EXPORT_SYMBOL(totalram_pages);
 
-static char * const zone_names[MAX_NR_ZONES] = {
+char * const zone_names[MAX_NR_ZONES] = {
 #ifdef CONFIG_ZONE_DMA
 	 "DMA",
 #endif
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -1943,7 +1943,10 @@ char *__d_path(const struct path *path, 
 
 		if (dentry == root->dentry && vfsmnt == root->mnt)
 			break;
-		if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
+		if (unlikely(!vfsmnt)) {
+			if (IS_ROOT(dentry))
+				break;
+		} else if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
 			/* Global root? */
 			if (vfsmnt->mnt_parent == vfsmnt) {
 				goto global_root;
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -564,7 +564,6 @@ out:
 }
 EXPORT_SYMBOL(radix_tree_tag_clear);
 
-#ifndef __KERNEL__	/* Only the test harness uses this at present */
 /**
  * radix_tree_tag_get - get a tag on a radix tree node
  * @root:		radix tree root
@@ -627,7 +626,6 @@ int radix_tree_tag_get(struct radix_tree
 	}
 }
 EXPORT_SYMBOL(radix_tree_tag_get);
-#endif
 
 /**
  *	radix_tree_next_hole    -    find the next hole (not-present entry)
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -82,6 +82,10 @@ static struct hlist_head *inode_hashtabl
  */
 DEFINE_SPINLOCK(inode_lock);
 
+EXPORT_SYMBOL(inode_in_use);
+EXPORT_SYMBOL(inode_unused);
+EXPORT_SYMBOL(inode_lock);
+
 /*
  * iprune_mutex provides exclusion between the kswapd or try_to_free_pages
  * icache shrinking path, and the umount path.  Without this exclusion,
@@ -108,6 +112,14 @@ static void wake_up_inode(struct inode *
 	wake_up_bit(&inode->i_state, __I_LOCK);
 }
 
+static inline void inode_created_by(struct inode *inode, struct task_struct *task)
+{
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+	inode->i_cuid = task->uid;
+	memcpy(inode->i_comm, task->comm, sizeof(task->comm));
+#endif
+}
+
 static struct inode *alloc_inode(struct super_block *sb)
 {
 	static const struct address_space_operations empty_aops;
@@ -142,7 +154,7 @@ static struct inode *alloc_inode(struct 
 		inode->i_bdev = NULL;
 		inode->i_cdev = NULL;
 		inode->i_rdev = 0;
-		inode->dirtied_when = 0;
+		inode->dirtied_when = jiffies;
 		if (security_inode_alloc(inode)) {
 			if (inode->i_sb->s_op->destroy_inode)
 				inode->i_sb->s_op->destroy_inode(inode);
@@ -183,6 +195,7 @@ static struct inode *alloc_inode(struct 
 		}
 		inode->i_private = NULL;
 		inode->i_mapping = mapping;
+		inode_created_by(inode, current);
 	}
 	return inode;
 }
@@ -247,6 +260,8 @@ void __iget(struct inode * inode)
 	inodes_stat.nr_unused--;
 }
 
+EXPORT_SYMBOL(__iget);
+
 /**
  * clear_inode - clear an inode
  * @inode: inode to clear
@@ -1353,6 +1368,16 @@ void inode_double_unlock(struct inode *i
 }
 EXPORT_SYMBOL(inode_double_unlock);
 
+
+struct hlist_head * get_inode_hash_budget(unsigned long index)
+{
+       if (index >= (1 << i_hash_shift))
+               return NULL;
+
+       return inode_hashtable + index;
+}
+EXPORT_SYMBOL_GPL(get_inode_hash_budget);
+
 static __initdata unsigned long ihash_entries;
 static int __init set_ihash_entries(char *str)
 {
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -45,6 +45,9 @@
 LIST_HEAD(super_blocks);
 DEFINE_SPINLOCK(sb_lock);
 
+EXPORT_SYMBOL(super_blocks);
+EXPORT_SYMBOL(sb_lock);
+
 /**
  *	alloc_super	-	create new superblock
  *	@type:	filesystem type superblock should belong to
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -230,6 +230,7 @@ unsigned long shrink_slab(unsigned long 
 	up_read(&shrinker_rwsem);
 	return ret;
 }
+EXPORT_SYMBOL(shrink_slab);
 
 /* Called without lock on whether page is mapped, so answer is unstable */
 static inline int page_mapping_inuse(struct page *page)
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -44,6 +44,7 @@ struct address_space swapper_space = {
 	.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
 	.backing_dev_info = &swap_backing_dev_info,
 };
+EXPORT_SYMBOL_GPL(swapper_space);
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
 
--- linux-2.6.orig/Documentation/filesystems/proc.txt
+++ linux-2.6/Documentation/filesystems/proc.txt
@@ -266,6 +266,7 @@ Table 1-4: Kernel info in /proc
  driver	     Various drivers grouped here, currently rtc (2.4)
  execdomains Execdomains, related to security			(2.4)
  fb	     Frame Buffer devices				(2.4)
+ filecache   Query/drop in-memory file cache
  fs	     File system parameters, currently nfs/exports	(2.4)
  ide         Directory containing info about the IDE subsystem 
  interrupts  Interrupt usage                                   
@@ -456,6 +457,88 @@ varies by architecture and compile optio
 
 > cat /proc/meminfo
 
+..............................................................................
+
+filecache:
+
+Provides access to the in-memory file cache.
+
+To list an index of all cached files:
+
+    echo ls > /proc/filecache
+    cat /proc/filecache
+
+The output looks like:
+
+    # filecache 1.0
+    #      ino       size   cached cached%  state   refcnt  dev             file
+       1026334         91       92    100   --      66      03:02(hda2)     /lib/ld-2.3.6.so
+        233608       1242      972     78   --      66      03:02(hda2)     /lib/tls/libc-2.3.6.so
+         65203        651      476     73   --      1       03:02(hda2)     /bin/bash
+       1026445        261      160     61   --      10      03:02(hda2)     /lib/libncurses.so.5.5
+        235427         10       12    100   --      44      03:02(hda2)     /lib/tls/libdl-2.3.6.so
+
+FIELD	INTRO
+---------------------------------------------------------------------------
+ino	inode number
+size	inode size in KB
+cached	cached size in KB
+cached%	percent of file data cached
+state1	'-' clean; 'd' metadata dirty; 'D' data dirty
+state2	'-' unlocked; 'L' locked, normally indicates file being written out
+refcnt	file reference count, it's an in-kernel one, not exactly open count
+dev	major:minor numbers in hex, followed by a descriptive device name
+file	file path _inside_ the filesystem. There are several special names:
+	'(noname)':	the file name is not available
+	'(03:02)':	the file is a block device file of major:minor
+	'...(deleted)': the named file has been deleted from the disk
+
+To list the cached pages of a perticular file:
+
+    echo /bin/bash > /proc/filecache
+    cat /proc/filecache
+
+    # file /bin/bash
+    # flags R:referenced A:active U:uptodate D:dirty W:writeback M:mmap
+    # idx   len     state   refcnt
+    0       36      RAU__M  3
+    36      1       RAU__M  2
+    37      8       RAU__M  3
+    45      2       RAU___  1
+    47      6       RAU__M  3
+    53      3       RAU__M  2
+    56      2       RAU__M  3
+
+FIELD	INTRO
+----------------------------------------------------------------------------
+idx	page index
+len	number of pages which are cached and share the same state
+state	page state of the flags listed in line two
+refcnt	page reference count
+
+Careful users may notice that the file name to be queried is remembered between
+commands. Internally, the module has a global variable to store the file name
+parameter, so that it can be inherited by newly opened /proc/filecache file.
+However it can lead to interference for multiple queriers. The solution here
+is to obey a rule: only root can interactively change the file name parameter;
+normal users must go for scripts to access the interface. Scripts should do it
+by following the code example below:
+
+    filecache = open("/proc/filecache", "rw");
+    # avoid polluting the global parameter filename
+    filecache.write("set private");
+
+To instruct the kernel to drop clean caches, dentries and inodes from memory,
+causing that memory to become free:
+
+    # drop clean file data cache (i.e. file backed pagecache)
+    echo drop pagecache > /proc/filecache
+
+    # drop clean file metadata cache (i.e. dentries and inodes)
+    echo drop slabcache > /proc/filecache
+
+Note that the drop commands are non-destructive operations and dirty objects
+are not freeable, the user should run `sync' first.
 
 MemTotal:     16344972 kB
 MemFree:      13634064 kB
--- /dev/null
+++ linux-2.6/fs/proc/filecache.c
@@ -0,0 +1,1046 @@
+/*
+ * fs/proc/filecache.c
+ *
+ * Copyright (C) 2006, 2007 Fengguang Wu <wfg@xxxxxxxxxxxxxxxx>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/radix-tree.h>
+#include <linux/page-flags.h>
+#include <linux/pagevec.h>
+#include <linux/pagemap.h>
+#include <linux/vmalloc.h>
+#include <linux/writeback.h>
+#include <linux/buffer_head.h>
+#include <linux/parser.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/module.h>
+#include <asm/uaccess.h>
+
+/*
+ * Increase minor version when new columns are added;
+ * Increase major version when existing columns are changed.
+ */
+#define FILECACHE_VERSION	"1.0"
+
+/* Internal buffer sizes. The larger the more effcient. */
+#define SBUF_SIZE	(128<<10)
+#define IWIN_PAGE_ORDER	3
+#define IWIN_SIZE	((PAGE_SIZE<<IWIN_PAGE_ORDER) / sizeof(struct inode *))
+
+/*
+ * Session management.
+ *
+ * Each opened /proc/filecache file is assiocated with a session object.
+ * Also there is a global_session that maintains status across open()/close()
+ * (i.e. the lifetime of an opened file), so that a casual user can query the
+ * filecache via _multiple_ simple shell commands like
+ * 'echo cat /bin/bash > /proc/filecache; cat /proc/filecache'.
+ *
+ * session.query_file is the file whose cache info is to be queried.
+ * Its value determines what we get on read():
+ * 	- NULL: ii_*() called to show the inode index
+ * 	- filp: pg_*() called to show the page groups of a filp
+ *
+ * session.query_file is
+ * 	- cloned from global_session.query_file on open();
+ * 	- updated on write("cat filename");
+ * 	  note that the new file will also be saved in global_session.query_file if
+ * 	  session.private_session is false.
+ */
+
+struct session {
+	/* options */
+	int		private_session;
+	unsigned long	ls_options;
+	dev_t		ls_dev;
+
+	/* parameters */
+	struct file	*query_file;
+
+	/* seqfile pos */
+	pgoff_t		start_offset;
+	pgoff_t		next_offset;
+
+	/* inode at last pos */
+	struct {
+		unsigned long pos;
+		unsigned long state;
+		struct inode *inode;
+		struct inode *pinned_inode;
+	} ipos;
+
+	/* inode window */
+	struct {
+		unsigned long cursor;
+		unsigned long origin;
+		unsigned long size;
+		struct inode **inodes;
+	} iwin;
+};
+
+static struct session global_session;
+
+/*
+ * Session address is stored in proc_file->f_ra.start:
+ * we assume that there will be no readahead for proc_file.
+ */
+static struct session *get_session(struct file *proc_file)
+{
+	return (struct session *)proc_file->f_ra.start;
+}
+
+static void set_session(struct file *proc_file, struct session *s)
+{
+	BUG_ON(proc_file->f_ra.start);
+	proc_file->f_ra.start = (unsigned long)s;
+}
+
+static void update_global_file(struct session *s)
+{
+	if (s->private_session)
+		return;
+
+	if (global_session.query_file)
+		fput(global_session.query_file);
+
+	global_session.query_file = s->query_file;
+
+	if (global_session.query_file)
+		get_file(global_session.query_file);
+}
+
+/*
+ * Cases of the name:
+ * 1) NULL                (new session)
+ * 	s->query_file = global_session.query_file = 0;
+ * 2) ""                  (ls/la)
+ * 	s->query_file = global_session.query_file;
+ * 3) a regular file name (cat newfile)
+ * 	s->query_file = global_session.query_file = newfile;
+ */
+static int session_update_file(struct session *s, char *name)
+{
+	static DEFINE_MUTEX(mutex); /* protects global_session.query_file */
+	int err = 0;
+
+	mutex_lock(&mutex);
+
+	/*
+	 * We are to quit, or to list the cached files.
+	 * Reset *.query_file.
+	 */
+	if (!name) {
+		if (s->query_file) {
+			fput(s->query_file);
+			s->query_file = NULL;
+		}
+		update_global_file(s);
+		goto out;
+	}
+
+	/*
+	 * This is a new session.
+	 * Inherit options/parameters from global ones.
+	 */
+	if (name[0] == '\0') {
+		*s = global_session;
+		if (s->query_file)
+			get_file(s->query_file);
+		goto out;
+	}
+
+	/*
+	 * Open the named file.
+	 */
+	if (s->query_file)
+		fput(s->query_file);
+	s->query_file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
+	if (IS_ERR(s->query_file)) {
+		err = PTR_ERR(s->query_file);
+		s->query_file = NULL;
+	} else
+		update_global_file(s);
+
+out:
+	mutex_unlock(&mutex);
+
+	return err;
+}
+
+static struct session *session_create(void)
+{
+	struct session *s;
+	int err = 0;
+
+	s = kmalloc(sizeof(*s), GFP_KERNEL);
+	if (s)
+		err = session_update_file(s, "");
+	else
+		err = -ENOMEM;
+
+	return err ? ERR_PTR(err) : s;
+}
+
+static void session_release(struct session *s)
+{
+	if (s->ipos.pinned_inode)
+		iput(s->ipos.pinned_inode);
+	if (s->query_file)
+		fput(s->query_file);
+	kfree(s);
+}
+
+
+/*
+ * Listing of cached files.
+ *
+ * Usage:
+ * 		echo > /proc/filecache  # enter listing mode
+ * 		cat /proc/filecache     # get the file listing
+ */
+
+/* code style borrowed from ib_srp.c */
+enum {
+	LS_OPT_ERR	=	0,
+	LS_OPT_DIRTY	=	1 << 0,
+	LS_OPT_CLEAN	=	1 << 1,
+	LS_OPT_INUSE	=	1 << 2,
+	LS_OPT_EMPTY	=	1 << 3,
+	LS_OPT_ALL	=	1 << 4,
+	LS_OPT_DEV	=	1 << 5,
+};
+
+static match_table_t ls_opt_tokens = {
+	{ LS_OPT_DIRTY,		"dirty" 	},
+	{ LS_OPT_CLEAN,		"clean" 	},
+	{ LS_OPT_INUSE,		"inuse" 	},
+	{ LS_OPT_EMPTY,		"empty"		},
+	{ LS_OPT_ALL,		"all" 		},
+	{ LS_OPT_DEV,		"dev=%s"	},
+	{ LS_OPT_ERR,		NULL 		}
+};
+
+static int ls_parse_options(const char *buf, struct session *s)
+{
+	substring_t args[MAX_OPT_ARGS];
+	char *options, *sep_opt;
+	char *p;
+	int token;
+	int ret = 0;
+
+	if (!buf)
+		return 0;
+	options = kstrdup(buf, GFP_KERNEL);
+	if (!options)
+		return -ENOMEM;
+
+	s->ls_options = 0;
+	sep_opt = options;
+	while ((p = strsep(&sep_opt, " ")) != NULL) {
+		if (!*p)
+			continue;
+
+		token = match_token(p, ls_opt_tokens, args);
+
+		switch (token) {
+		case LS_OPT_DIRTY:
+		case LS_OPT_CLEAN:
+		case LS_OPT_INUSE:
+		case LS_OPT_EMPTY:
+		case LS_OPT_ALL:
+			s->ls_options |= token;
+			break;
+		case LS_OPT_DEV:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			if (*p == '/') {
+				struct kstat stat;
+				struct nameidata nd;
+				ret = path_lookup(p, LOOKUP_FOLLOW, &nd);
+				if (!ret)
+					ret = vfs_getattr(nd.path.mnt,
+							  nd.path.dentry, &stat);
+				if (!ret)
+					s->ls_dev = stat.rdev;
+			} else
+				s->ls_dev = simple_strtoul(p, NULL, 0);
+			/* printk("%lx %s\n", (long)s->ls_dev, p); */
+			kfree(p);
+			break;
+
+		default:
+			printk(KERN_WARNING "unknown parameter or missing value "
+			       "'%s' in ls command\n", p);
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+out:
+	kfree(options);
+	return ret;
+}
+
+/*
+ * Add possible filters here.
+ * No permission check: we cannot verify the path's permission anyway.
+ * We simply demand root previledge for accessing /proc/filecache.
+ */
+static int may_show_inode(struct session *s, struct inode *inode)
+{
+	if (!atomic_read(&inode->i_count))
+		return 0;
+	if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+		return 0;
+	if (!inode->i_mapping)
+		return 0;
+
+	if (s->ls_dev && s->ls_dev != inode->i_sb->s_dev)
+		return 0;
+
+	if (s->ls_options & LS_OPT_ALL)
+		return 1;
+
+	if (!(s->ls_options & LS_OPT_EMPTY) && !inode->i_mapping->nrpages)
+		return 0;
+
+	if ((s->ls_options & LS_OPT_DIRTY) && !(inode->i_state & I_DIRTY))
+		return 0;
+
+	if ((s->ls_options & LS_OPT_CLEAN) && (inode->i_state & I_DIRTY))
+		return 0;
+
+	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
+	      S_ISLNK(inode->i_mode) || S_ISBLK(inode->i_mode)))
+		return 0;
+
+	return 1;
+}
+
+/*
+ * Full: there are more data following.
+ */
+static int iwin_full(struct session *s)
+{
+	return !s->iwin.cursor ||
+		s->iwin.cursor > s->iwin.origin + s->iwin.size;
+}
+
+static int iwin_push(struct session *s, struct inode *inode)
+{
+	if (!may_show_inode(s, inode))
+		return 0;
+
+	s->iwin.cursor++;
+
+	if (s->iwin.size >= IWIN_SIZE)
+		return 1;
+
+	if (s->iwin.cursor > s->iwin.origin)
+		s->iwin.inodes[s->iwin.size++] = inode;
+	return 0;
+}
+
+/*
+ * Travease the inode lists in order - newest first.
+ * And fill @s->iwin.inodes with inodes positioned in [@pos, @pos+IWIN_SIZE).
+ */
+static int iwin_fill(struct session *s, unsigned long pos)
+{
+	struct inode *inode;
+	struct super_block *sb;
+
+	s->iwin.origin = pos;
+	s->iwin.cursor = 0;
+	s->iwin.size = 0;
+
+	/*
+	 * We have a cursor inode, clean and expected to be unchanged.
+	 */
+	if (s->ipos.inode && pos >= s->ipos.pos &&
+			!(s->ipos.state & I_DIRTY) &&
+			s->ipos.state == s->ipos.inode->i_state) {
+		inode = s->ipos.inode;
+		s->iwin.cursor = s->ipos.pos;
+		goto continue_from_saved;
+	}
+
+	if (s->ls_options & LS_OPT_CLEAN)
+		goto clean_inodes;
+
+	spin_lock(&sb_lock);
+	list_for_each_entry(sb, &super_blocks, s_list) {
+		if (s->ls_dev && s->ls_dev != sb->s_dev)
+			continue;
+
+		list_for_each_entry(inode, &sb->s_dirty, i_list) {
+			if (iwin_push(s, inode))
+				goto out_full_unlock;
+		}
+		list_for_each_entry(inode, &sb->s_io, i_list) {
+			if (iwin_push(s, inode))
+				goto out_full_unlock;
+		}
+	}
+	spin_unlock(&sb_lock);
+
+clean_inodes:
+	list_for_each_entry(inode, &inode_in_use, i_list) {
+		if (iwin_push(s, inode))
+			goto out_full;
+continue_from_saved:
+		;
+	}
+
+	if (s->ls_options & LS_OPT_INUSE)
+		return 0;
+
+	list_for_each_entry(inode, &inode_unused, i_list) {
+		if (iwin_push(s, inode))
+			goto out_full;
+	}
+
+	return 0;
+
+out_full_unlock:
+	spin_unlock(&sb_lock);
+out_full:
+	return 1;
+}
+
+static struct inode *iwin_inode(struct session *s, unsigned long pos)
+{
+	if ((iwin_full(s) && pos >= s->iwin.origin + s->iwin.size)
+			  || pos < s->iwin.origin)
+		iwin_fill(s, pos);
+
+	if (pos >= s->iwin.cursor)
+		return NULL;
+
+	s->ipos.pos = pos;
+	s->ipos.inode = s->iwin.inodes[pos - s->iwin.origin];
+	BUG_ON(!s->ipos.inode);
+	return s->ipos.inode;
+}
+
+static void show_inode(struct seq_file *m, struct inode *inode)
+{
+	char state[] = "--"; /* dirty, locked */
+	struct dentry *dentry;
+	loff_t size = i_size_read(inode);
+	unsigned long nrpages;
+	int percent;
+	int refcnt;
+	int shift;
+
+	if (!size)
+		size++;
+
+	if (inode->i_mapping)
+		nrpages = inode->i_mapping->nrpages;
+	else {
+		nrpages = 0;
+		WARN_ON(1);
+	}
+
+	for (shift = 0; (size >> shift) > ULONG_MAX / 128; shift += 12)
+		;
+	percent = min(100UL, (((100 * nrpages) >> shift) << PAGE_CACHE_SHIFT) /
+						(unsigned long)(size >> shift));
+
+	if (inode->i_state & (I_DIRTY_DATASYNC|I_DIRTY_PAGES))
+		state[0] = 'D';
+	else if (inode->i_state & I_DIRTY_SYNC)
+		state[0] = 'd';
+
+	if (inode->i_state & I_LOCK)
+		state[0] = 'L';
+
+	refcnt = 0;
+	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
+		refcnt += atomic_read(&dentry->d_count);
+	}
+
+	seq_printf(m, "%10lu %10llu %8lu %7d ",
+			inode->i_ino,
+			DIV_ROUND_UP(size, 1024),
+			nrpages << (PAGE_CACHE_SHIFT - 10),
+			percent);
+
+	seq_printf(m, "%6d %5s %9lu ",
+			refcnt,
+			state,
+			(jiffies - inode->dirtied_when) / HZ);
+
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+	seq_printf(m, "%8u %5u %-16s",
+			inode->i_access_count,
+			inode->i_cuid,
+			inode->i_comm);
+#endif
+
+	seq_printf(m, "%02x:%02x(%s)\t",
+			MAJOR(inode->i_sb->s_dev),
+			MINOR(inode->i_sb->s_dev),
+			inode->i_sb->s_id);
+
+	if (list_empty(&inode->i_dentry)) {
+		if (!atomic_read(&inode->i_count))
+			seq_puts(m, "(noname)\n");
+		else
+			seq_printf(m, "(%02x:%02x)\n",
+					imajor(inode), iminor(inode));
+	} else {
+		struct path path = {
+			.mnt = NULL,
+			.dentry = list_entry(inode->i_dentry.next,
+					     struct dentry, d_alias)
+		};
+
+		seq_path(m, &path, " \t\n\\");
+		seq_putc(m, '\n');
+	}
+}
+
+static int ii_show(struct seq_file *m, void *v)
+{
+	unsigned long index = *(loff_t *) v;
+	struct session *s = m->private;
+        struct inode *inode;
+
+	if (index == 0) {
+		seq_puts(m, "# filecache " FILECACHE_VERSION "\n");
+		seq_puts(m, "#      ino       size   cached cached% "
+				"refcnt state       age "
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+				"accessed   uid process         "
+#endif
+				"dev\t\tfile\n");
+	}
+
+        inode = iwin_inode(s,index);
+	show_inode(m, inode);
+
+	return 0;
+}
+
+static void *ii_start(struct seq_file *m, loff_t *pos)
+{
+	struct session *s = m->private;
+
+	s->iwin.size = 0;
+	s->iwin.inodes = (struct inode **)
+				__get_free_pages( GFP_KERNEL, IWIN_PAGE_ORDER);
+	if (!s->iwin.inodes)
+		return NULL;
+
+	spin_lock(&inode_lock);
+
+	return iwin_inode(s, *pos) ? pos : NULL;
+}
+
+static void *ii_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct session *s = m->private;
+
+	(*pos)++;
+	return iwin_inode(s, *pos) ? pos : NULL;
+}
+
+static void ii_stop(struct seq_file *m, void *v)
+{
+	struct session *s = m->private;
+	struct inode *inode = s->ipos.inode;
+
+	if (!s->iwin.inodes)
+		return;
+
+	if (inode) {
+		__iget(inode);
+		s->ipos.state = inode->i_state;
+	}
+	spin_unlock(&inode_lock);
+
+	free_pages((unsigned long) s->iwin.inodes, IWIN_PAGE_ORDER);
+	if (s->ipos.pinned_inode)
+		iput(s->ipos.pinned_inode);
+	s->ipos.pinned_inode = inode;
+}
+
+/*
+ * Listing of cached page ranges of a file.
+ *
+ * Usage:
+ * 		echo 'file name' > /proc/filecache
+ * 		cat /proc/filecache
+ */
+
+unsigned long page_mask;
+#define PG_MMAP		PG_lru		/* reuse any non-relevant flag */
+#define PG_BUFFER	PG_swapcache	/* ditto */
+#define PG_DIRTY	PG_error	/* ditto */
+#define PG_WRITEBACK	PG_buddy	/* ditto */
+
+/*
+ * Page state names, prefixed by their abbreviations.
+ */
+struct {
+	unsigned long	mask;
+	const char     *name;
+	int		faked;
+} page_flag [] = {
+	{1 << PG_referenced,	"R:referenced",	0},
+	{1 << PG_active,	"A:active",	0},
+	{1 << PG_MMAP,		"M:mmap",	1},
+
+	{1 << PG_uptodate,	"U:uptodate",	0},
+	{1 << PG_dirty,		"D:dirty",	0},
+	{1 << PG_writeback,	"W:writeback",	0},
+	{1 << PG_reclaim,	"X:readahead",	0},
+
+	{1 << PG_private,	"P:private",	0},
+	{1 << PG_owner_priv_1,	"O:owner",	0},
+
+	{1 << PG_BUFFER,	"b:buffer",	1},
+	{1 << PG_DIRTY,		"d:dirty",	1},
+	{1 << PG_WRITEBACK,	"w:writeback",	1},
+};
+
+static unsigned long page_flags(struct page* page)
+{
+	unsigned long flags;
+	struct address_space *mapping = page_mapping(page);
+
+	flags = page->flags & page_mask;
+
+	if (page_mapped(page))
+		flags |= (1 << PG_MMAP);
+
+	if (page_has_buffers(page))
+		flags |= (1 << PG_BUFFER);
+
+	if (mapping) {
+		if (radix_tree_tag_get(&mapping->page_tree,
+					page_index(page),
+					PAGECACHE_TAG_WRITEBACK))
+			flags |= (1 << PG_WRITEBACK);
+
+		if (radix_tree_tag_get(&mapping->page_tree,
+					page_index(page),
+					PAGECACHE_TAG_DIRTY))
+			flags |= (1 << PG_DIRTY);
+	}
+
+	return flags;
+}
+
+static int pages_similiar(struct page* page0, struct page* page)
+{
+	if (page_count(page0) != page_count(page))
+		return 0;
+
+	if (page_flags(page0) != page_flags(page))
+		return 0;
+
+	return 1;
+}
+
+static void show_range(struct seq_file *m, struct page* page, unsigned long len)
+{
+	int i;
+	unsigned long flags;
+
+	if (!m || !page)
+		return;
+
+	seq_printf(m, "%lu\t%lu\t", page->index, len);
+
+	flags = page_flags(page);
+	for (i = 0; i < ARRAY_SIZE(page_flag); i++)
+		seq_putc(m, (flags & page_flag[i].mask) ?
+					page_flag[i].name[0] : '_');
+
+	seq_printf(m, "\t%d\n", page_count(page));
+}
+
+#define BATCH_LINES	100
+static pgoff_t show_file_cache(struct seq_file *m,
+				struct address_space *mapping, pgoff_t start)
+{
+	int i;
+	int lines = 0;
+	pgoff_t len = 0;
+	struct pagevec pvec;
+	struct page *page;
+	struct page *page0 = NULL;
+
+	for (;;) {
+		pagevec_init(&pvec, 0);
+		pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
+				(void **)pvec.pages, start + len, PAGEVEC_SIZE);
+
+		if (pvec.nr == 0) {
+			show_range(m, page0, len);
+			start = ULONG_MAX;
+			goto out;
+		}
+
+		if (!page0)
+			page0 = pvec.pages[0];
+
+		for (i = 0; i < pvec.nr; i++) {
+			page = pvec.pages[i];
+
+			if (page->index == start + len &&
+					pages_similiar(page0, page))
+				len++;
+			else {
+				show_range(m, page0, len);
+				page0 = page;
+				start = page->index;
+				len = 1;
+				if (++lines > BATCH_LINES)
+					goto out;
+			}
+		}
+	}
+
+out:
+	return start;
+}
+
+static int pg_show(struct seq_file *m, void *v)
+{
+	struct session *s = m->private;
+	struct file *file = s->query_file;
+	pgoff_t offset;
+
+	if (!file)
+		return ii_show(m, v);
+
+	offset = *(loff_t *) v;
+
+	if (!offset) { /* print header */
+		int i;
+
+		seq_puts(m, "# file ");
+		seq_path(m, &file->f_path, " \t\n\\");
+
+		seq_puts(m, "\n# flags");
+		for (i = 0; i < ARRAY_SIZE(page_flag); i++)
+			seq_printf(m, " %s", page_flag[i].name);
+
+		seq_puts(m, "\n# idx\tlen\tstate\t\trefcnt\n");
+	}
+
+	s->start_offset = offset;
+	s->next_offset = show_file_cache(m, file->f_mapping, offset);
+
+	return 0;
+}
+
+static void *file_pos(struct file *file, loff_t *pos)
+{
+	loff_t size = i_size_read(file->f_mapping->host);
+	pgoff_t end = DIV_ROUND_UP(size, PAGE_CACHE_SIZE);
+	pgoff_t offset = *pos;
+
+	return offset < end ? pos : NULL;
+}
+
+static void *pg_start(struct seq_file *m, loff_t *pos)
+{
+	struct session *s = m->private;
+	struct file *file = s->query_file;
+	pgoff_t offset = *pos;
+
+	if (!file)
+		return ii_start(m, pos);
+
+	rcu_read_lock();
+
+	if (offset - s->start_offset == 1)
+		*pos = s->next_offset;
+	return file_pos(file, pos);
+}
+
+static void *pg_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct session *s = m->private;
+	struct file *file = s->query_file;
+
+	if (!file)
+		return ii_next(m, v, pos);
+
+	*pos = s->next_offset;
+	return file_pos(file, pos);
+}
+
+static void pg_stop(struct seq_file *m, void *v)
+{
+	struct session *s = m->private;
+	struct file *file = s->query_file;
+
+	if (!file)
+		return ii_stop(m, v);
+
+	rcu_read_unlock();
+}
+
+struct seq_operations seq_filecache_op = {
+	.start	= pg_start,
+	.next	= pg_next,
+	.stop	= pg_stop,
+	.show	= pg_show,
+};
+
+/*
+ * Implement the manual drop-all-pagecache function
+ */
+
+#define MAX_INODES	(PAGE_SIZE / sizeof(struct inode *))
+static int drop_pagecache(void)
+{
+	struct hlist_head *head;
+	struct hlist_node *node;
+	struct inode *inode;
+	struct inode **inodes;
+	unsigned long i, j, k;
+	int err = 0;
+
+	inodes = (struct inode **)__get_free_pages(GFP_KERNEL, IWIN_PAGE_ORDER);
+	if (!inodes)
+		return -ENOMEM;
+
+	for (i = 0; (head = get_inode_hash_budget(i)); i++) {
+		if (hlist_empty(head))
+			continue;
+
+		j = 0;
+		cond_resched();
+
+		/*
+		 * Grab some inodes.
+		 */
+		spin_lock(&inode_lock);
+		hlist_for_each (node, head) {
+			inode = hlist_entry(node, struct inode, i_hash);
+			if (!atomic_read(&inode->i_count))
+				continue;
+			if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+				continue;
+			if (!inode->i_mapping || !inode->i_mapping->nrpages)
+				continue;
+			__iget(inode);
+			inodes[j++] = inode;
+			if (j >= MAX_INODES)
+				break;
+		}
+		spin_unlock(&inode_lock);
+
+		/*
+		 * Free clean pages.
+		 */
+		for (k = 0; k < j; k++) {
+			inode = inodes[k];
+			invalidate_mapping_pages(inode->i_mapping, 0, ~1);
+			iput(inode);
+		}
+
+		/*
+		 * Simply ignore the remaining inodes.
+		 */
+		if (j >= MAX_INODES && !err) {
+			printk(KERN_WARNING
+				"Too many collides in inode hash table.\n"
+				"Pls boot with a larger ihash_entries=XXX.\n");
+			err = -EAGAIN;
+		}
+	}
+
+	free_pages((unsigned long) inodes, IWIN_PAGE_ORDER);
+	return err;
+}
+
+static void drop_slabcache(void)
+{
+	int nr_objects;
+
+	do {
+		nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
+	} while (nr_objects > 10);
+}
+
+/*
+ * Proc file operations.
+ */
+
+static int filecache_open(struct inode *inode, struct file *proc_file)
+{
+	struct seq_file *m;
+	struct session *s;
+	unsigned size;
+	char *buf = 0;
+	int ret;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENOENT;
+
+	s = session_create();
+	if (IS_ERR(s)) {
+		ret = PTR_ERR(s);
+		goto out;
+	}
+	set_session(proc_file, s);
+
+	size = SBUF_SIZE;
+	buf = kmalloc(size, GFP_KERNEL);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = seq_open(proc_file, &seq_filecache_op);
+	if (!ret) {
+		m = proc_file->private_data;
+		m->private = s;
+		m->buf = buf;
+		m->size = size;
+	}
+
+out:
+	if (ret) {
+		kfree(s);
+		kfree(buf);
+		module_put(THIS_MODULE);
+	}
+	return ret;
+}
+
+static int filecache_release(struct inode *inode, struct file *proc_file)
+{
+	struct session *s = get_session(proc_file);
+	int ret;
+
+	session_release(s);
+	ret = seq_release(inode, proc_file);
+	module_put(THIS_MODULE);
+	return ret;
+}
+
+ssize_t filecache_write(struct file *proc_file, const char __user * buffer,
+			size_t count, loff_t *ppos)
+{
+	struct session *s;
+	char *name;
+	int err = 0;
+
+	if (count >= PATH_MAX + 5)
+		return -ENAMETOOLONG;
+
+	name = kmalloc(count+1, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+
+	if (copy_from_user(name, buffer, count)) {
+		err = -EFAULT;
+		goto out;
+	}
+
+	/* strip the optional newline */
+	if (count && name[count-1] == '\n')
+		name[count-1] = '\0';
+	else
+		name[count] = '\0';
+
+	s = get_session(proc_file);
+	if (!strcmp(name, "set private")) {
+		s->private_session = 1;
+		goto out;
+	}
+
+	if (!strncmp(name, "cat ", 4)) {
+		err = session_update_file(s, name+4);
+		goto out;
+	}
+
+	if (!strncmp(name, "ls", 2)) {
+		err = session_update_file(s, NULL);
+		if (!err)
+			err = ls_parse_options(name+2, s);
+		if (!err && !s->private_session) {
+			global_session.ls_dev = s->ls_dev;
+			global_session.ls_options = s->ls_options;
+		}
+		goto out;
+	}
+
+	if (!strncmp(name, "drop pagecache", 14)) {
+		err = drop_pagecache();
+		goto out;
+	}
+
+	if (!strncmp(name, "drop slabcache", 14)) {
+		drop_slabcache();
+		goto out;
+	}
+
+	/* err = -EINVAL; */
+	err = session_update_file(s, name);
+
+out:
+	kfree(name);
+
+	return err ? err : count;
+}
+
+static struct file_operations proc_filecache_fops = {
+	.owner		= THIS_MODULE,
+	.open		= filecache_open,
+	.release	= filecache_release,
+	.write		= filecache_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+};
+
+
+static __init int filecache_init(void)
+{
+	int i;
+	struct proc_dir_entry *entry;
+
+	entry = create_proc_entry("filecache", 0600, NULL);
+	if (entry)
+		entry->proc_fops = &proc_filecache_fops;
+
+	for (page_mask = i = 0; i < ARRAY_SIZE(page_flag); i++)
+		if (!page_flag[i].faked)
+			page_mask |= page_flag[i].mask;
+
+	return 0;
+}
+
+static void filecache_exit(void)
+{
+	remove_proc_entry("filecache", NULL);
+	if (global_session.query_file)
+		fput(global_session.query_file);
+}
+
+MODULE_AUTHOR("Fengguang Wu <wfg@xxxxxxxxxxxxxxxx>");
+MODULE_LICENSE("GPL");
+
+module_init(filecache_init);
+module_exit(filecache_exit);
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -685,6 +685,12 @@ struct inode {
 	void			*i_security;
 #endif
 	void			*i_private; /* fs or device private pointer */
+
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+	unsigned int		i_access_count;	/* opened how many times? */
+	uid_t			i_cuid;		/* opened first by which user? */
+	char			i_comm[16];	/* opened first by which app? */
+#endif
 };
 
 /*
@@ -773,6 +779,13 @@ static inline unsigned imajor(const stru
 	return MAJOR(inode->i_rdev);
 }
 
+static inline void inode_accessed(struct inode *inode)
+{
+#ifdef CONFIG_PROC_FILECACHE_EXTRAS
+	inode->i_access_count++;
+#endif
+}
+
 extern struct block_device *I_BDEV(struct inode *inode);
 
 struct fown_struct {
@@ -1907,6 +1920,7 @@ extern void remove_inode_hash(struct ino
 static inline void insert_inode_hash(struct inode *inode) {
 	__insert_inode_hash(inode, inode->i_ino);
 }
+struct hlist_head * get_inode_hash_budget(unsigned long index);
 
 extern struct file * get_empty_filp(void);
 extern void file_move(struct file *f, struct list_head *list);
--- linux-2.6.orig/fs/open.c
+++ linux-2.6/fs/open.c
@@ -828,6 +828,7 @@ static struct file *__dentry_open(struct
 			goto cleanup_all;
 	}
 
+	inode_accessed(inode);
 	f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
 
 	file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
--- linux-2.6.orig/fs/Kconfig
+++ linux-2.6/fs/Kconfig
@@ -750,6 +750,36 @@ config CONFIGFS_FS
 	  Both sysfs and configfs can and should exist together on the
 	  same system. One is not a replacement for the other.
 
+config PROC_FILECACHE
+	tristate "/proc/filecache support"
+	default m
+	depends on PROC_FS
+	help
+	  This option creates a file /proc/filecache which enables one to
+	  query/drop the cached files in memory.
+
+	  A quick start guide:
+
+	  # echo 'ls' > /proc/filecache
+	  # head /proc/filecache
+
+	  # echo 'cat /bin/bash' > /proc/filecache
+	  # head /proc/filecache
+
+	  # echo 'drop pagecache' > /proc/filecache
+	  # echo 'drop slabcache' > /proc/filecache
+
+	  For more details, please check Documentation/filesystems/proc.txt .
+
+	  It can be a handy tool for sysadms and desktop users.
+
+config PROC_FILECACHE_EXTRAS
+	bool "track extra states"
+	default y
+	depends on PROC_FILECACHE
+	help
+	  Track extra states that costs a little more time/space.
+
 endmenu
 
 menu "Miscellaneous filesystems"
--- linux-2.6.orig/fs/proc/Makefile
+++ linux-2.6/fs/proc/Makefile
@@ -2,7 +2,8 @@
 # Makefile for the Linux proc filesystem routines.
 #
 
-obj-$(CONFIG_PROC_FS) += proc.o
+obj-$(CONFIG_PROC_FS)		+= proc.o
+obj-$(CONFIG_PROC_FILECACHE)	+= filecache.o
 
 proc-y			:= nommu.o task_nommu.o
 proc-$(CONFIG_MMU)	:= mmu.o task_mmu.o
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -23,9 +23,66 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/buffer_head.h>
+#include <linux/pagevec.h>
 #include "internal.h"
 
 
+int sysctl_dirty_debug __read_mostly;
+
+void print_page(struct page *page)
+{
+	printk(KERN_DEBUG "%lu\t%u\t%u\t%c%c%c%c%c\n",
+			page->index,
+			page_count(page),
+			page_mapcount(page),
+			PageUptodate(page)      ? 'U' : '_',
+			PageDirty(page)         ? 'D' : '_',
+			PageWriteback(page)     ? 'W' : '_',
+			PagePrivate(page)       ? 'P' : '_',
+			PageLocked(page)        ? 'L' : '_');
+}
+
+void print_inode_pages(struct inode *inode)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct pagevec pvec;
+	int nr_pages;
+	int i;
+	struct dentry *dentry;
+	int dcount;
+	char *dname;
+
+	rcu_read_lock();
+	nr_pages = radix_tree_gang_lookup_tag(&mapping->page_tree,
+				(void **)pvec.pages, 0, PAGEVEC_SIZE,
+				PAGECACHE_TAG_DIRTY);
+	rcu_read_unlock();
+
+	if (list_empty(&inode->i_dentry)) {
+		dname = "";
+		dcount = 0;
+	} else {
+		dentry = list_entry(inode->i_dentry.next,
+					struct dentry, d_alias);
+		dname = dentry->d_iname;
+		dcount = atomic_read(&dentry->d_count);
+	}
+
+	printk(KERN_DEBUG "inode %lu(%s/%s) count %d,%d size %llu pages %lu\n",
+			inode->i_ino,
+			inode->i_sb->s_id,
+			dname,
+			atomic_read(&inode->i_count),
+			dcount,
+			i_size_read(inode),
+			mapping->nrpages
+	      );
+
+	for (i = 0; i < nr_pages; i++)
+		print_page(pvec.pages[i]);
+}
+
+
 /**
  * writeback_acquire - attempt to get exclusive writeback access to a device
  * @bdi: the device's backing_dev_info structure
@@ -179,6 +236,11 @@ static int write_inode(struct inode *ino
 	return 0;
 }
 
+#define redirty_tail(inode)						\
+	do {								\
+		__redirty_tail(inode, __LINE__);			\
+	} while (0)
+
 /*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
@@ -188,10 +250,16 @@ static int write_inode(struct inode *ino
  * the case then the inode must have been redirtied while it was being written
  * out and we don't reset its dirtied_when.
  */
-static void redirty_tail(struct inode *inode)
+static void __redirty_tail(struct inode *inode, int line)
 {
 	struct super_block *sb = inode->i_sb;
 
+	if (sysctl_dirty_debug) {
+		printk(KERN_DEBUG "redirty_tail line %d: inode %lu\n",
+				line, inode->i_ino);
+		print_inode_pages(inode);
+	}
+
 	if (!list_empty(&sb->s_dirty)) {
 		struct inode *tail_inode;
 
@@ -203,12 +271,23 @@ static void redirty_tail(struct inode *i
 	list_move(&inode->i_list, &sb->s_dirty);
 }
 
+#define requeue_io(inode)						\
+	do {								\
+		__requeue_io(inode, __LINE__);				\
+	} while (0)
+
 /*
  * requeue inode for re-scanning after sb->s_io list is exhausted.
  */
-static void requeue_io(struct inode *inode)
+static void __requeue_io(struct inode *inode, int line)
 {
 	list_move(&inode->i_list, &inode->i_sb->s_more_io);
+
+	if (sysctl_dirty_debug) {
+		printk(KERN_DEBUG "requeue_io line %d: inode %lu\n",
+				line, inode->i_ino);
+		print_inode_pages(inode);
+	}
 }
 
 static void inode_sync_complete(struct inode *inode)
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -159,5 +159,6 @@ void writeback_set_ratelimit(void);
 extern int nr_pdflush_threads;	/* Global so it can be exported to sysctl
 				   read-only. */
 
+extern int sysctl_dirty_debug;
 
 #endif		/* WRITEBACK_H */
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -1344,6 +1344,14 @@ static struct ctl_table fs_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "dirty_debug",
+		.data		= &sysctl_dirty_debug,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
 	{
 		.ctl_name	= CTL_UNNUMBERED,
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -104,6 +104,35 @@ EXPORT_SYMBOL(laptop_mode);
 
 /* End of sysctl-exported parameters */
 
+#define writeback_debug_report(n, wbc) do {                             \
+	if(sysctl_dirty_debug)						\
+		__writeback_debug_report(n, wbc,			\
+				__FILE__, __LINE__, __FUNCTION__);	\
+} while (0)
+
+void print_writeback_control(struct writeback_control *wbc)
+{
+	printk(KERN_DEBUG
+			"global dirty %lu writeback %lu nfs %lu "
+			"flags %c%c towrite %ld skipped %ld\n",
+			global_page_state(NR_FILE_DIRTY),
+			global_page_state(NR_WRITEBACK),
+			global_page_state(NR_UNSTABLE_NFS),
+			wbc->encountered_congestion ? 'C':'_',
+			wbc->more_io ? 'M':'_',
+			wbc->nr_to_write,
+			wbc->pages_skipped);
+}
+
+void __writeback_debug_report(long n, struct writeback_control *wbc,
+		const char *file, int line, const char *func)
+{
+	printk(KERN_DEBUG "%s %d %s: %s(%d) %ld\n",
+			file, line, func,
+			current->comm, current->pid,
+			n);
+	print_writeback_control(wbc);
+}
 
 static void background_writeout(unsigned long _min_pages);
 
@@ -476,6 +505,7 @@ static void balance_dirty_pages(struct a
 			pages_written += write_chunk - wbc.nr_to_write;
 			get_dirty_limits(&background_thresh, &dirty_thresh,
 				       &bdi_thresh, bdi);
+			writeback_debug_report(pages_written, &wbc);
 		}
 
 		/*
@@ -502,6 +532,7 @@ static void balance_dirty_pages(struct a
 			break;		/* We've done our duty */
 
 		congestion_wait(WRITE, HZ/10);
+		writeback_debug_report(-pages_written, &wbc);
 	}
 
 	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
@@ -596,6 +627,11 @@ void throttle_vm_writeout(gfp_t gfp_mask
 			global_page_state(NR_WRITEBACK) <= dirty_thresh)
                         	break;
                 congestion_wait(WRITE, HZ/10);
+		printk(KERN_DEBUG "throttle_vm_writeout: "
+				"congestion_wait on %lu+%lu > %lu\n",
+				global_page_state(NR_UNSTABLE_NFS),
+				global_page_state(NR_WRITEBACK),
+				dirty_thresh);
 
 		/*
 		 * The caller might hold locks which can prevent IO completion
@@ -645,7 +681,9 @@ static void background_writeout(unsigned
 			else
 				break;
 		}
+		writeback_debug_report(min_pages, &wbc);
 	}
+	writeback_debug_report(min_pages, &wbc);
 }
 
 /*
@@ -718,7 +756,9 @@ static void wb_kupdate(unsigned long arg
 				break;	/* All the old data is written */
 		}
 		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		writeback_debug_report(nr_to_write, &wbc);
 	}
+	writeback_debug_report(nr_to_write, &wbc);
 	if (time_before(next_jif, jiffies + HZ))
 		next_jif = jiffies + HZ;
 	if (dirty_writeback_interval)