Hi Andrew.

> > rather costly.  An alternative might be to implement a shrinker
> > callback function for the journal_head slab cache.  Did you consider
> > this?
> Yes.
> But the unused-list and counters are required by managing the shrink
> targets ("journal head") if we implement a shrinker.
> I thought that comparatively big code changes were necessary for jbd
> to accomplish it.
> However I will try it.

I have built a shrinker callback function for the journal_head slab
cache.  The new code is smaller than the previous version, but its logic
seems more complex.  That said, I haven't run into any problems while
testing some light-load operations on the patched kernel.  However, I
think the system may hang if several journal_head shrinkers run
concurrently, so I will try to build a more appropriate fix.  Please
comment if you have a better idea.

------------------------------------------------------------------------------
Direct data blocks can be released through the releasepage() operation
of their mapping (the mapping of their filesystem inode).  The indirect
data blocks (ext3), on the other hand, can only be released by
try_to_free_buffers(), because their mapping belongs to the block
device, and a block device mapping has no releasepage() operation of its
own.  try_to_free_buffers() is a generic function that releases
buffer_heads, and it cannot release a buffer_head while private data
(such as a journal_head) is still attached, because the buffer_head's
reference count is then greater than zero.  Therefore a buffer_head
cannot be released by try_to_free_buffers() even when its private data
could be released.  As a result, the oom-killer may trigger when system
memory is exhausted, even though a lot of private data could be freed.

To solve this, a shrinker for journal_heads is required.  The shrinker
was modeled on logic such as shrink_icache_memory().
In order to shrink journal_heads, it is necessary to manage a list of
journal_heads which need to be checkpointed, across all filesystems
using jbd.  The newly added list is operated at the following points:

- When a journal_head is registered on a checkpoint list, it is also
  registered on the global checkpoint list (the newly added list).
- When a journal_head is removed from a checkpoint list, it is also
  removed from the global checkpoint list.
- While the shrinker is working, it scans only the requested number of
  journal_heads linked on the new list and releases them where possible.

As a result, the oom-killer becomes less likely to trigger than before.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@xxxxxxxxxxxxxx>
---
 fs/jbd/checkpoint.c          |   77 +++++++++++++++++++++++++++++++++++++++++
 fs/jbd/journal.c             |    2 +
 include/linux/journal-head.h |    7 +++
 3 files changed, 86 insertions(+)

diff -Nurp linux-2.6.27.1.org/fs/jbd/checkpoint.c linux-2.6.27.1/fs/jbd/checkpoint.c
--- linux-2.6.27.1.org/fs/jbd/checkpoint.c	2008-10-16 08:02:53.000000000 +0900
+++ linux-2.6.27.1/fs/jbd/checkpoint.c	2008-10-23 15:07:14.000000000 +0900
@@ -24,6 +24,14 @@
 #include <linux/slab.h>
 
 /*
+ * Used for shrinking journal_heads whose I/O is completed
+ */
+static DEFINE_SPINLOCK(jbd_global_lock);
+static LIST_HEAD(jbd_checkpoint_list);
+static int jbd_jh_cache_pressure = 10;
+static int jbd_nr_checkpoint_jh = 0;
+
+/*
  * Unlink a buffer from a transaction checkpoint list.
  *
  * Called with j_list_lock held.
@@ -595,6 +603,10 @@ int __journal_remove_checkpoint(struct j
 	__buffer_unlink(jh);
 	jh->b_cp_transaction = NULL;
 
+	spin_lock(&jbd_global_lock);
+	list_del_init(&jh->b_checkpoint_list);
+	jbd_nr_checkpoint_jh--;
+	spin_unlock(&jbd_global_lock);
 	if (transaction->t_checkpoint_list != NULL ||
 	    transaction->t_checkpoint_io_list != NULL)
@@ -655,8 +667,73 @@ void __journal_insert_checkpoint(struct
 		jh->b_cpnext->b_cpprev = jh;
 	}
 	transaction->t_checkpoint_list = jh;
+	spin_lock(&jbd_global_lock);
+	list_add(&jh->b_checkpoint_list, &jbd_checkpoint_list);
+	jbd_nr_checkpoint_jh++;
+	spin_unlock(&jbd_global_lock);
+}
+
+static void try_to_free_cp_buf(journal_t *journal, transaction_t *transaction,
+			       struct journal_head *jh)
+{
+	transaction_t *transaction2;
+
+	spin_lock(&journal->j_list_lock);
+	if (!list_empty(&jh->b_checkpoint_list)) {
+		transaction2 = jh->b_cp_transaction;
+		BUG_ON(transaction2 == NULL);
+		if (transaction == transaction2) {
+			jbd_lock_bh_state(jh2bh(jh));
+			__try_to_free_cp_buf(jh);
+		}
+	}
+	spin_unlock(&journal->j_list_lock);
 }
 
+static void prune_jbd_jhcache(int nr)
+{
+	struct journal_head *jh;
+	struct list_head *tmp;
+	journal_t *journal;
+	transaction_t *transaction;
+
+	BUG_ON(nr < 0);
+	for (; nr; nr--) {
+		spin_lock(&jbd_global_lock);
+		if ((tmp = jbd_checkpoint_list.prev) == &jbd_checkpoint_list) {
+			spin_unlock(&jbd_global_lock);
+			break;
+		}
+		list_move(tmp, &jbd_checkpoint_list);
+		jh = list_entry(tmp, struct journal_head, b_checkpoint_list);
+		/* Protect the jh from being removed while we operate on it */
+		journal_grab_journal_head(jh2bh(jh));
+		transaction = jh->b_cp_transaction;
+		BUG_ON(transaction == NULL);
+		journal = transaction->t_journal;
+		spin_unlock(&jbd_global_lock);
+		/* Release the jh from the checkpoint list if possible */
+		try_to_free_cp_buf(journal, transaction, jh);
+		/* Drop the reference taken above (may actually free the jh) */
+		journal_put_journal_head(jh);
+		cond_resched();
+	}
+}
+
+static int shrink_jbd_jhcache_memory(int nr, gfp_t gfp_mask)
+{
+	if (nr) {
+		if (!(gfp_mask & __GFP_FS))
+			return -1;
+		prune_jbd_jhcache(nr);
+	}
+	return (jbd_nr_checkpoint_jh * 100) / jbd_jh_cache_pressure;
+}
+
+struct shrinker jbd_jh_shrinker = {
+	.shrink = shrink_jbd_jhcache_memory,
+	.seeks = DEFAULT_SEEKS,
+};
+
 /*
  * We've finished with this transaction structure: adios...
  *
diff -Nurp linux-2.6.27.1.org/fs/jbd/journal.c linux-2.6.27.1/fs/jbd/journal.c
--- linux-2.6.27.1.org/fs/jbd/journal.c	2008-10-16 08:02:53.000000000 +0900
+++ linux-2.6.27.1/fs/jbd/journal.c	2008-10-23 15:00:44.000000000 +0900
@@ -1890,6 +1890,7 @@ static inline void jbd_remove_debugfs_en
 #endif
 
+extern struct shrinker jbd_jh_shrinker;
 struct kmem_cache *jbd_handle_cache;
 
 static int __init journal_init_handle_cache(void)
@@ -1903,6 +1904,7 @@ static int __init journal_init_handle_ca
 		printk(KERN_EMERG "JBD: failed to create handle cache\n");
 		return -ENOMEM;
 	}
+	register_shrinker(&jbd_jh_shrinker);
 	return 0;
 }
 
diff -Nurp linux-2.6.27.1.org/include/linux/journal-head.h linux-2.6.27.1/include/linux/journal-head.h
--- linux-2.6.27.1.org/include/linux/journal-head.h	2008-10-16 08:02:53.000000000 +0900
+++ linux-2.6.27.1/include/linux/journal-head.h	2008-10-23 15:00:44.000000000 +0900
@@ -87,6 +87,13 @@ struct journal_head {
 	 * [j_list_lock]
 	 */
 	struct journal_head *b_cpnext, *b_cpprev;
+
+	/*
+	 * List of checkpointed journal_heads across all filesystems
+	 * using jbd, used by the shrinker.
+	 * [jbd_global_lock]
+	 */
+	struct list_head b_checkpoint_list;
 };
 
 #endif /* JOURNAL_HEAD_H_INCLUDED */
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html