Re: [RFC][PATCH] fix journal overflow problem

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



  Hello,

On Thu 21-02-08 13:58:55, Josef Bacik wrote:
> This is related to that jbd patch I sent a few weeks ago.  I originally
> found that the problem where t_nr_buffers would be > than
> t_outstanding_credits wouldn't happen upstream, but apparently I'm an
> idiot and I was just missing my messsages, and the problem does exist.
> Now for the entirely too long of a description of whats going wrong.
> 
> Say we have a transaction dirty a bitmap buffer and go to flush it to the
> disk.  Then ext3 goes to get write access to that buffer via
> journal_get_undo_access(), finds out it doesn't need it, and then
> subsequently does a journal_release_buffer() and then proceeds to never
> touch that buffer again.  Then the original committing transaction will
> go through and add its buffers to the checkpointing list, and refile the
> buffer.  Because we did a journal_get_undo_access(), the
> jh->b_next_transaction is set to our currently running transaction, and
> that buffer because it was set BH_JBDDirty by the committing transaction
> is filed onto the running transactions BJ_Metadata list, which increments
> our t_nr_buffers counter.  Because we never actually dirtied this buffer
> ourselves, we never accounted for the credit, and we end up with
> t_outstanding_credits being less than t_nr_buffers.
  Thanks for the debugging. You're right that such situation can happen
and we then miscount the transactions credits. Actually, we miscount the
credits whenever we do journal_get_write_access() on a jbd_dirty buffer
that isn't yet in our transaction and don't call journal_dirty_metadata()
later.

> This is a problem because while we are writing out the metadata blocks to
> the journal, we do a t_outstanding_credits-- for each buffer.  If
> t_outstanding_credits is less than the number of buffers we have then
> t_outstanding_credits will eventually become negative, which means that
> jbd_space_needed will eventually start saying it needs way less credits
> than it actually does, and will allow transactions to grow huge and
> eventually we'll overflow the journal (albeit this is a bitch to try and
> reproduce).
  Yes, actually, how much negative t_outstanding_credits grow? I'd expect
that this is not too common situation...

> So my approach is to have a counter that is incremented each time the
> transaction calls do_get_write_access (or journal_get_create_access) so
> we can keep track of how many people are currently trying to modify that
> buffer.  So in the case where we do a
> journal_get_undo_access()+journal_release_buffer() and nobody else ever
> touches the buffer, we can then set jh->b_next_transaction to NULL in
> journal_release_buffer() and avoid having the buffer filed onto our
> transaction.  If somebody else is modifying the journal head then we know
> to leave it alone because chances are it will be dirtied and the credit
> will be accounted for.
  But the race is still possibly there in case we refile the buffer from
t_forget list just between do_get_write_access() and
journal_release_buffer(), isn't it?
  And it would be quite hard to get rid of such races. So maybe how about
the following: In do_get_write_access() (or journal_get_create_access())
when we see the buffer is jbddirty and we set j_next_transaction to our
transaction, we also set j_modified to 1. That should fix the accounting of
transaction credits. I agree that sometimes we needlessly refile some
buffers from the previous transaction but as I said above, it shouldn't be
that much (and we did it up to now anyway).
 
> There is also a slight change to how we reset b_modified.  I originally
> reset b_nr_access (my access counter) in the same way b_modified was
> reset, but I didn't really like this because we were only taking the
> j_list_lock instead of the jbd_buffer lock, so we could race and still
> end up in the same situation (which is in fact what happened).  So I've
  Yes, that is a good catch.

> changed how we reset b_modified.  Instead of looping through all of the
> buffers for the transaction, which is a little innefficient anyway, we
> reset it in do_get_write_access in the cases where we know that this is
> the first time that this transaction has accessed the buffer (ie when
> b_next_transaction != transaction && b_transaction != transaction).  I
> reset b_nr_access in the same way.  I ran tests with this patch and
> verified that we no longer got into the situation where
> t_outstanding_credits was less than t_nr_buffers.
> 
> This is just my patch that I was using, I plan on cleaning it up if this
> is an acceptable way to fix the problem.  I'd also like to put an ASSERT
> before we process the t_buffers list for the case where
> t_outstanding_credits is less than t_nr_buffers.  If my particular
> solution isn't acceptable I'm open to suggestions, however I still think
> that resetting b_modified should be changed the way I have it as its a
> potential race condition and innefficient.  Thanks much,
  I agree with the b_modified change, but please send it as a separate
patch. For the credit accounting I'd rather go by the route I've suggested
above.

								Honza

> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index a38c718..6cc0a1e 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -407,22 +407,6 @@ void journal_commit_transaction(journal_t *journal)
>  	jbd_debug (3, "JBD: commit phase 2\n");
>  
>  	/*
> -	 * First, drop modified flag: all accesses to the buffers
> -	 * will be tracked for a new trasaction only -bzzz
> -	 */
> -	spin_lock(&journal->j_list_lock);
> -	if (commit_transaction->t_buffers) {
> -		new_jh = jh = commit_transaction->t_buffers->b_tnext;
> -		do {
> -			J_ASSERT_JH(new_jh, new_jh->b_modified == 1 ||
> -					new_jh->b_modified == 0);
> -			new_jh->b_modified = 0;
> -			new_jh = new_jh->b_tnext;
> -		} while (new_jh != jh);
> -	}
> -	spin_unlock(&journal->j_list_lock);
> -
> -	/*
>  	 * Now start flushing things to disk, in the order they appear
>  	 * on the transaction lists.  Data blocks go first.
>  	 */
> @@ -490,6 +474,11 @@ void journal_commit_transaction(journal_t *journal)
>  
>  	descriptor = NULL;
>  	bufs = 0;
> +
> +	if (commit_transaction->t_nr_buffers >
> +	    commit_transaction->t_outstanding_credits)
> +		printk(KERN_EMERG "nr buffers greater than outstanding credits\n");
> +
>  	while (commit_transaction->t_buffers) {
>  
>  		/* Find the next buffer to be journaled... */
> diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
> index 3943a89..5059b79 100644
> --- a/fs/jbd/journal.c
> +++ b/fs/jbd/journal.c
> @@ -1800,6 +1800,7 @@ static void __journal_remove_journal_head(struct 
> buffer_head *bh)
>  				printk(KERN_WARNING "%s: freeing "
>  						"b_committed_data\n",
>  						__FUNCTION__);
> +				dump_stack();
>  				jbd_free(jh->b_committed_data, bh->b_size);
>  			}
>  			bh->b_private = NULL;
> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> index 038ed74..db00d39 100644
> --- a/fs/jbd/transaction.c
> +++ b/fs/jbd/transaction.c
> @@ -609,6 +609,13 @@ repeat:
>  		goto done;
>  
>  	/*
> +	 * This is the first time this transaction has touched this buffer
> +	 * reset b_modified and b_nr_access
> +	 */
> +	jh->b_modified = 0;
> +	jh->b_nr_access = 0;
> +
> +	/*
>  	 * If there is already a copy-out version of this buffer, then we don't
>  	 * need to make another one
>  	 */
> @@ -713,6 +720,11 @@ repeat:
>  	}
>  
>  done:
> +	/*
> +	 * Ok we got write access, update the access counter
> +	 */
> +	jh->b_nr_access++;
> +
>  	if (need_copy) {
>  		struct page *page;
>  		int offset;
> @@ -819,13 +831,33 @@ int journal_get_create_access(handle_t *handle, struct 
> buffer_head *bh)
>  	J_ASSERT_JH(jh, buffer_locked(jh2bh(jh)));
>  
>  	if (jh->b_transaction == NULL) {
> +		/*
> +		 * first time this transaction has touched this buffer,
> +		 * reset b_modified
> +		 */
> +		jh->b_modified = 0;
> +
>  		jh->b_transaction = transaction;
>  		JBUFFER_TRACE(jh, "file as BJ_Reserved");
>  		__journal_file_buffer(jh, transaction, BJ_Reserved);
>  	} else if (jh->b_transaction == journal->j_committing_transaction) {
>  		JBUFFER_TRACE(jh, "set next transaction");
> +
> +		if (jh->b_next_transaction != transaction) {
> +			/* reset b_modified */
> +			jh->b_modified = 0;
> +		}
> +
>  		jh->b_next_transaction = transaction;
>  	}
> +
> +	/*
> +	 * Nobody who calls journal_get_create_access calls
> +	 * journal_release_buffer, but if at some point somebody decides to
> +	 * we'll need this
> +	 */
> +	jh->b_nr_access++;
> +
>  	spin_unlock(&journal->j_list_lock);
>  	jbd_unlock_bh_state(bh);
>  
> @@ -1190,11 +1222,41 @@ out:
>   * journal_release_buffer: undo a get_write_access without any buffer
>   * updates, if the update decided in the end that it didn't need access.
>   *
> + * We decrement the b_nr_access as this person no longer requires write
> + * access to the buffer, and if nobody else requires write access then we want
> + * to set b_next_transaction to NULL in case this buffer is on the committing
> + * transaction, as it will get refiled onto this transaction without it being
> + * counted.  We don't worry about the case where this is the first time this
> + * jh has been used (ie jh->b_transaction == handle->h_transaction) since the
> + * beginning of the committing of this transaction will drop the jh.
>   */
>  void
>  journal_release_buffer(handle_t *handle, struct buffer_head *bh)
>  {
> +	struct journal_head *jh = bh2jh(bh);
>  	BUFFER_TRACE(bh, "entry");
> +
> +	jbd_lock_bh_state(bh);
> +	/*
> +	 * if nobody else has requested write access to this jh then set
> +	 * b_next_transaction to NULL to keep it from getting refiled onto
> +	 * this transaction
> +	 */
> +	if (!(--jh->b_nr_access)) {
> +
> +		/*
> +		 * A journal_get_undo_access()+journal_release_buffer() will
> +		 * leave undo-committed data that we no longer need
> +		 */
> +		if (jh->b_next_transaction && jh->b_committed_data) {
> +			jbd_free(jh->b_committed_data, bh->b_size);
> +			jh->b_committed_data = NULL;
> +		}
> +
> +		jh->b_next_transaction = NULL;
> +	}
> +
> +	jbd_unlock_bh_state(bh);
>  }
>  
>  /**
> @@ -1245,6 +1307,11 @@ int journal_forget (handle_t *handle, struct buffer_head 
> *bh)
>  	 */
>  	jh->b_modified = 0;
>  
> +	/*
> +	 * drop one of our write access credits
> +	 */
> +	jh->b_nr_access--;
> +
>  	if (jh->b_transaction == handle->h_transaction) {
>  		J_ASSERT_JH(jh, !jh->b_frozen_data);
>  
> diff --git a/include/linux/journal-head.h b/include/linux/journal-head.h
> index 8a62d1e..160f1d1 100644
> --- a/include/linux/journal-head.h
> +++ b/include/linux/journal-head.h
> @@ -39,6 +39,13 @@ struct journal_head {
>  	unsigned b_modified;
>  
>  	/*
> +	 * This is a counter of the number of things who have requested write
> +	 * access to this buffer by the current transaction
> +	 * [jbd_lock_bh_state()]
> +	 */
> +	unsigned b_nr_access;
> +
> +	/*
>  	 * Copy of the buffer data frozen for writing to the log.
>  	 * [jbd_lock_bh_state()]
>  	 */
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Reiser Filesystem Development]     [Ceph FS]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite National Park]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Device Mapper]     [Linux Media]

  Powered by Linux