On Thu, Dec 19, 2024 at 03:51:34AM +0000, Matthew Wilcox wrote:
> On Wed, Dec 18, 2024 at 06:27:36PM -0800, Luis Chamberlain wrote:
> > On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote:
> > > On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> > > > This splits up a minor enhancement from the bs > ps device support
> > > > series into its own series for better review / focus / testing.
> > > > This series just addresses the reducing the array size used and cleaning
> > > > up the async read to be easier to read and maintain.
> > >
> > > How about this approach instead -- get rid of the batch entirely?
> >
> > Less is more! I wish it worked, but we end up with a null pointer on
> > ext4/032 (and indeed this is the test that helped me find most bugs in
> > what I was working on):
>
> Yeah, I did no testing; just wanted to give people a different approach
> to consider.
>
> > [ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90
> > [ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09
>
> That decodes as:
>
>    5:  53                      push   %rbx
>    6:  48 8b 47 10             mov    0x10(%rdi),%rax
>    a:  48 89 fb                mov    %rdi,%rbx
>    d:  48 8b 40 18             mov    0x18(%rax),%rax
>   11:* 48 8b 00                mov    (%rax),%rax      <-- trapping instruction
>   14:  f6 40 0d 40             testb  $0x40,0xd(%rax)
>
> 6: bh->b_folio
> d: b_folio->mapping
> 11: mapping->host
>
> So folio->mapping is NULL.
>
> Ah, I see the problem.  end_buffer_async_read() uses the buffer_async_read
> test to decide if all buffers on the page are uptodate or not.  So both
> having no batch (ie this patch) and having a batch which is smaller than
> the number of buffers in the folio can lead to folio_end_read() being
> called prematurely (ie we'll unlock the folio before finishing reading
> every buffer in the folio).

But:

a) all batched buffers are locked in the old code; we only unlock the
   currently evaluated buffer, and the buffers from our pivot onward are
   locked and should also have the async flag set. The fact that the
   buffers ahead should have the async flag set should prevent us from
   calling folio_end_read() prematurely, as I read the code, no?

b) in the case where we're evaluating the last buffer, we can unlock and
   call folio_end_read(), but that seems fine given the previous batch
   work was in charge of finding the candidate buffers which need a read,
   so in that case there should be no pending read.

So I don't yet see how we can call folio_end_read() prematurely. We do
however unlock the buffer in end_buffer_async_read(), but in case of an
inconsistency we simply bail out of the loop, and since we only called
end_buffer_async_read() when the buffer was up to date, the only
inconsistency we check for is if (!buffer_uptodate(tmp)), in which case
folio_end_read() will be called but without success being annotated.

> Once the folio is unlocked, it can be truncated.  That's a second-order
> problem, but it's the one your test happened to hit.
>
> This should fix the problem; we always have at least one BH held in
> the submission path with the async_read flag set, so
> end_buffer_async_read() will not end it prematurely.

But this alternative does not call end_buffer_async_read(); in fact we
only call folio_end_read() in the case where no pending reads are needed.
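To spell out how I'm reading the old completion path in (a): folio_end_read()
only fires once no buffer on the folio still carries the async_read flag, so a
later buffer that is still marked should keep the folio locked. Here is a rough
standalone model of just that check (plain userspace C, made-up names, no
b_uptodate_lock and no real IO), not the kernel code itself:

/* Toy model of the completion check at the tail of end_buffer_async_read(). */
#include <stdio.h>
#include <stdbool.h>

struct bh_model {
	bool uptodate;
	bool async_read;		/* stand-in for the async read flag */
	struct bh_model *b_this_page;	/* circular list, as on a folio */
};

static void folio_end_read_model(bool success)
{
	printf("folio_end_read(folio, %d)\n", success);
}

/* Complete one buffer's read; only finish the folio once no buffer on
 * the page is still marked async_read. */
static void end_async_read_model(struct bh_model *bh, bool uptodate)
{
	struct bh_model *tmp = bh;
	bool folio_uptodate = true;

	bh->uptodate = uptodate;
	bh->async_read = false;		/* clear our own async mark */

	do {
		if (!tmp->uptodate)
			folio_uptodate = false;
		if (tmp->async_read)
			return;		/* "still_busy": folio stays locked */
		tmp = tmp->b_this_page;
	} while (tmp != bh);

	folio_end_read_model(folio_uptodate);
}

int main(void)
{
	struct bh_model a = { .async_read = true };
	struct bh_model b = { .async_read = true };

	a.b_this_page = &b;
	b.b_this_page = &a;

	end_async_read_model(&a, true);	/* b still async_read: no end */
	end_async_read_model(&b, true);	/* last completion: folio ends */
	return 0;
}

That is the behaviour I'd expect as long as every not-uptodate buffer in the
folio was locked and marked before the first completion could run, which is
what the old stage-two loop was doing for the batched buffers, as I read it.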
> diff --git a/fs/buffer.c b/fs/buffer.c
> index cc8452f60251..fd2633e4a5d2 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2361,9 +2361,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>  {
>  	struct inode *inode = folio->mapping->host;
>  	sector_t iblock, lblock;
> -	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
> +	struct buffer_head *bh, *head, *prev = NULL;
>  	size_t blocksize;
> -	int nr, i;
> +	int i;
>  	int fully_mapped = 1;
>  	bool page_error = false;
>  	loff_t limit = i_size_read(inode);
> @@ -2380,7 +2380,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>  	iblock = div_u64(folio_pos(folio), blocksize);
>  	lblock = div_u64(limit + blocksize - 1, blocksize);
>  	bh = head;
> -	nr = 0;
>  	i = 0;
>
>  	do {
> @@ -2411,40 +2410,33 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>  			if (buffer_uptodate(bh))
>  				continue;
>  		}
> -		arr[nr++] = bh;
> +
> +		lock_buffer(bh);
> +		if (buffer_uptodate(bh)) {
> +			unlock_buffer(bh);
> +			continue;
> +		}
> +
> +		mark_buffer_async_read(bh);
> +		if (prev)
> +			submit_bh(REQ_OP_READ, prev);
> +		prev = bh;
>  	} while (i++, iblock++, (bh = bh->b_this_page) != head);
>
>  	if (fully_mapped)
>  		folio_set_mappedtodisk(folio);
>
> -	if (!nr) {
> -		/*
> -		 * All buffers are uptodate or get_block() returned an
> -		 * error when trying to map them - we can finish the read.
> -		 */
> -		folio_end_read(folio, !page_error);
> -		return 0;
> -	}
> -
> -	/* Stage two: lock the buffers */
> -	for (i = 0; i < nr; i++) {
> -		bh = arr[i];
> -		lock_buffer(bh);
> -		mark_buffer_async_read(bh);
> -	}
> -
>  	/*
> -	 * Stage 3: start the IO.  Check for uptodateness
> -	 * inside the buffer lock in case another process reading
> -	 * the underlying blockdev brought it uptodate (the sct fix).
> +	 * All buffers are uptodate or get_block() returned an error
> +	 * when trying to map them - we must finish the read because
> +	 * end_buffer_async_read() will never be called on any buffer
> +	 * in this folio.

Do we want to keep mentioning end_buffer_async_read() here?

>  	 */
> -	for (i = 0; i < nr; i++) {
> -		bh = arr[i];
> -		if (buffer_uptodate(bh))
> -			end_buffer_async_read(bh, 1);
> -		else
> -			submit_bh(REQ_OP_READ, bh);
> -	}
> +	if (prev)
> +		submit_bh(REQ_OP_READ, prev);
> +	else
> +		folio_end_read(folio, !page_error);
> +
>  	return 0;

Because we only call folio_end_read() in the above code when we
determined there was no pending read IO. If we had to issue at least
one read for a buffer, we never call folio_end_read(). We didn't
before either, unless we ran into the race where a pending batched
read coincided with another read being issued that updated our buffer
by chance, and we determined that read either completed fine or with
an error.

The reason I'm asking these things is that I'm trying to determine
whether there was a pre-existing issue we're trying to fix, other than
the simplification with the new un-batched strategy. I don't see it
yet. If we are fixing something here, it is still a bit obscure to me
and I'd like to make sure we document it properly.

  Luis
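P.S. To double-check my reading of the new flow, here is a similar rough
standalone model of the prev-chaining above (again plain userspace C with
made-up names, no locking and no real IO): each candidate is marked
async_read but its submit is held back until the next candidate is found,
so while the loop runs there is always at least one marked buffer not yet
submitted, and folio_end_read() is only reached when no buffer needed a
read at all:

#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

struct bh_model {
	int index;
	bool uptodate;
	bool async_read;
	struct bh_model *b_this_page;	/* circular list, as on a folio */
};

static void submit_read_model(struct bh_model *bh)
{
	printf("submit_bh(REQ_OP_READ, bh %d)\n", bh->index);
}

static void folio_end_read_model(bool success)
{
	printf("folio_end_read(folio, %d)\n", success);
}

/* Model of the unbatched submission loop: defer each submit until the
 * next candidate has been marked, so the folio cannot complete while
 * we are still walking it. */
static void read_full_folio_model(struct bh_model *head, bool page_error)
{
	struct bh_model *bh = head, *prev = NULL;

	do {
		if (bh->uptodate)
			continue;
		bh->async_read = true;		/* mark it async read */
		if (prev)
			submit_read_model(prev);
		prev = bh;			/* hold back the newest one */
	} while ((bh = bh->b_this_page) != head);

	if (prev)
		submit_read_model(prev);	/* last read goes out here */
	else
		folio_end_read_model(!page_error);	/* nothing to read */
}

int main(void)
{
	struct bh_model b0 = { .index = 0, .uptodate = true };
	struct bh_model b1 = { .index = 1 };
	struct bh_model b2 = { .index = 2 };

	b0.b_this_page = &b1;
	b1.b_this_page = &b2;
	b2.b_this_page = &b0;

	/* Prints submits for bh 1 and bh 2, never folio_end_read(). */
	read_full_folio_model(&b0, false);
	return 0;
}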