On Sun, Apr 16, 2023 at 04:40:06AM +0100, Matthew Wilcox wrote:
> On Sat, Apr 15, 2023 at 06:28:11PM -0700, Luis Chamberlain wrote:
> > On Sat, Apr 15, 2023 at 06:09:12PM +0100, Matthew Wilcox wrote:
> > > On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote:
> > > > On 4/15/23 05:44, Matthew Wilcox wrote:
> > > > We _could_ upgrade to always do full page I/O; there's a good
> > > > chance we'll be using the entire page anyway eventually.
> >
> > *Iff* doing away with buffer head 512 granularity could help block sizes
> > greater than page size, where physical and logical block size > PAGE_SIZE,
> > we would also be able to see it on 4kn drives (logical and physical
> > block size == 4k). A projection could be made after.
> >
> > In so far as experimenting with this, if you already have some
> > effort on IOMAP for bdev aops, one possibility for pure experimentation
> > for now would be to peg a new set of aops in the path of
> > __alloc_disk_node() --> bdev_alloc(), but that's a wee bit too early
> > for us to know if the device has (lbs == pbs) > 512. For NVMe for
> > instance this would be nvme_alloc_ns() --> blk_mq_alloc_disk(). We
> > put together and set the logical and physical block size on NVMe in
> > nvme_update_ns_info() --> nvme_update_disk_info(), right before we
> > call device_add_disk(). The only way to override the aops then would
> > be right before device_add_disk(), or as part of a new
> > device_add_disk_aops() or whatever.
>
> I think you're making this harder than you need to.
> For LBA size > PAGE_SIZE, there is always only going to be
> one BH per folio-that-is-LBA-size, so all the problems with
> more-than-8-BHs-per-4k-page don't actually exist.

Oh! Then yes, sorry!

> I don't think we
> should be overriding the aops, and if we narrow the scope of large folio
> support in blockdev to only supporting folio_size == LBA size, it becomes
> much more feasible.

I'm trying to think of possible use cases where folio_size != LBA size
and I cannot immediately think of any. Yes, there are cases where a
filesystem may use a different block size for, say, metadata than for
data, but I believe that is a side issue, ie, reads/writes for small
metadata would still have to be accepted. At least for NVMe we have a
metadata size as part of the LBA format, but from what I understand no
Linux filesystem uses that yet.

> > > > And with storage bandwidth getting larger and larger we might even
> > > > get a performance boost there.
> > >
> > > I think we need to look at this from the filesystem side.
> >
> > Before that let's recap the bdev cache's current issues.
>
> Ooh, yes, this is good! I was totally neglecting the partition
> scanning code.
>
> > Today, just by adding the disk we move on to partition scanning
> > immediately, unless your block driver has a flag that says otherwise. The
> > current crash we're evaluating with brd, and that we also hit with NVMe,
> > is due to this part:
> >
> > device_add_disk() -->
> >   disk_scan_partitions() -->
> >     blkdev_get_whole() -->
> >       bdev_disk_changed() -->
> >         filemap_read_folio() --> filler()
> >
> > The filler is from the aops.
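For completeness, the filler we end up with for the bdev inode is the
read_folio from the block device aops, which punts straight into
fs/buffer.c. Quoting block/fops.c from memory, so the exact listing may
be slightly off:

	static int blkdev_read_folio(struct file *file, struct folio *folio)
	{
		return block_read_full_folio(folio, blkdev_get_block);
	}

	const struct address_space_operations def_blk_aops = {
		.dirty_folio	  = block_dirty_folio,
		.invalidate_folio = block_invalidate_folio,
		.read_folio	  = blkdev_read_folio,
		.readahead	  = blkdev_readahead,
		...
	};

So the very first partition scan read already lands in the buffer head
code with whatever block size the bdev inode's i_blkbits describes.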
> Right, so you missed a step in that callchain, which is
> read_mapping_folio(). That ends up in do_read_cache_folio(), which
> contains the deadly:
>
>	folio = filemap_alloc_folio(gfp, 0);

Right, and just before this we have:

	if (!filler)
		filler = mapping->a_ops->read_folio;

The folio is order 0, and after filemap_alloc_folio() it's added to the
page cache; then filemap_read_folio(file, filler, folio) uses the filler:

	blkdev_read_folio() --> fs/buffer.c block_read_full_folio()

Just for posterity, from fs/buffer.c:

int block_read_full_folio(struct folio *folio, get_block_t *get_block)
{
	...
	struct buffer_head *bh, *head, ...;
	...
	head = create_page_buffers(&folio->page, inode, 0);
	...
}

static struct buffer_head *create_page_buffers(struct page *page,
		struct inode *inode, unsigned int b_state)
{
	BUG_ON(!PageLocked(page));

	if (!page_has_buffers(page))
		create_empty_buffers(page,
				     1 << READ_ONCE(inode->i_blkbits),
				     b_state);
	return page_buffers(page);
}

void create_empty_buffers(struct page *page, unsigned long blocksize,
			  unsigned long b_state)
{
	struct buffer_head *bh, *head, *tail;

	head = alloc_page_buffers(page, blocksize, true);
	bh = head;
	do {
		bh->b_state |= b_state;    -----> CRASH HERE, head is NULL
		tail = bh;
		bh = bh->b_this_page;
	} while (bh);
	tail->b_this_page = head;
}

Why is it NULL? The blocksize passed to alloc_page_buffers() is larger
than PAGE_SIZE, and below, offset starts at PAGE_SIZE. PAGE_SIZE -
blocksize gives a negative number, so the loop body is never entered and
head stays NULL.

struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
		bool retry)
{
	struct buffer_head *bh, *head;
	gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
	long offset;
	struct mem_cgroup *memcg, *old_memcg;

	if (retry)
		gfp |= __GFP_NOFAIL;

	/* The page lock pins the memcg */
	memcg = page_memcg(page);
	old_memcg = set_active_memcg(memcg);

	head = NULL;
	offset = PAGE_SIZE;
	while ((offset -= size) >= 0) {
		bh = alloc_buffer_head(gfp);
		if (!bh)
			goto no_grow;

		bh->b_this_page = head;
		bh->b_blocknr = -1;
		head = bh;

		bh->b_size = size;

		/* Link the buffer to its page */
		set_bh_page(bh, page, offset);
	}
out:
	set_active_memcg(old_memcg);
	return head;
	...
}

I see now what you mean about the buffer head being of the block size,
ie bh->b_size = size above.

> so that needs to be changed to get the minimum folio order from the
> mapping, and then it should work.
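Something like the below is what I picture then, just as a strawman;
mapping_min_folio_order() is a made-up helper name here, standing in for
however we end up recording "this mapping only wants folios of at least
this order" (e.g. derived from the bdev logical block size when the disk
is added):

	/*
	 * Strawman only: not a real helper today. The index handed to
	 * filemap_add_folio() would also have to be aligned down to
	 * this order.
	 */
	static inline unsigned int
	mapping_min_folio_order(struct address_space *mapping)
	{
		return 0;	/* today's behaviour: order-0 folios only */
	}

	/* ... and in do_read_cache_folio(), instead of order 0: */

	folio = filemap_alloc_folio(gfp, mapping_min_folio_order(mapping));
	if (!folio)
		return ERR_PTR(-ENOMEM);

With folio_size == LBA size that would leave block_read_full_folio()
creating a single buffer head covering the whole folio, which matches
your one-BH-per-folio observation above.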
> > > What do filesystems actually want to do?
> >
> > So you are suggesting that the early reads of the block device by the
> > block cache and its use of the page cache should be aligned /
> > perhaps redesigned to assist more clearly with what modern filesystems
> > would actually want today?
>
> Perhaps? I'm just saying the needs of the block device are not the
> only consideration here. I'd like an API that makes sense for the fs
> author.

Makes sense!

> > > The first thing is they want to read
> > > the superblock. That's either going to be immediately freed ("Oh,
> > > this isn't a JFS filesystem after all") or it's going to hang around
> > > indefinitely. There's no particular need to keep it in any kind of
> > > cache (buffer or page).
> >
> > And the bdev cache would not be able to know beforehand that's the case.
>
> Right, nobody knows until it's been read and examined.
>
> > > Except ... we want to probe a dozen different
> > > filesystems, and half of them keep their superblock at the same offset
> > > from the start of the block device. So we do want to keep it cached.
> > > That's arguing for using the page cache, at least to read it.
> >
> > Do we currently share anything from the bdev cache with the fs for this?
> > Let's say that first block device blocksize worth of data in memory.
>
> sb_bread() is used by most filesystems, and the buffer cache aliases
> into the page cache.

I see, thanks. I checked what xfs does and its xfs_readsb() uses its own
xfs_buf_read_uncached(). It ends up calling xfs_buf_submit(), and
xfs_buf_ioapply_map() does its own submit_bio(). So I'm curious why they
did that.

> > > Now, do we want userspace to be able to dd a new superblock into place
> > > and have the mounted filesystem see it?
> >
> > Not sure I follow this. dd a new super block?
>
> In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N',
> I can overwrite the superblock. Do we want filesystems to see that
> kind of vandalism, or do we want the mounted filesystem to have its
> own copy of the data and overwrite what userspace wrote the next time it
> updates the superblock?

Oh, what happens today?

> (the trick is that this may not be vandalism, it might be the sysadmin
> updating the uuid or running some fsck-ish program or trying to update
> the superblock to support fabulous-new-feature on next mount. does this
> change the answer?)
>
> > > I suspect that confuses just
> > > about every filesystem out there. So I think the right answer is to read
> > > the page into the bdev's page cache and then copy it into a kmalloc'ed
> > > buffer which the filesystem is then responsible for freeing. It's also
> > > responsible for writing it back (so that's another API we need), and for
> > > a journalled filesystem, it needs to fit into the journalling scheme.
> > > Also, we may need to write back multiple copies of the superblock,
> > > possibly with slight modifications.
> >
> > Are you considering these as extensions to the bdev cache?
>
> I'm trying to suggest some of the considerations that need to go into
> a replacement for sb_bread().

I see! That would also help us EOL buffer heads for that purpose.

  Luis
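PS: to make the sb_bread() replacement a bit more concrete, here is the
sort of strawman I have in mind for the read side. The names are invented,
only the semantics you describe are intended: read through the bdev page
cache, then hand the filesystem its own kmalloc'ed copy which it must
free (the write-back counterpart would be a separate call). It also
assumes the superblock fits within a single folio.

	/* Strawman, not a real API. */
	void *bdev_read_sb(struct block_device *bdev, loff_t pos, size_t len)
	{
		struct address_space *mapping = bdev->bd_inode->i_mapping;
		struct folio *folio;
		void *buf;

		/* Read (or find) the folio in the bdev page cache */
		folio = read_mapping_folio(mapping, pos >> PAGE_SHIFT, NULL);
		if (IS_ERR(folio))
			return ERR_CAST(folio);

		/* Private copy owned by the filesystem, which kfree()s it */
		buf = kmalloc(len, GFP_KERNEL);
		if (buf)
			memcpy(buf, folio_address(folio) +
				    offset_in_folio(folio, pos), len);
		folio_put(folio);
		return buf ? buf : ERR_PTR(-ENOMEM);
	}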