Re: [RFC PATCHv2 1/1] fs: ext4: Don't use CMA for buffer_head

On Thu 12-09-24 17:10:44, Zhaoyang Huang wrote:
> On Thu, Sep 12, 2024 at 4:41 PM Jan Kara <jack@xxxxxxx> wrote:
> >
> > On Wed 04-09-24 14:56:29, Zhaoyang Huang wrote:
> > > On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@xxxxxxx> wrote:
> > > > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote:
> > > > > >
> > > > > > After all, using __GFP_MOVABLE memory seems to mean that the buffer
> > > > > > cache might get thrashed a lot by having a lot of cached disk buffers
> > > > > > ejected from memory to try to make room for some contiguous
> > > > > > frame buffer memory, which means extra I/O overhead.  So what's the
> > > > > > upside of using __GFP_MOVABLE for the buffer cache?
> > > > >
> > > > > To my understanding, no. Using __GFP_MOVABLE memory doesn't introduce
> > > > > extra I/O: when such pages fall inside the target memory area, they are
> > > > > migrated to free pages rather than ejected directly. In terms of
> > > > > reclaim, page blocks of all migrate types are treated the same.
> > > >
> > > > Where is that being done?  I don't see any evidence of this kind of
> > > > migration in fs/buffer.c.
> > > The journaled pages carrying jh->bh are treated as file pages
> > > during isolation of a range of PFNs in the callstack below [1]. The bh
> > > is migrated via the mapping's aops->migrate_folio, which does what you
> > > described below: it copies the content and reattaches the bh to a new
> > > page. For a journal-enabled ext4 partition, the inode is a
> > > block device inode, which uses buffer_migrate_folio_norefs as its
> > > migrate_folio [2].
> > >
> > > [1]
> > > cma_alloc/alloc_contig_range
> > >     __alloc_contig_migrate_range
> > >         migrate_pages
> > >             migrate_folio_move
> > >                 move_to_new_folio
> > >
> > > mapping->aops->migrate_folio(buffer_migrate_folio_norefs->__buffer_migrate_folio)
> > >
> > > [2]
> > > static int __buffer_migrate_folio(struct address_space *mapping,
> > >                 struct folio *dst, struct folio *src, enum migrate_mode mode,
> > >                 bool check_refs)
> > > {
> > > ...
> > >         if (check_refs) {
> > >                 bool busy;
> > >                 bool invalidated = false;
> > >
> > > recheck_buffers:
> > >                 busy = false;
> > >                 spin_lock(&mapping->i_private_lock);
> > >                 bh = head;
> > >                 do {
> > >                         if (atomic_read(&bh->b_count)) {
> > >           // My case failed here as the bh is referenced by a journal head.
> > >                                 busy = true;
> > >                                 break;
> > >                         }
> > >                         bh = bh->b_this_page;
> > >                 } while (bh != head);
> >
> > Correct. Currently pages with journal heads attached cannot be migrated,
> > mostly out of caution, because the generic code cannot be sure what is
> > happening with them. As I wrote in [1], we could make migratable those
> > pages whose jhs are only on the checkpoint list, since for them the
> > buffer lock is enough to stop anybody from touching the bh data. Bhs
> > which are part of a running or committing transaction are not
> > realistically migratable, but those states are more short-lived, so it
> > shouldn't be a big problem.
> From our test case we observe that the jh remains attached for a long
> time when journal->j_free is bigger than j_max_transaction_buffers, which
> makes cma_alloc fail. So do you think this is rare or abnormal?
> 
> [6] j_free & j_max_transaction_buffers
> crash_arm64_v8.0.4++> struct
> journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x
>   j_free = 0x3f1,
>   j_max_transaction_buffers = 0x100,

So the jh can stay attached to the bh for a very long time (basically only
memory pressure will evict it) and this is what blocks migration. But what
I meant is that, in fact, most of the time we can migrate a bh with a jh
attached just fine. There are only relatively short periods (max 5s) where
a buffer (jh) is part of a running or committing transaction and then we
cannot really migrate it.

> > > > > > Just curious, because in general I'm blessed by not having to use CMA
> > > > > > in the first place (not having I/O devices so primitive that they can't
> > > > > > do scatter-gather :-).  So I don't tend to use CMA, and obviously I'm
> > > > > > missing some of the design considerations behind CMA.  I thought in
> > > > > > general CMA tends to be used in early boot to allocate things like frame
> > > > > > buffers, and after that CMA doesn't tend to get used at all?  That's
> > > > > > clearly not the case for you, apparently?
> > > > >
> > > > > Yes. CMA is designed for contiguous physical memory, and cma_alloc is
> > > > > used throughout the system's lifetime, especially on systems without
> > > > > an SMMU, e.g. by DRM drivers. MIGRATE_MOVABLE page blocks can also
> > > > > have the compaction path retried many times, which is common during
> > > > > high-order alloc_pages.
> > > >
> > > > But then what's the point of using CMA-eligible memory for the buffer
> > > > cache, as opposed to just always using !__GFP_MOVABLE for all buffer
> > > > cache allocations?  After all, that's what is being proposed for
> > > > ext4's ext4_getblk().  What's the downside of avoiding the use of
> > > > CMA-eligible memory for ext4's buffer cache?  Why not do this for
> > > > *all* buffers in the buffer cache?
> > > Since migration arising from alloc_pages or cma_alloc happens all the
> > > time, we need appropriate users of MOVABLE pages. AFAIU, page cache
> > > pages of regular files are the best candidates for migration, as we
> > > just need to update the page cache and PTEs. In fact, all filesystems
> > > apply __GFP_MOVABLE to their regular files via the functions below.
> > >
> > > new_inode
> > >     alloc_inode
> > >         inode_init_always(struct super_block *sb, struct inode *inode)
> > >         {
> > >          ...
> > >             mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
> >
> > Here you speak about data page cache pages. Indeed they can be allocated
> > from CMA area. But when Ted speaks about "buffer cache" he specifically
> > means page cache of the block device inode and there I can see:
> >
> > struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
> > {
> > ...
> >         mapping_set_gfp_mask(&inode->i_data, GFP_USER);
> > ...
> > }
> >
> > so at this point I'm confused how come you can see block device pages in
> > CMA area. Are you using data=journal mode of ext4 in your setup by any
> > chance? That would explain it but then that is a horrible idea as well...
> The page 'fffffffe01a51c00' [1], which has a bh attached, comes from
> creating bitmap_blk via ext4_getblk->sb_getblk in process [2], where the
> gfp mask has __GFP_MOVABLE set. IMO, GFP_USER is used for regular file
> pages under the super_block but is not applied to metadata, right?

Ah, right, __getblk() overrides the GFP mode set in bdev's mapping_gfp_mask
and sets __GFP_MOVABLE there. This behavior goes way back to 2007
(769848c03895b ("Add __GFP_MOVABLE for callers to flag allocations from
high memory that may be migrated")). So another clean and simple way to fix
this (as Ted suggests) would be to stop setting __GFP_MOVABLE in
__getblk(). For ext4 there would be no practical difference from the
current situation, where metadata pages are already practically unmovable
due to attached jhs; other filesystems can set __GFP_MOVABLE in bdev's
mapping_gfp_mask if they care (but I don't think there are other big
buffer cache users on systems which need CMA).

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR



