On Thu 12-09-24 17:10:44, Zhaoyang Huang wrote:
> On Thu, Sep 12, 2024 at 4:41 PM Jan Kara <jack@xxxxxxx> wrote:
> >
> > On Wed 04-09-24 14:56:29, Zhaoyang Huang wrote:
> > > On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@xxxxxxx> wrote:
> > > > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote:
> > > > > >
> > > > > > After all, using GFP_MOVABLE memory seems to mean that the
> > > > > > buffer cache might get thrashed a lot by having a lot of
> > > > > > cached disk buffers getting ejected from memory to try to
> > > > > > make room for some contiguous frame buffer memory, which
> > > > > > means extra I/O overhead. So what's the upside of using
> > > > > > GFP_MOVABLE for the buffer cache?
> > > > >
> > > > > To my understanding, no: using GFP_MOVABLE memory doesn't
> > > > > introduce extra IO, because when such pages fall inside the
> > > > > target memory area they are migrated to free pages instead of
> > > > > being ejected directly. In terms of reclaim, page blocks of
> > > > > all migrate types are treated the same.
> > > >
> > > > Where is that being done? I don't see any evidence of this kind
> > > > of migration in fs/buffer.c.
> > >
> > > The journaled pages which carry jh->bh are treated as file pages
> > > during isolation of a range of PFNs in the call stack below [1].
> > > The bh will be migrated via each aops's migrate_folio, which does
> > > what you describe below, i.e. copies the content and reattaches
> > > the bh to a new page. For a journal-enabled ext4 partition the
> > > inode is a blockdev inode, which uses buffer_migrate_folio_norefs
> > > as its migrate_folio [2].
> > >
> > > [1]
> > > cma_alloc/alloc_contig_range
> > >   __alloc_contig_migrate_range
> > >     migrate_pages
> > >       migrate_folio_move
> > >         move_to_new_folio
> > >           mapping->aops->migrate_folio
> > >             (buffer_migrate_folio_norefs->__buffer_migrate_folio)
> > >
> > > [2]
> > > static int __buffer_migrate_folio(struct address_space *mapping,
> > >                 struct folio *dst, struct folio *src,
> > >                 enum migrate_mode mode, bool check_refs)
> > > {
> > > ...
> > >         if (check_refs) {
> > >                 bool busy;
> > >                 bool invalidated = false;
> > >
> > > recheck_buffers:
> > >                 busy = false;
> > >                 spin_lock(&mapping->i_private_lock);
> > >                 bh = head;
> > >                 do {
> > >                         if (atomic_read(&bh->b_count)) {
> > >                                 // My case fails here as the bh is
> > >                                 // referenced by a journal head.
> > >                                 busy = true;
> > >                                 break;
> > >                         }
> > >                         bh = bh->b_this_page;
> > >                 } while (bh != head);
> >
> > Correct. Currently pages with journal heads attached cannot be
> > migrated, mostly out of pure caution because the generic code isn't
> > sure what's happening with them. As I wrote in [1], we could make
> > pages whose jhs sit only on the checkpoint list migratable, since
> > for them the buffer lock is enough to stop anybody from touching
> > the bh data. Bhs which are part of a running / committing
> > transaction are not realistically migratable, but those states are
> > more short-lived, so it shouldn't be a big problem.
>
> From what we observe in our test case, the jh remains there for a
> long time when journal->j_free is bigger than
> j_max_transaction_buffers, which makes cma_alloc fail. So do you
> think this is rare or abnormal?
>
> [6] j_free & j_max_transaction_buffers
> crash_arm64_v8.0.4++> struct journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x
>   j_free = 0x3f1,
>   j_max_transaction_buffers = 0x100,

So the jh can stay attached to the bh for a very long time (basically
only memory pressure will evict it) and this is what blocks migration.
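To spell out why that check trips: attaching a journal head takes its
own reference on the buffer (get_bh() in
jbd2_journal_add_journal_head()), so b_count stays elevated for as
long as the jh exists. The checkpoint-list idea could then look
roughly like the sketch below (untested, written against current jbd2
internals; the helper name jh_is_migratable is made up here for
illustration):

static bool jh_is_migratable(struct journal_head *jh)
{
	/* Part of a running or committing transaction: migration
	 * would race with the journal, so treat the buffer as
	 * pinned. */
	if (jh->b_transaction || jh->b_next_transaction)
		return false;
	/* Only on the checkpoint list: holding the buffer lock is
	 * enough to stop anybody from touching the bh data. */
	return jh->b_cp_transaction != NULL;
}

With the buffer locked, a check like this could stand in for the bare
b_count test in __buffer_migrate_folio() for jbd2-owned buffers.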
But what I meant is that, in fact, most of the time we can migrate a
bh with a jh attached just fine. There are only relatively short
moments (max 5s) where a buffer (jh) is part of a running or
committing transaction, and then we cannot really migrate.

> > > > > > Just curious, because in general I'm blessed by not having
> > > > > > to use CMA in the first place (not having I/O devices too
> > > > > > primitive so they can't do scatter-gather :-). So I don't
> > > > > > tend to use CMA, and obviously I'm missing some of the
> > > > > > design considerations behind CMA. I thought in general CMA
> > > > > > tends to be used in early boot to allocate things like
> > > > > > frame buffers, and after that CMA doesn't tend to get used
> > > > > > at all? That's clearly not the case for you, apparently?
> > > > >
> > > > > Yes. CMA is designed for contiguous physical memory and gets
> > > > > used via cma_alloc throughout the system's whole lifetime,
> > > > > especially on systems without an SMMU, e.g. by the DRM
> > > > > driver. As for MIGRATE_MOVABLE page blocks, the compaction
> > > > > path may also retry on them many times, which is common
> > > > > during high-order alloc_pages.
> > > >
> > > > But then what's the point of using CMA-eligible memory for the
> > > > buffer cache, as opposed to just always using !__GFP_MOVABLE
> > > > for all buffer cache allocations? After all, that's what is
> > > > being proposed for ext4's ext4_getblk(). What's the downside
> > > > of avoiding the use of CMA-eligible memory for ext4's buffer
> > > > cache? Why not do this for *all* buffers in the buffer cache?
> > >
> > > Since migration arising from alloc_pages or cma_alloc happens
> > > all the time, we need appropriate users of MOVABLE pages. AFAIU,
> > > page cache pages of regular files are the best candidates for
> > > migration, as we just need to update the page cache and PTEs.
> > > Actually, all filesystems apply GFP_MOVABLE to their regular
> > > files via the functions below.
> > >
> > > new_inode
> > >   alloc_inode
> > >     inode_init_always(struct super_block *sb, struct inode *inode)
> > >     {
> > >     ...
> > >         mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
> >
> > Here you speak about data page cache pages. Indeed they can be
> > allocated from the CMA area. But when Ted speaks about "buffer
> > cache" he specifically means the page cache of the block device
> > inode, and there I can see:
> >
> > struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
> > {
> > ...
> >         mapping_set_gfp_mask(&inode->i_data, GFP_USER);
> > ...
> > }
> >
> > so at this point I'm confused how come you can see block device
> > pages in the CMA area. Are you using data=journal mode of ext4 in
> > your setup by any chance? That would explain it, but then that is
> > a horrible idea as well...
>
> The page at 'fffffffe01a51c00' [1], which has a bh attached, comes
> from creating bitmap_blk via ext4_getblk->sb_getblk within process
> [2], where the gfp mask has __GFP_MOVABLE. IMO, GFP_USER is used for
> regular file pages under the super_block but is not applied to
> metadata, right?

Ah, right, __getblk() overrides the GFP mode set in the bdev's
mapping_gfp_mask and sets __GFP_MOVABLE there. This behavior goes way
back to 2007 (769848c03895b ("Add __GFP_MOVABLE for callers to flag
allocations from high memory that may be migrated")). So another clean
and simple way to fix this (as Ted suggests) would be to stop setting
__GFP_MOVABLE in __getblk().
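Roughly, that fix would amount to the sketch below (written against a
recent tree; the exact helpers are version-dependent, older kernels go
through __getblk_gfp() instead of bdev_getblk(), and the mapping may
be reached via bdev->bd_inode->i_mapping rather than bdev->bd_mapping):

static inline struct buffer_head *__getblk(struct block_device *bdev,
					   sector_t block, unsigned size)
{
	gfp_t gfp;

	gfp = mapping_gfp_constraint(bdev->bd_mapping, ~__GFP_FS);
	/* No __GFP_MOVABLE here: buffer cache pages then never land
	 * in movable / CMA-eligible pageblocks to begin with. */
	gfp |= __GFP_NOFAIL;
	return bdev_getblk(bdev, block, size, gfp);
}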
For ext4 there would be no practical difference from the current
situation, where metadata pages are practically unmovable anyway due
to the attached jhs; other filesystems can set __GFP_MOVABLE in the
bdev's mapping_gfp_mask if they care (but I don't think there are
other big buffer cache users on systems which need CMA).

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR