On Thu, Sep 12, 2024 at 4:41 PM Jan Kara <jack@xxxxxxx> wrote:
>
> On Wed 04-09-24 14:56:29, Zhaoyang Huang wrote:
> > On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@xxxxxxx> wrote:
> > > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote:
> > > > >
> > > > > After all, using GFP_MOVEABLE memory seems to mean that the buffer
> > > > > cache might get thrashed a lot by having a lot of cached disk buffers
> > > > > getting ejected from memory to try to make room for some contiguous
> > > > > frame buffer memory, which means extra I/O overhead. So what's the
> > > > > upside of using GFP_MOVEABLE for the buffer cache?
> > > >
> > > > To my understanding, no: using GFP_MOVEABLE memory doesn't introduce
> > > > extra I/O, as those pages are migrated to free pages instead of being
> > > > ejected directly when they sit in the target memory area. In terms of
> > > > reclaim, all migrate types of page block are in the same position.
> > >
> > > Where is that being done? I don't see any evidence of this kind of
> > > migration in fs/buffer.c.
> >
> > The journaled pages which carry jh->bh are treated as file pages
> > during isolation of a range of PFNs, in the call stack below [1]. The bh
> > is migrated via the mapping's aops->migrate_folio, which performs what
> > you describe: copy the content and reattach the bh to a new page. For a
> > journal-enabled ext4 partition, the inode is a blockdev inode, which
> > uses buffer_migrate_folio_norefs as its migrate_folio [2].
> >
> > [1]
> > cma_alloc/alloc_contig_range
> >   __alloc_contig_migrate_range
> >     migrate_pages
> >       migrate_folio_move
> >         move_to_new_folio
> >           mapping->aops->migrate_folio
> >             (buffer_migrate_folio_norefs -> __buffer_migrate_folio)
> >
> > [2]
> > static int __buffer_migrate_folio(struct address_space *mapping,
> > 		struct folio *dst, struct folio *src, enum migrate_mode mode,
> > 		bool check_refs)
> > {
> > ...
> > 	if (check_refs) {
> > 		bool busy;
> > 		bool invalidated = false;
> >
> > recheck_buffers:
> > 		busy = false;
> > 		spin_lock(&mapping->i_private_lock);
> > 		bh = head;
> > 		do {
> > 			if (atomic_read(&bh->b_count)) {
> > 				// My case fails here: the bh is pinned by a journal head.
> > 				busy = true;
> > 				break;
> > 			}
> > 			bh = bh->b_this_page;
> > 		} while (bh != head);
>
> Correct. Currently pages with journal heads attached cannot be migrated,
> mostly out of pure caution, because the generic code isn't sure what is
> happening with them. As I wrote in [1], we could make pages with jhs on
> the checkpoint list migratable, since for those the buffer lock is enough
> to stop anybody from touching the bh data. Bhs which are part of a
> running / committing transaction are not realistically migratable, but
> those states are more short-lived, so it shouldn't be a big problem.

From what we observe in our test case, the jh remains in place for a long
time whenever journal->j_free is bigger than j_max_transaction_buffers,
which makes cma_alloc fail [6]. So do you think this is rare or abnormal?

[6] j_free & j_max_transaction_buffers
crash_arm64_v8.0.4++> struct journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x
  j_free = 0x3f1,
  j_max_transaction_buffers = 0x100,

> > > > > Just curious, because in general I'm blessed by not having to use CMA
> > > > > in the first place (not having I/O devices too primitive so they can't
> > > > > do scatter-gather :-). So I don't tend to use CMA, and obviously I'm
> > > > > missing some of the design considerations behind CMA. I thought in
> > > > > general CMA tends to be used in early boot to allocate things like
> > > > > frame buffers, and after that CMA doesn't tend to get used at all?
> > > > > That's clearly not the case for you, apparently?
> > > >
> > > > Yes. CMA is designed for contiguous physical memory and is used via
> > > > cma_alloc over the whole system lifetime, especially on systems
> > > > without an SMMU, e.g. by DRM drivers.
> > > > In terms of MIGRATE_MOVABLE page blocks, compaction may also retry
> > > > on them many times, which is common during high-order alloc_pages.
> > >
> > > But then what's the point of using CMA-eligible memory for the buffer
> > > cache, as opposed to just always using !__GFP_MOVEABLE for all buffer
> > > cache allocations? After all, that's what is being proposed for
> > > ext4's ext4_getblk(). What's the downside of avoiding the use of
> > > CMA-eligible memory for ext4's buffer cache? Why not do this for
> > > *all* buffers in the buffer cache?
> >
> > Since migration arising from alloc_pages or cma_alloc happens all the
> > time, we need suitable users of MOVABLE pages. AFAIU, page cache pages
> > of regular files are the best candidates for migration, as we just
> > need to update the page cache and PTEs. Actually, all filesystems
> > apply GFP_MOVABLE to their regular files via the functions below:
> >
> > new_inode
> >   alloc_inode
> >     inode_init_always(struct super_block *sb, struct inode *inode)
> >     {
> >     ...
> >         mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
>
> Here you speak about data page cache pages. Indeed they can be allocated
> from the CMA area. But when Ted speaks about the "buffer cache" he
> specifically means the page cache of the block device inode, and there I
> can see:
>
> struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
> {
> ...
> 	mapping_set_gfp_mask(&inode->i_data, GFP_USER);
> ...
> }
>
> so at this point I'm confused how come you can see block device pages in
> the CMA area. Are you using data=journal mode of ext4 in your setup by
> any chance? That would explain it, but then that is a horrible idea as
> well...

The page 'fffffffe01a51c00' [1], which has a bh attached, comes from
creating bitmap_blk via ext4_getblk->sb_getblk within the process shown
in [2], where the gfp mask has GFP_MOVABLE. IMO, GFP_USER is used for
regular file pages under the super_block but does not apply to the
metadata, right?
[1]
crash_arm64_v8.0.4++> kmem -p | grep ffffff808f0aa150  (sb->s_bdev->bd_inode->i_mapping)
fffffffe01a51c00  e9470000 ffffff808f0aa150   3 2 8000000008020    lru,private  // within CMA area
fffffffe03d189c0 174627000 ffffff808f0aa150   4 2 2004000000008020 lru,private
fffffffe03d88e00 176238000 ffffff808f0aa150 3f9 2 2008000000008020 lru,private
fffffffe03d88e40 176239000 ffffff808f0aa150   6 2 2008000000008020 lru,private
fffffffe03d88e80 17623a000 ffffff808f0aa150   5 2 2008000000008020 lru,private
fffffffe03d88ec0 17623b000 ffffff808f0aa150   1 2 2008000000008020 lru,private
fffffffe03d88f00 17623c000 ffffff808f0aa150   0 2 2008000000008020 lru,private
fffffffe040e6540 183995000 ffffff808f0aa150 3f4 2 2004000000008020 lru,private

[2]
02148 < 4> [   14.133703] [08-11 18:38:25.133] __find_get_block+0x29c/0x634
02149 < 4> [   14.133711] [08-11 18:38:25.133] __getblk_gfp+0xa8/0x290
0214A < 4> [   14.133716] [08-11 18:38:25.133] ext4_read_inode_bitmap+0xa0/0x6c4
0214B < 4> [   14.133725] [08-11 18:38:25.133] __ext4_new_inode+0x34c/0x10d4
0214C < 4> [   14.133730] [08-11 18:38:25.133] ext4_create+0xdc/0x1cc
0214D < 4> [   14.133737] [08-11 18:38:25.133] path_openat+0x4fc/0xc84
0214E < 4> [   14.133745] [08-11 18:38:25.133] do_filp_open+0xc0/0x16c
0214F < 4> [   14.133751] [08-11 18:38:25.133] do_sys_openat2+0x8c/0xf8
02150 < 4> [   14.133758] [08-11 18:38:25.133] __arm64_sys_openat+0x78/0xa4
02151 < 4> [   14.133764] [08-11 18:38:25.133] invoke_syscall+0x60/0x11c
02152 < 4> [   14.133771] [08-11 18:38:25.133] el0_svc_common+0xb4/0xe8
02153 < 4> [   14.133777] [08-11 18:38:25.133] do_el0_svc+0x24/0x30
02154 < 4> [   14.133783] [08-11 18:38:25.133] el0_svc+0x3c/0x70

>								Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR