On Fri, Dec 27, 2024 at 10:38:45PM -0500, Gregory Price wrote:
> On Fri, Dec 27, 2024 at 02:09:50PM -0500, Gregory Price wrote:
>
> This seems to imply that the overhead we're seeing from read() even
> when filecache is on the remote node isn't actually related to the
> memory speed, but instead likely related to some kind of stale
> metadata in the filesystem or filecache layers.
>
> ~Gregory

Mystery solved:

+void promotion_candidate(struct folio *folio)
+{
... snip ...
+	list_add(&folio->lru, promo_list);
+}

read(file, length) will do a linear read, and promotion_candidate will
add those pages at the promotion list head, resulting in a reversed
promotion order: if you read folios [1,2,3,4], you'll promote them in
[4,3,2,1] order.

The result of this, on an unloaded system, is essentially that pages
end up in the worst possible layout for the prefetcher, and therefore
for TLB hits.

I figured this out because I was seeing the additional ~30% overhead
show up purely in `copy_page_to_iter()` (i.e. copy_to_user).

Swapping this for list_add_tail results in the following test result:

initializing
Read loop took 9.41 seconds   <- reading from CXL
Read loop took 31.74 seconds  <- migration enabled
Read loop took 10.31 seconds
Read loop took 7.71 seconds   <- migration finished
Read loop took 7.71 seconds
Read loop took 7.70 seconds
Read loop took 7.75 seconds
Read loop took 19.34 seconds  <- dropped caches
Read loop took 13.68 seconds  <- cache refilling to DRAM
Read loop took 7.37 seconds
Read loop took 7.68 seconds
Read loop took 7.65 seconds   <- back to DRAM baseline

On our CXL devices, we're seeing a 22-27% performance penalty for a
file hosted entirely out of CXL. When we promote this file out of CXL,
we see a 22-27% performance boost.

list_add_tail is probably right here: since files *tend to* be read
linearly with `read()`, this should *tend toward* optimal.
That said, we can probably make this more reliable by adding a batch
migration function, `mpol_migrate_misplaced_batch()`, which also tries
to do bulk allocation of the destination folios. That would probably
save us a bunch of invalidation overhead as well.

I'm also noticing that the migration limit (256 MB/s) is not being
respected, probably because we're migrating one folio at a time
instead of a batch. I'll probably look at changing
promotion_candidate to limit the number of pages selected for
promotion per read call.

---
diff --git a/mm/migrate.c b/mm/migrate.c
index f965814b7d40..99b584f22bcb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2675,7 +2675,7 @@ void promotion_candidate(struct folio *folio)
 		folio_putback_lru(folio);
 		return;
 	}
-	list_add(&folio->lru, promo_list);
+	list_add_tail(&folio->lru, promo_list);
 	return;
 }