On 5/16/24 11:32 PM, Karim Manaouil wrote:
> On Thu, May 16, 2024 at 02:05:24PM -0600, Yu Zhao wrote:
>> For example, if we have two systems, one has lower fragmentation for
>> some orders but higher fragmentation for the rest, and the other is
>> the opposite. How would we be able to use a single measure to describe
>> this? IOW, I don't think a single measurement can describe all orders
>> in a comparable way, which would be the weakest requirement we would
>> have to impose.
>>
>> As I (badly) explained earlier, a single value can't do that because
>> different orders are not on the same footing (so to speak), unless
>> we are only interested in one non-zero order. So we would need
>> fragmentation_index[NR_non_zero_orders].
>>
>> No, for example, A can allocate 4 order-1 but 0 order-2, and B can
>> allocate 2 order-1 *or* 1 order-2; which one would you say is better
>> or worse? This, IMO, depends on which order you are trying to
>> allocate. Does it make sense?
>
> But higher-order pages can always be broken down into lower-order pages,
> whereas the inverse is not guaranteed (the pages may not be buddies, and
> compaction/reclaim may not help).
>
> Obviously, I would rather have one order-4 page than two order-3 pages.
> You can always satisfy an allocation of order n if a page of an order
> higher than n is available.
>
> One way to measure fragmentation is to compare how far we are from some
> perfect value, where the perfect value represents the case when all the
> free memory is available as blocks of pageblock_order or MAX_PAGE_ORDER.
>
> I can do this as a one-shot calculation, for example with:
>
> static void estimate_numa_fragmentation(void)
> {
> 	pg_data_t *pgdat;
> 	struct zone *z;
> 	unsigned long fragscore;
> 	unsigned long bestscore;
> 	unsigned long nr_free;
> 	int order;
>
> 	for_each_online_pgdat(pgdat) {
> 		nr_free = fragscore = 0;
> 		z = pgdat->node_zones;
> 		while (z < (pgdat->node_zones + pgdat->nr_zones)) {
> 			if (!populated_zone(z)) {
> 				z++;
> 				continue;
> 			}
> 			spin_lock_irq(&z->lock);
> 			for (order = 0; order < NR_PAGE_ORDERS; order++) {
> 				nr_free += z->free_area[order].nr_free << order;
> 				fragscore += z->free_area[order].nr_free << (order * 2);
> 			}
> 			spin_unlock_irq(&z->lock);
> 			z++;
> 			cond_resched();
> 		}
> 		bestscore = nr_free << MAX_PAGE_ORDER;
> 		fragscore = ((bestscore - fragscore) * 100) / bestscore;
> 		pr_info("fragscore on node %d: %lu\n", pgdat->node_id, fragscore);
> 	}
> }

I've had a similar idea, which may be exactly the same as yours; I can't
immediately tell, because I haven't formalized mine and only have an
intuitive explanation. But the basic premise is the same: if "continuity"
is a value between 0 and 100, then having all free memory available as
MAX_PAGE_ORDER blocks should score 100. If, say, 50% of the free memory
is in MAX_PAGE_ORDER blocks, then having the other 50% fully available as
MAX_PAGE_ORDER - 1 blocks should contribute another 25%, for a total of
75%, and so on. Maybe the decay of the contribution per order step down
should be less than 2x, I'm not sure.

In the case of Yu Zhao's example, "A can allocate 4 order-1 but 0
order-2, and B can allocate 2 order-1 *or* 1 order-2", I think this
should at least result in the same score? And of course it's not useful
if we don't know which allocations we actually need, but it may be
useful enough to compare two systems.
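To make the intuition concrete, below is a rough, untested userspace
sketch of that scoring; continuity_score() and its calling convention
are made up for illustration, and the 2x per-order decay is the one
described above:

/*
 * A rough, untested sketch of the "continuity" score described above.
 * nr_free[o] is the number of free blocks of order o (one column of
 * /proc/buddyinfo); max_order corresponds to the kernel's
 * MAX_PAGE_ORDER (10 with the default config, i.e. 11 columns).
 *
 * Each free block of order o covers (1 << o) pages and is weighted by
 * 2^(o - max_order), so memory free at max_order counts in full and
 * every step down an order halves the contribution:
 *
 *	score = 100 * sum(nr_free[o] << (2 * o)) / (total_pages << max_order)
 */
static unsigned int continuity_score(const unsigned long *nr_free, int max_order)
{
	unsigned long long weighted = 0, total = 0;
	int o;

	for (o = 0; o <= max_order; o++) {
		total    += (unsigned long long)nr_free[o] << o;
		weighted += (unsigned long long)nr_free[o] << (2 * o);
	}
	if (!total)
		return 100;	/* no free memory; call it fully continuous */
	return weighted * 100 / (total << max_order);
}

For the 50%/50% case above this gives 0.5 * 100 + 0.5 * 50 = 75 as
intended. And if I expand the sums, it looks like exactly 100 minus the
fragscore from your snippet, so the two ideas may indeed be the same
thing.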
> But there must be a way to streamline the calculation and update the
> value with low overhead over time.

I think instead of trying to track anything in the kernel and setting it
in stone, the metric should be calculated in userspace after reading the
base values from /proc/buddyinfo (not fragindex). It could perhaps ignore
small special zones like ZONE_DMA, and the per-order numbers of the other
zones could be summed together. One could also differentiate per
migratetype using /proc/pagetypeinfo, but then it's no longer a single
number.
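As a rough illustration (untested) of what such a userspace tool could
look like, summing all zones except DMA per node and reusing the
continuity_score() helper sketched above; MAX_ORDERS and MAX_NODES are
arbitrary assumptions for the sketch:

#include <stdio.h>
#include <string.h>

#define MAX_ORDERS	11	/* MAX_PAGE_ORDER + 1 columns in buddyinfo */
#define MAX_NODES	64	/* arbitrary limit for this sketch */

/* continuity_score() as sketched earlier in the thread */

int main(void)
{
	static unsigned long nr_free[MAX_NODES][MAX_ORDERS];
	char line[512], zone[32];
	int node, max_node = -1;
	FILE *f = fopen("/proc/buddyinfo", "r");

	if (!f) {
		perror("/proc/buddyinfo");
		return 1;
	}

	/* lines look like: "Node 0, zone   Normal   217   992   ..." */
	while (fgets(line, sizeof(line), f)) {
		unsigned long count;
		int off, order = 0;
		char *p = line;

		if (sscanf(p, "Node %d, zone %31s%n", &node, zone, &off) != 2)
			continue;
		/* ignore the small special DMA zone, sum up the others */
		if (node < 0 || node >= MAX_NODES || !strcmp(zone, "DMA"))
			continue;
		p += off;
		while (order < MAX_ORDERS && sscanf(p, "%lu%n", &count, &off) == 1) {
			nr_free[node][order++] += count;
			p += off;
		}
		if (node > max_node)
			max_node = node;
	}
	fclose(f);

	for (node = 0; node <= max_node; node++)
		printf("node %d: continuity %u\n", node,
		       continuity_score(nr_free[node], MAX_ORDERS - 1));
	return 0;
}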