On 15.07.19 11:33, David Hildenbrand wrote: > On 11.07.19 20:21, Dave Hansen wrote: >> On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote: >>> +static void bm_set_pfn(struct page *page) >>> +{ >>> + struct zone *zone = page_zone(page); >>> + int zone_idx = page_zonenum(page); >>> + unsigned long bitnr = 0; >>> + >>> + lockdep_assert_held(&zone->lock); >>> + bitnr = pfn_to_bit(page, zone_idx); >>> + /* >>> + * TODO: fix possible underflows. >>> + */ >>> + if (free_area[zone_idx].bitmap && >>> + bitnr < free_area[zone_idx].nbits && >>> + !test_and_set_bit(bitnr, free_area[zone_idx].bitmap)) >>> + atomic_inc(&free_area[zone_idx].free_pages); >>> +} >> >> Let's say I have two NUMA nodes, each with ZONE_NORMAL and ZONE_MOVABLE >> and each zone with 1GB of memory: >> >> Node: 0 1 >> NORMAL 0->1GB 2->3GB >> MOVABLE 1->2GB 3->4GB >> >> This code will allocate two bitmaps. The ZONE_NORMAL bitmap will >> represent data from 0->3GB and the ZONE_MOVABLE bitmap will represent >> data from 1->4GB. That's the result of this code: >> >>> + if (free_area[zone_idx].base_pfn) { >>> + free_area[zone_idx].base_pfn = >>> + min(free_area[zone_idx].base_pfn, >>> + zone->zone_start_pfn); >>> + free_area[zone_idx].end_pfn = >>> + max(free_area[zone_idx].end_pfn, >>> + zone->zone_start_pfn + >>> + zone->spanned_pages); >> >> But that means that both bitmaps will have space for PFNs in the other >> zone type, which is completely bogus. This is fundamental because the >> data structures are incorrectly built per zone *type* instead of per zone. >> > > I don't think it's incorrect, it's just not optimal in all scenarios. > E.g., in you example, this approach would "waste" 2 * 1GB of tracking > data for the wholes (2* 64bytes when using 1 bit for 2MB). > > FWIW, this is not a numa-specific thingy. We can have sparse zones > easily on single-numa systems. > > Node: 0 > NORMAL 0->1GB, 2->3GB > MOVABLE 1->2GB, 3->4GB > > So tracking it per zones instead instead of zone type is only one part > of the story. > Oh, and FWIW, in setups like Node: 0 1 NORMAL 4->5GB, 6->7GB 5->6GB, 8->9GB What Nitesh proposes is actually better. So it really depends on the use case - but in general sparsity is the issue. -- Thanks, David / dhildenb