On Thu 03-08-17 21:17:25, Wei Wang wrote:
> On 08/03/2017 08:41 PM, Michal Hocko wrote:
> >On Thu 03-08-17 20:11:58, Wei Wang wrote:
> >>On 08/03/2017 07:28 PM, Michal Hocko wrote:
> >>>On Thu 03-08-17 19:27:19, Wei Wang wrote:
> >>>>On 08/03/2017 06:44 PM, Michal Hocko wrote:
> >>>>>On Thu 03-08-17 18:42:15, Wei Wang wrote:
> >>>>>>On 08/03/2017 05:11 PM, Michal Hocko wrote:
> >>>>>>>On Thu 03-08-17 14:38:18, Wei Wang wrote:
> >>>>>[...]
> >>>>>>>>+static int report_free_page_block(struct zone *zone, unsigned int order,
> >>>>>>>>+			unsigned int migratetype, struct page **page)
> >>>>>>>This is just too ugly and wrong actually. Never provide struct page
> >>>>>>>pointers outside of the zone->lock. What I've had in mind was to simply
> >>>>>>>walk free lists of the suitable order and call the callback for each one.
> >>>>>>>Something as simple as
> >>>>>>>
> >>>>>>>	for (i = 0; i < MAX_NR_ZONES; i++) {
> >>>>>>>		struct zone *zone = &pgdat->node_zones[i];
> >>>>>>>
> >>>>>>>		if (!populated_zone(zone))
> >>>>>>>			continue;
> >>>>>>>		spin_lock_irqsave(&zone->lock, flags);
> >>>>>>>		for (order = min_order; order < MAX_ORDER; ++order) {
> >>>>>>>			struct free_area *free_area = &zone->free_area[order];
> >>>>>>>			enum migratetype mt;
> >>>>>>>			struct page *page;
> >>>>>>>
> >>>>>>>			if (!free_area->nr_pages)
> >>>>>>>				continue;
> >>>>>>>
> >>>>>>>			for_each_migratetype_order(order, mt) {
> >>>>>>>				list_for_each_entry(page,
> >>>>>>>						&free_area->free_list[mt], lru) {
> >>>>>>>
> >>>>>>>					pfn = page_to_pfn(page);
> >>>>>>>					visit(opaque2, pfn, 1<<order);
> >>>>>>>				}
> >>>>>>>			}
> >>>>>>>		}
> >>>>>>>
> >>>>>>>		spin_unlock_irqrestore(&zone->lock, flags);
> >>>>>>>	}
> >>>>>>>
> >>>>>>>[...]
> >>>>>>I think the above would take the lock for too long time. That's why we
> >>>>>>prefer to take one free page block each time, and taking it one by one
> >>>>>>also doesn't make a difference, in terms of the performance that we
> >>>>>>need.
> >>>>>I think you should start with simple approach and improve incrementally
> >>>>>if this turns out to be not optimal. I really detest taking struct pages
> >>>>>outside of the lock. You never know what might happen after the lock is
> >>>>>dropped. E.g. can you race with the memory hotremove?
> >>>>The caller won't use pages returned from the function, so I think there
> >>>>shouldn't be an issue or race if the returned pages are used (i.e. not free
> >>>>anymore) or simply gone due to hotremove.
> >>>No, this is just too error prone. Consider that struct page pointer
> >>>itself could get invalid in the meantime. Please always keep robustness
> >>>in mind first. Optimizations are nice but it is even not clear whether
> >>>the simple variant will cause any problems.
> >>
> >>how about this:
> >>
> >>for_each_populated_zone(zone) {
> >>    for_each_migratetype_order_decend(min_order, order, type) {
> >>        do {
> >> =>         spin_lock_irqsave(&zone->lock, flags);
> >>            ret = report_free_page_block(zone, order, type,
> >>                                         &page)) {
> >>                pfn = page_to_pfn(page);
> >>                nr_pages = 1 << order;
> >>                visit(opaque1, pfn, nr_pages);
> >>            }
> >> =>         spin_unlock_irqrestore(&zone->lock, flags);
> >>        } while (!ret)
> >>}
> >>
> >>In this way, we can still keep the lock granularity at one free page block
> >>while having the struct page operated under the lock.
> >How can you continue iteration of free_list after the lock has been
> >dropped?
>
> report_free_page_block() has handled all the possible cases after the lock is
> dropped. For example, if the previous reported page has not been on the free
> list, then the first node from the list of this order will be given. This is
> because page allocation takes page blocks from the head to end, for example:
>
> 1,2,3,4,5,6
> if the previous reported free block is 2, when we give 2 to the report function
> to get the next page block, and find 1,2,3 have all gone, it will report 4,
> which is the head of the free list.
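The resume rule Wei describes can be modeled outside the kernel to see what it does. The following is a minimal userspace sketch, not the actual report_free_page_block() (its body is not shown in this thread): one free list is an array of pfns in head-to-tail order, `locked` stands in for zone->lock, and all `toy_*` names are hypothetical. It replays the 1,2,3,4,5,6 example and, unlike the proposed `struct page **` interface, carries only a pfn value across the unlock.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of one free list (a single order and migratetype).
 * Blocks are identified by pfn and stored in head-to-tail order.
 * All toy_* names are hypothetical; this is not kernel code.
 */
#define TOY_MAX_BLOCKS 16

struct toy_free_list {
	unsigned long pfn[TOY_MAX_BLOCKS];
	int nr;
	bool locked;			/* stands in for zone->lock */
};

static void toy_lock(struct toy_free_list *fl)   { fl->locked = true; }
static void toy_unlock(struct toy_free_list *fl) { fl->locked = false; }

/* Allocation takes the block at the head of the list. */
static void toy_alloc_head(struct toy_free_list *fl)
{
	assert(fl->locked && fl->nr > 0);
	for (int i = 1; i < fl->nr; i++)
		fl->pfn[i - 1] = fl->pfn[i];
	fl->nr--;
}

/*
 * Resume rule as described in the thread: if the previously reported
 * block is still on the list, report its successor; if it is gone,
 * report the current head. Pass have_prev = false on the first call.
 * Must be called with the lock held; returns false when nothing is
 * left. The cursor is a pfn value, so no pointer into the list is
 * carried across an unlock.
 */
static bool toy_report_next(const struct toy_free_list *fl, bool have_prev,
			    unsigned long prev, unsigned long *next)
{
	assert(fl->locked);
	if (fl->nr == 0)
		return false;
	if (have_prev) {
		for (int i = 0; i < fl->nr; i++) {
			if (fl->pfn[i] != prev)
				continue;
			if (i + 1 == fl->nr)
				return false;	/* prev was the tail */
			*next = fl->pfn[i + 1];
			return true;
		}
	}
	*next = fl->pfn[0];		/* first call, or prev is gone */
	return true;
}

/*
 * Replays the 1,2,3,4,5,6 example: report 1 and 2 with one lock round
 * trip each, let 1, 2 and 3 be allocated while the reporter does not
 * hold the lock, then resume. Returns the resumed pfn, or 0 if none.
 */
static unsigned long toy_demo_resume(void)
{
	struct toy_free_list fl = { .pfn = { 1, 2, 3, 4, 5, 6 }, .nr = 6 };
	unsigned long cur = 0, next = 0;
	bool have_prev = false;

	for (int i = 0; i < 2; i++) {
		toy_lock(&fl);
		if (toy_report_next(&fl, have_prev, cur, &next)) {
			cur = next;
			have_prev = true;
		}
		toy_unlock(&fl);
	}
	assert(cur == 2);

	/* Another CPU allocates three blocks (modeled sequentially). */
	toy_lock(&fl);
	for (int i = 0; i < 3; i++)
		toy_alloc_head(&fl);
	toy_unlock(&fl);

	toy_lock(&fl);
	bool ok = toy_report_next(&fl, have_prev, cur, &next);
	toy_unlock(&fl);
	return ok ? next : 0;
}
```

toy_demo_resume() resumes at 4, matching Wei's example. The model also makes one limitation of the rule visible: if blocks are freed back onto the head of the list between two calls, restarting from the head can report blocks that were already reported, so a caller built on this rule has to tolerate duplicates.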
As I've said earlier, start simple and optimize incrementally, with some
numbers to justify more subtle code.
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx.
For more info on Linux MM, see: http://www.linux-mm.org/ .
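For comparison, the simple variant Michal suggests can be modeled the same way. This is a hedged userspace sketch, not kernel code: `struct toy_zone` stands in for one zone's free_area array, `lock_depth` for zone->lock, the visitor takes the `(opaque, pfn, nr_pages)` arguments used in the thread, and every `toy_*` name is hypothetical. The lock is taken once per zone, and only pfn values leave the walk.

```c
#include <assert.h>

/*
 * Toy zone: free lists indexed by order and migratetype, each holding
 * block start pfns. Sizes and all toy_* names are hypothetical stand-ins
 * for MAX_ORDER, MIGRATE_TYPES and struct free_area; not kernel code.
 */
#define TOY_MAX_ORDER	4
#define TOY_NR_MT	2
#define TOY_LIST_CAP	8

struct toy_block_list {
	unsigned long pfn[TOY_LIST_CAP];
	int nr;
};

struct toy_zone {
	struct toy_block_list free_area[TOY_MAX_ORDER][TOY_NR_MT];
	int lock_depth;			/* stands in for zone->lock */
};

typedef void (*toy_visit_fn)(void *opaque, unsigned long pfn,
			     unsigned long nr_pages);

/*
 * Simple variant: take the lock once for the whole zone, walk every
 * free list of order >= min_order, and pass the visitor only
 * (pfn, nr_pages) values, never a pointer into the lists.
 */
static void toy_walk_free_blocks(struct toy_zone *zone, unsigned int min_order,
				 toy_visit_fn visit, void *opaque)
{
	zone->lock_depth++;		/* spin_lock_irqsave() */
	for (unsigned int order = min_order; order < TOY_MAX_ORDER; order++) {
		for (int mt = 0; mt < TOY_NR_MT; mt++) {
			const struct toy_block_list *fl = &zone->free_area[order][mt];

			for (int i = 0; i < fl->nr; i++)
				visit(opaque, fl->pfn[i], 1UL << order);
		}
	}
	zone->lock_depth--;		/* spin_unlock_irqrestore() */
}

/* Example visitor: counts blocks and pages, checking the lock is held. */
struct toy_tally {
	const struct toy_zone *zone;
	unsigned long blocks;
	unsigned long pages;
};

static void toy_count(void *opaque, unsigned long pfn, unsigned long nr_pages)
{
	struct toy_tally *t = opaque;

	(void)pfn;
	assert(t->zone->lock_depth == 1);
	t->blocks++;
	t->pages += nr_pages;
}
```

Calling toy_walk_free_blocks(&zone, 1, toy_count, &tally) on a zone holding one order-0 block, two order-1 blocks and one order-3 block reports three blocks and 12 pages and skips order 0. The visitor runs with the lock held for the whole zone walk, which is the hold time Wei is concerned about; Michal's position is that this cost should be measured before trading it for the per-block resume logic.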