On Thu, 16 Mar 2017 15:08:46 +0800 Wei Wang <wei.w.wang@xxxxxxxxx> wrote:

> From: Liang Li <liang.z.li@xxxxxxxxx>
>
> This patch adds a function to provide a snapshot of the present system
> unused pages. An important usage of this function is to provide the
> unused pages to the Live migration thread, which skips the transfer of
> those unused pages. Newly used pages can be re-tracked by the dirty
> page logging mechanisms.

I don't think this will be useful for anything other than
virtio-balloon.  I guess it would be better to keep this code in the
virtio-balloon driver if possible, even though that's rather a layering
violation :(  What would have to be done to make that possible?  Perhaps
we can put some *small* helpers into page_alloc.c to prevent things
from becoming too ugly.

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4498,6 +4498,120 @@ void show_free_areas(unsigned int filter)
>  	show_swap_cache_info();
>  }
>
> +static int __record_unused_pages(struct zone *zone, int order,
> +				 __le64 *buf, unsigned int size,
> +				 unsigned int *offset, bool part_fill)
> +{
> +	unsigned long pfn, flags;
> +	int t, ret = 0;
> +	struct list_head *curr;
> +	__le64 *chunk;
> +
> +	if (zone_is_empty(zone))
> +		return 0;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +
> +	if (*offset + zone->free_area[order].nr_free > size && !part_fill) {
> +		ret = -ENOSPC;
> +		goto out;
> +	}
> +	for (t = 0; t < MIGRATE_TYPES; t++) {
> +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> +			chunk = buf + *offset;
> +			if (*offset + 2 > size) {
> +				ret = -ENOSPC;
> +				goto out;
> +			}
> +			/* Align to the chunk format used in virtio-balloon */
> +			*chunk = cpu_to_le64(pfn << 12);
> +			*(chunk + 1) = cpu_to_le64((1 << order) << 12);
> +			*offset += 2;
> +		}
> +	}
> +
> +out:
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	return ret;
> +}

This looks like it could disable interrupts for a long time.  Too long?

> +/*
> + * The record_unused_pages() function is used to record the system unused
> + * pages. The unused pages can be skipped to transfer during live migration.
> + * Though the unused pages are dynamically changing, dirty page logging
> + * mechanisms are able to capture the newly used pages though they were
> + * recorded as unused pages via this function.
> + *
> + * This function scans the free page list of the specified order to record
> + * the unused pages, and chunks those continuous pages following the chunk
> + * format below:
> + * --------------------------------------
> + * |  Base (52-bit)     | Rsvd (12-bit) |
> + * --------------------------------------
> + * --------------------------------------
> + * |  Size (52-bit)     | Rsvd (12-bit) |
> + * --------------------------------------
> + *
> + * @start_zone: zone to start the record operation.
> + * @order: order of the free page list to record.
> + * @buf: buffer to record the unused page info in chunks.
> + * @size: size of the buffer in __le64 to record
> + * @offset: offset in the buffer to record.
> + * @part_fill: indicate if partial fill is used.
> + *
> + * return -EINVAL if parameter is invalid
> + * return -ENOSPC when the buffer is too small to record all the unused pages
> + * return 0 when success
> + */

It's a strange thing - it returns information which will instantly
become incorrect.

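For reference, the chunk format in the quoted comment can be sketched in
userspace as follows.  This is only an illustration of the encoding, not
code from the patch: struct chunk, CHUNK_SHIFT, encode_chunk(),
chunk_pfn() and chunk_pages() are hypothetical names, 4 KiB pages are
assumed (matching the "<< 12" in the patch), and values are kept in host
byte order where the patch uses cpu_to_le64().

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, matching the "<< 12" in the quoted patch. */
#define CHUNK_SHIFT 12

/* Hypothetical userspace stand-in for one (base, size) pair in the buffer. */
struct chunk {
	uint64_t base;	/* pfn in the upper 52 bits, low 12 bits reserved */
	uint64_t size;	/* page count in the upper 52 bits, low 12 bits reserved */
};

/* Encode one free block of 2^order pages starting at pfn. */
static struct chunk encode_chunk(uint64_t pfn, unsigned int order)
{
	struct chunk c;

	/* The shifted page count must still fit in 64 bits. */
	assert(order < 52);

	c.base = pfn << CHUNK_SHIFT;
	/* Widen to 64 bits before shifting (a choice made for this sketch). */
	c.size = ((uint64_t)1 << order) << CHUNK_SHIFT;
	return c;
}

/* Recover the starting pfn from an encoded chunk. */
static uint64_t chunk_pfn(struct chunk c)
{
	return c.base >> CHUNK_SHIFT;
}

/* Recover the number of pages from an encoded chunk. */
static uint64_t chunk_pages(struct chunk c)
{
	return c.size >> CHUNK_SHIFT;
}
```

With these helpers, encode_chunk(0x1234, 3) yields base 0x1234000 and
size 0x8000, i.e. eight pages starting at pfn 0x1234, and the low 12
reserved bits of both words stay zero.
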
> +int record_unused_pages(struct zone **start_zone, int order,
> +			__le64 *buf, unsigned int size,
> +			unsigned int *offset, bool part_fill)
> +{
> +	struct zone *zone;
> +	int ret = 0;
> +	bool skip_check = false;
> +
> +	/* Make sure all the parameters are valid */
> +	if (buf == NULL || offset == NULL || order >= MAX_ORDER)
> +		return -EINVAL;
> +
> +	if (*start_zone != NULL) {
> +		bool found = false;
> +
> +		for_each_populated_zone(zone) {
> +			if (zone != *start_zone)
> +				continue;
> +			found = true;
> +			break;
> +		}
> +		if (!found)
> +			return -EINVAL;
> +	} else
> +		skip_check = true;
> +
> +	for_each_populated_zone(zone) {
> +		/* Start from *start_zone if it's not NULL */
> +		if (!skip_check) {
> +			if (*start_zone != zone)
> +				continue;
> +			else
> +				skip_check = true;
> +		}
> +		ret = __record_unused_pages(zone, order, buf, size,
> +					    offset, part_fill);
> +		if (ret < 0) {
> +			/* record the failed zone */
> +			*start_zone = zone;
> +			break;
> +		}
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(record_unused_pages);
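
The resume-from-failed-zone behaviour of record_unused_pages() can be
modelled in userspace roughly as below.  This is a simplified sketch
under stated assumptions, not the patch's logic: struct fake_zone,
record_zone() and record_zones() are hypothetical, a zone is reduced to
a count of free blocks, zones are indexed rather than walked with
for_each_populated_zone(), only the all-or-nothing case (part_fill ==
false) is modelled, and the up-front space check counts two buffer
slots per block, which differs from the check in the quoted code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a zone: only the number of free blocks. */
struct fake_zone {
	unsigned int nr_free;
};

/*
 * Record one zone.  Each free block consumes two buffer slots (base and
 * size).  All-or-nothing: if the zone does not fit, nothing is recorded.
 */
static int record_zone(const struct fake_zone *z, unsigned int size,
		       unsigned int *offset)
{
	if (*offset + 2 * z->nr_free > size)
		return -ENOSPC;

	*offset += 2 * z->nr_free;
	return 0;
}

/*
 * Scan zones starting at index *start.  On -ENOSPC, store the index of
 * the zone that did not fit in *start so the caller can drain the
 * buffer, reset *offset and call again to resume.
 */
static int record_zones(const struct fake_zone *zones, size_t nr_zones,
			size_t *start, unsigned int size,
			unsigned int *offset)
{
	size_t i;

	for (i = *start; i < nr_zones; i++) {
		int ret = record_zone(&zones[i], size, offset);

		if (ret < 0) {
			*start = i;	/* remember the failed zone */
			return ret;
		}
	}

	*start = nr_zones;	/* everything recorded */
	return 0;
}
```

A caller would loop on -ENOSPC: consume the filled part of the buffer,
set the offset back to 0 and call record_zones() again with the saved
start index.  Nothing in this model holds a lock between calls, so as
with the real function the recorded state may already be stale by the
time the caller looks at it.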