On 19 Feb 2019, at 20:38, Anshuman Khandual wrote:
On 02/19/2019 06:26 PM, Matthew Wilcox wrote:
On Tue, Feb 19, 2019 at 01:12:07PM +0530, Anshuman Khandual wrote:
But the location of this temp page matters as well, because you would like to saturate the inter-node interface. It needs to be on one of the nodes where the source or destination page belongs. Any other node would generate two inter-node copy processes, which is not what you intend here, I guess.
That makes no sense. It should be allocated on the local node of the CPU performing the copy. If the CPU is in node A, the destination is in node B and the source is in node C, then you're doing 4k worth of reads from node C, 4k worth of reads from node B, 4k worth of writes to node C followed by 4k worth of writes to node B. Eventually the 4k of dirty cachelines on node A will be written back from cache to the local memory (... or not, if that page gets reused for some other purpose first).

If you allocate the page on node B or node C, that's an extra 4k of writes to be sent across the inter-node link.
That's right, there will be an extra remote write. My assumption was that the CPU performing the copy belongs to either node B or node C.
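To make that traffic accounting concrete, here is a quick back-of-the-envelope sketch (my own user-space illustration, not code from the patch set; the byte counts simply restate the reasoning above for a CPU on node A, the destination page on node B and the source page on node C):

#include <stdio.h>

#define PAGE_SZ 4096UL

int main(void)
{
	/*
	 * Temp page local to the copying CPU (node A): 4k read from C,
	 * 4k read from B, 4k written to B and 4k written to C cross the
	 * inter-node links; the temp page itself stays on node A.
	 */
	unsigned long tmp_on_node_a = 4 * PAGE_SZ;

	/*
	 * Temp page allocated on node B (or C) instead: filling the
	 * temp page adds one more 4k remote write on the link.
	 */
	unsigned long tmp_on_node_b_or_c = 5 * PAGE_SZ;

	printf("temp page on node A:   %lu bytes across the links\n", tmp_on_node_a);
	printf("temp page on node B/C: %lu bytes across the links\n", tmp_on_node_b_or_c);
	return 0;
}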
I have some interesting throughput results for exchange per u64 and exchange per 4KB page. What I discovered is that using a 4KB page as the temporary storage for exchanging 2MB THPs does not improve the throughput. On the contrary, when we are exchanging more than 2^4 = 16 THPs, exchanging per 4KB page has lower throughput than exchanging per u64. Please see the results below.

The experiments were done on a two-socket machine with two Intel Xeon E5-2640 v3 CPUs. All exchanges are done via the QPI link across the two sockets.
Results
===
Throughput (GB/s) of exchanging 2 order-N 2MB pages between two NUMA nodes:

2mb_page_order |    0 |    1 |    2 |    3 |    4 |    5 |    6 |    7 |    8 |    9
u64            | 5.31 | 5.58 | 5.89 | 5.69 | 8.97 | 9.51 | 9.21 | 9.50 | 9.57 | 9.62
per_page       | 5.85 | 6.48 | 6.20 | 5.26 | 7.22 | 7.25 | 7.28 | 7.30 | 7.32 | 7.31

Normalized throughput (u64 relative to per_page):

2mb_page_order |    0 |    1 |    2 |    3 |    4 |    5 |    6 |    7 |    8 |    9
u64            | 0.90 | 0.86 | 0.94 | 1.08 | 1.24 | 1.31 | 1.26 | 1.30 | 1.30 | 1.31
Exchange page code
===
For exchanging per u64, I use the following function:
static void exchange_page(char *to, char *from)
{
	u64 tmp;
	int i;

	/* Swap the two pages one u64 (8 bytes) at a time. */
	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
		tmp = *((u64 *)(from + i));
		*((u64 *)(from + i)) = *((u64 *)(to + i));
		*((u64 *)(to + i)) = tmp;
	}
}
For exchanging per 4KB page, I use the following function:
static void exchange_page2(char *to, char *from)
{
	int cpu = smp_processor_id();

	VM_BUG_ON(!in_atomic());

	if (!page_tmp[cpu]) {
		int nid = cpu_to_node(cpu);
		struct page *page_tmp_page =
			alloc_pages_node(nid, GFP_KERNEL, 0);

		/* Fall back to the per-u64 exchange if the allocation fails. */
		if (!page_tmp_page) {
			exchange_page(to, from);
			return;
		}
		page_tmp[cpu] = kmap(page_tmp_page);
	}

	/* Exchange the two pages via the CPU-local temporary page. */
	copy_page(page_tmp[cpu], to);
	copy_page(to, from);
	copy_page(from, page_tmp[cpu]);
}
where page_tmp is a temporary page pre-allocated local to each CPU; the alloc_pages_node() path above is only there for hot-added CPUs and is not exercised in these tests.
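For completeness, the pre-allocation of page_tmp could look roughly like the sketch below (my own reconstruction of the idea, not the exact code in the branch; the exchange_page_tmp_init() name is made up):

static char *page_tmp[NR_CPUS];

static int __init exchange_page_tmp_init(void)
{
	int cpu;

	/* One node-local temporary page per possible CPU. */
	for_each_possible_cpu(cpu) {
		struct page *page = alloc_pages_node(cpu_to_node(cpu),
						     GFP_KERNEL, 0);

		if (page)
			page_tmp[cpu] = kmap(page);
	}
	return 0;
}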
The kernel is available at: https://gitlab.com/ziy/linux-contig-mem-rfc

To do a comparison, you can clone this repo: https://gitlab.com/ziy/thp-migration-bench, then run make, ./run_test.sh, and ./get_results.sh using the kernel from above.
Let me know if I missed anything or did something wrong. Thanks.
--
Best Regards,
Yan Zi