Re: [RFC PATCH 0/5] Enhancements to Page Migration with Batch Offloading via DMA

"Garg, Shivank" <shivankg@xxxxxxx> · Tue, 25 Jun 2024 14:27:50 +0530

Hi,

On 6/17/2024 5:10 PM, Garg, Shivank wrote:
> Hi Matthew,
> 
> On 6/15/2024 9:32 AM, Matthew Wilcox wrote:
>> On Sat, Jun 15, 2024 at 03:45:20AM +0530, Shivank Garg wrote:
> 
>>
>> You haven't measured the important thing though -- what's the cost
>> _to userspace_?  When the CPU does the copy, the data is now
>> cache-hot in that CPU's cache.  When the DMA engine does the copy,
>> it's not cache-hot in any CPU.
>>
>> Now, this may not be a big problem.  I don't think we do anything to 
>> ensure that the CPU that is going to access the folio in userspace
>> is the one which does the copy.
>>
>> But your methodology is wrong.
> 
> You're right about the importance of measuring the cost to userspace.
> I initially focused on analyzing the folio_copy overheads within migrate_pages to identify potential optimization opportunities using DMA hardware accelerators.
> 
> To address this, I'm planning to extend my experiments to measure the cost to userspace specifically related to cache-hotness. This will involve accessing the migrated pages after the migration process is complete, and measuring the resulting read/write latency.
> 
> This approach of DMA-offloading could possibly help in scenarios involving bulk data copying where the workload size >> cache capacity, or where the copy incurs a large shootdown overhead.
> 
> The userspace cost analysis will provide a more comprehensive picture of page migration using CPU vs. DMA-offloading.
> 
> I appreciate your feedback.

I extended my earlier experiments for page migration from a remote node to
a local NUMA node. This involves measuring the cost to userspace for
different workload sizes (4KB, 2MB, 512MB, and 1GB).
My experiments capture two scenarios:
1. Smaller workload sizes (4KB and 2MB) that fit within the CPU cache.
2. Larger workload sizes (512MB and 1GB) that exceed cache capacity.

move_pages for N pages from src_node=1 to dst_node=0

Measurement: Mean ± SD is reported in CPU cycles per page (i.e. normalized
by the number of pages, N)
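
To make the normalization concrete, here is a minimal sketch of how the
per-page mean and SD could be derived from the raw cycle counts of several
runs. It is not the actual test code: struct cycle_stats, sqrt_newton() and
per_page_stats() are hypothetical names, and the use of the sample SD (n - 1)
is an assumption.

```c
#include <assert.h>
#include <stddef.h>

struct cycle_stats {
	double mean;	/* mean cycles per page across runs */
	double sd;	/* standard deviation, cycles per page */
};

/* Newton iteration for sqrt(x); used only to avoid a libm dependency. */
static double sqrt_newton(double x)
{
	double r, next;

	if (x <= 0.0)
		return 0.0;
	r = x > 1.0 ? x : 1.0;	/* start at or above sqrt(x) */
	for (;;) {
		next = 0.5 * (r + x / r);
		if (next >= r)	/* stopped decreasing: converged */
			break;
		r = next;
	}
	return r;
}

/*
 * run_cycles[k] is the total cycle count of run k over num_pages pages.
 * Each run is normalized to cycles per page, then the mean and SD are
 * taken across the runs.
 */
static struct cycle_stats per_page_stats(const double *run_cycles,
					 size_t runs, size_t num_pages)
{
	struct cycle_stats s = { 0.0, 0.0 };
	double sum = 0.0, sq = 0.0;

	assert(runs > 0 && num_pages > 0);

	for (size_t k = 0; k < runs; k++)
		sum += run_cycles[k] / (double)num_pages;
	s.mean = sum / (double)runs;

	for (size_t k = 0; k < runs; k++) {
		double d = run_cycles[k] / (double)num_pages - s.mean;

		sq += d * d;
	}
	/* Assumption: sample SD (n - 1); population SD would divide by n. */
	if (runs > 1)
		s.sd = sqrt_newton(sq / (double)(runs - 1));
	return s;
}
```

Dividing each run by N before averaging keeps results for different workload
sizes directly comparable.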

move_pages: Cycles taken by move_pages(2) syscall (cost per page)
uncached_access: Cycles taken to access memory (just after clflush) for pages
on src node 1.
cached_access: Cycles taken to access memory (when everything is previously
touched) for pages on src node 1.
post_move_access: Cycles taken to access memory just after move_pages syscall
(when pages are moved to dst node 0)
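
For illustration, the uncached and cached starting states described above
could be prepared along these lines. This is only a sketch under stated
assumptions, not the code used for the measurements: flush_range(),
touch_range() and CACHE_LINE are hypothetical names, a 64-byte cache line is
assumed, and the buffer is assumed to be cache-line aligned (as pages are).

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>	/* _mm_clflush(), _mm_mfence() */
#define HAVE_CLFLUSH 1
#else
#define HAVE_CLFLUSH 0
#endif

#define CACHE_LINE 64	/* assumed cache line size in bytes */

/* Evict every cache line of [buf, buf + len) so the next access misses. */
static void flush_range(const void *buf, size_t len)
{
#if HAVE_CLFLUSH
	const char *p = buf;

	for (size_t off = 0; off < len; off += CACHE_LINE)
		_mm_clflush(p + off);
	_mm_mfence();	/* order the flushes before any later loads */
#else
	(void)buf;
	(void)len;	/* no portable flush; this sketch is a no-op here */
#endif
}

/* Read one long per cache line so the range is warm, as far as it fits. */
static long touch_range(const void *buf, size_t len)
{
	const char *p = buf;
	long sum = 0;

	assert(len % CACHE_LINE == 0);
	for (size_t off = 0; off < len; off += CACHE_LINE)
		sum += *(const volatile long *)(p + off);
	return sum;
}
```

Calling flush_range() before the timed loop would correspond to the
uncached_access case, and touch_range() before it to the cached_access case.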

Generic Kernel:
Size   move_pages          uncached_access  cached_access  post_move_access
4KB    193154.40±50519.59  1269.40±163.11   383.00±31.92   420.40±77.04
2MB    4930.36±100.74      793.46±82.39     208.59±2.07    181.34±11.55
512MB  4498.93±146.95      656.43±23.08     801.93±111.80  402.37±15.26
1GB    4419.88±203.91      627.85±13.24     776.01±94.27   384.24±7.33

Results with Patched Kernel:
1. Offload disabled - Folios batch-move using CPU
Size   move_pages          uncached_access  cached_access  post_move_access
4KB    206370.20±28303.18  1265.20±141.38   385.40±54.32   407.80±52.60
2MB    5110.16±188.60      794.05±72.25     208.65±1.75    177.48±9.93
512MB  4548.00±188.91      658.23±23.63     777.34±113.15  403.48±17.27
1GB    4521.19±195.13      628.85±14.72     750.85±98.22   387.79±9.49

2. Offload enabled - Folios batch-move using DMAengine
Size   move_pages          uncached_access  cached_access  post_move_access
4KB    222818.00±22710.80  1277.80±145.74   405.20±101.85  427.60±130.13
2MB    15590.80±288.89     799.36±76.60     208.79±2.11    183.21±11.67
512MB  14154.06±197.59     649.93±20.35     814.10±109.81  403.43±13.79
1GB    14415.04±303.83     629.03±14.83     731.16±97.67   385.08±7.62

Code snippet to access memory:
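before = rdtsc();
for (int i = 0; i < num_pages; i++) {
	/* Read one long from each 64-byte cache line of the page. */
	for (int j = 0; j < page_size; j += 64) {
		/* volatile keeps the compiler from dropping the load */
		junk += *(volatile long *)((char *)pages[i] + j);
	}
}
after = rdtsc();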
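
For anyone who wants to try the loop in isolation, a self-contained variant
might look like the sketch below. The names read_cycles() and
access_cycles_per_page() are hypothetical, __rdtsc() from <x86intrin.h>
stands in for the rdtsc() helper above, and the non-x86 fallback returns
nanoseconds rather than cycles.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>	/* __rdtsc() */
#else
#include <time.h>
#endif

static inline uint64_t read_cycles(void)
{
#if defined(__x86_64__) || defined(__i386__)
	return __rdtsc();
#else
	/* Fallback for other architectures: nanoseconds, not cycles. */
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/*
 * Read one long per 64-byte line of every page and return the elapsed
 * count divided by num_pages. The sum goes to *junk_out so that the
 * loads have a visible side effect.
 */
static double access_cycles_per_page(void *const *pages, int num_pages,
				     size_t page_size, long *junk_out)
{
	long junk = 0;
	uint64_t before, after;

	assert(num_pages > 0 && page_size % 64 == 0);

	before = read_cycles();
	for (int i = 0; i < num_pages; i++) {
		const char *page = pages[i];

		for (size_t j = 0; j < page_size; j += 64)
			junk += *(const volatile long *)(page + j);
	}
	after = read_cycles();

	*junk_out = junk;
	return (double)(after - before) / num_pages;
}
```

The same function can be called before and after move_pages(2) on the same
pages[] array to get comparable per-page numbers.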

Discussion:
1. My analysis revealed no significant difference in post-move access times
between CPU and DMA migration (e.g. for 1GB: 387.79±9.49 cycles per page with
CPU copy vs. 385.08±7.62 with DMA).
2. For the smaller workloads (4KB and 2MB), cached accesses are roughly 3-4x
faster than uncached accesses. For the larger workloads (512MB and 1GB), which
exceed cache capacity, cached accesses are no faster than uncached accesses.
3. As expected, post-migration access times are significantly lower than the
pre-migration uncached access times, due to NUMA locality.
4. To make sure the prefetchers weren't skewing the results, I ran another
test with them turned off. With prefetchers disabled, the post-migration
access cycles for DMA and CPU are still similar.

Thanks,
Shivank