Re: [PATCH v1 6/6] xprtrdma: Pull up sometimes

> On Oct 19, 2019, at 12:36 PM, Tom Talpey <tom@xxxxxxxxxx> wrote:
> 
> On 10/18/2019 7:34 PM, Chuck Lever wrote:
>> Hi Tom-
>>> On Oct 18, 2019, at 4:17 PM, Tom Talpey <tom@xxxxxxxxxx> wrote:
>>> 
>>> On 10/17/2019 2:31 PM, Chuck Lever wrote:
>>>> On some platforms, DMA mapping part of a page is more costly than
>>>> copying bytes. Restore the pull-up code and use that when we
>>>> think it's going to be faster. The heuristic for now is to pull-up
>>>> when the size of the RPC message body fits in the buffer underlying
>>>> the head iovec.
>>>> Indeed, not involving the I/O MMU can help the RPC/RDMA transport
>>>> scale better for tiny I/Os across more RDMA devices. This is because
>>>> interaction with the I/O MMU is eliminated, as is handling a Send
>>>> completion, for each of these small I/Os. Without the explicit
>>>> unmapping, the NIC no longer needs to do a costly internal TLB shoot
>>>> down for buffers that are just a handful of bytes.
>>> 
>>> This is good stuff. Do you have any performance data for the new
>>> strategy, especially latencies and local CPU cycles per byte?
>> Saves almost a microsecond of RT latency on my NFS client that uses
>> a real Intel IOMMU. On my other NFS client, the DMA map operations
>> are always a no-op. This savings applies only to NFS WRITE, of course.
>> I don't have a good benchmark for cycles per byte. Do you have any
>> suggestions? Not sure how I would account for cycles spent handling
>> Send completions, for example.
> 
> Cycles per byte is fairly simple but like all performance measurement
> the trick is in the setup. Because of platform variations, it's best
> to compare results on the same hardware. The absolute value isn't as
> meaningful. Here's a rough sketch of one approach.
> 
> - Configure BIOS and OS to hold CPU frequency constant:
>  - ACPI C-states off
>  - Turbo mode off
>  - Power management off (OS needs this too)
>  - Anything else relevant to clock variation
> - Hyperthreading off
>  - (hyperthreads don't add work linearly)
> - Calculate core count X clock frequency
>  - (e.g. 8 X 3GHz = 24G cycles/sec)
> 
> Now, use a benchmark which runs the desired workload and reports %CPU.
> For a given interval, record the total bytes transferred, time spent,
> and CPU load. (e.g. 100GB, 100 sec, 20%).
> 
> Finally, compute CpB (the 1/sec terms cancel out):
> 20% x 24 Gcps = 4.8 Gcps
> 100 GB / 100 s = 1 GBps
> 4.8 Gcps / 1 GBps = 4.8 CpB
> 
> Like I said, it's rough, but surprisingly telling. A similar metric
> is cycles per IOP, and since you're focusing on small I/O with this
> change, it might also be an interesting calculation. Simply replace
> total bytes/sec with IOPS.
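To make the arithmetic concrete, here is a throwaway sketch of that calculation (the numbers are the example values from your sketch, not measurements):

```python
# Rough cycles-per-byte (CpB) estimate from aggregate counters.
# Example values from the sketch above: 8 cores x 3 GHz, 20% CPU,
# 100 GB transferred in 100 seconds.

def cycles_per_byte(cores, hz_per_core, cpu_fraction, total_bytes, seconds):
    """CpB = (busy cycles/sec) / (bytes/sec); the 1/sec terms cancel."""
    cycles_per_sec = cores * hz_per_core * cpu_fraction
    bytes_per_sec = total_bytes / seconds
    return cycles_per_sec / bytes_per_sec

cpb = cycles_per_byte(8, 3e9, 0.20, 100e9, 100)
print(cpb)  # -> 4.8

# Cycles per IOP is the same idea with IOPS in the denominator
# instead of bytes per second.
```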

Systems under test:

	• 12 Haswell cores x 1.6GHz = 19.2 billion cps
	• Server is exporting a tmpfs filesystem
	• Client and server using CX-3 Pro on 56Gbps InfiniBand
	• Kernel is v5.4-rc4
	• iozone -M -+u -i0 -i1 -s1g -r1k -t12 -I

The purpose of this test is to compare the two kernels, not to publish an absolute performance value. Both kernels below have a number of CPU-intensive debugging options enabled, which might tend to increase CPU cycles per byte or per I/O, and might also amplify the differences between the two kernels.



*** With DMA-mapping kernel (confirmed after test - total pull-up was zero bytes):

WRITE tests:

	• Write test: CPU Utilization: Wall time  496.136    CPU time  812.879    CPU utilization 163.84 %
	• Re-write test: CPU utilization: Wall time  500.266    CPU time  822.810    CPU utilization 164.47 %

Final mountstats results:

WRITE:
    25161863 ops (50%)
    avg bytes sent per op: 1172    avg bytes received per op: 136
    backlog wait: 0.094913     RTT: 0.048245     total execute time: 0.213270 (milliseconds)

Based solely on the iozone Write test:
12 threads x 1GB file = 12 GB transferred
12 GB / 496 s = 25977625 Bps
19.2 billion cps / 25977625 Bps = 739 cpB @ 1KB I/O

Based on both the iozone Write and Re-write tests:
25161863 ops / 996 s = 25263 IOps
19.2 billion cps / 25263 IOps = 760004 cpIO
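As a sanity check, the cpIO figure can be reproduced directly from the reported counters (rounding IOps to the nearest integer first, as above); a throwaway snippet:

```python
# Reproduce the Write + Re-write cycles-per-I/O figure from the
# counters reported above.
ops = 25161863            # WRITE ops from mountstats
secs = 996                # Write + Re-write wall time, seconds
cps = 19.2e9              # 12 cores x 1.6 GHz

iops = round(ops / secs)  # -> 25263
cpio = int(cps / iops)    # -> 760004
print(iops, cpio)
```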


READ tests:

	• Read test: CPU utilization: Wall time  451.762    CPU time  826.888    CPU utilization 183.04 %
	• Re-read test: CPU utilization: Wall time  452.543    CPU time  827.575    CPU utilization 182.87 %

Final mountstats results:

READ:
    25146066 ops (49%)
    avg bytes sent per op: 140    avg bytes received per op: 1152
    backlog wait: 0.092140     RTT: 0.045202     total execute time: 0.205996 (milliseconds)

Based solely on the iozone Read test:
12 threads x 1GB file = 12 GB transferred
12 GB / 451 s = 28569627 Bps
19.2 billion cps / 28569627 Bps = 672 cpB @ 1KB I/O

Based on both the iozone Read and Re-read tests:
25146066 ops / 903 s = 27847 IOps
19.2 billion cps / 27847 IOps = 689481 cpIO



*** With pull-up kernel (confirmed after test - total pull-up was 25763734528 bytes):

WRITE tests:

	• Write test: CPU Utilization: Wall time  453.318    CPU time  839.581    CPU utilization 185.21 %
	• Re-write test: CPU utilization: Wall time  458.717    CPU time  850.335    CPU utilization 185.37 %

Final mountstats results:

WRITE:
    25159897 ops (50%)
    avg bytes sent per op: 1172    avg bytes received per op: 136
    backlog wait: 0.080036     RTT: 0.049674     total execute time: 0.183426 (milliseconds)

Based solely on the iozone Write test:
12 threads x 1GB file = 12 GB transferred
12 GB / 453 s = 28443492 Bps
19.2 billion cps / 28443492 Bps = 675 cpB @ 1KB I/O

Based on both the iozone Write and Re-write tests:
25159897 ops / 911 s = 27617 IOps
19.2 billion cps / 27617 IOps = 695223 cpIO


READ tests:

	• Read test: CPU utilization: Wall time  451.248    CPU time  834.203    CPU utilization 184.87 %
	• Re-read test: CPU utilization: Wall time  451.113    CPU time  834.302    CPU utilization 184.94 %

Final mountstats results:

READ:
    25149527 ops (49%)
    avg bytes sent per op: 140    avg bytes received per op: 1152
    backlog wait: 0.091011     RTT: 0.045790     total execute time: 0.203793 (milliseconds)

Based solely on the iozone Read test:
12 threads x 1GB file = 12 GB transferred
12 GB / 451 s = 28569627 Bps
19.2 billion cps / 28569627 Bps = 672 cpB @ 1KB I/O

Based on both the iozone Read and Re-read tests:
25149527 ops / 902 s = 27881 IOps
19.2 billion cps / 27881 IOps = 688641 cpIO



*** Analysis:

For both kernels, the READ tests are close. This demonstrates that the patch does not have any gross effects on the READ path, as expected.

The WRITE tests are more remarkable.
	• Mean total execute time per WRITE RPC decreases by about 30 microseconds. Almost half of that is decreased backlog wait.
	• Mean round-trip time increases by a microsecond and a half. My earlier report that RT decreased by a microsecond was based on a QD=1 direct latency measure.
	• For 1KB WRITEs: IOPS, cycles per byte written, and cycles per I/O are now within spitting distance of the same metrics for 1KB READs.


--
Chuck Lever






