Re: performance regression noted in v5.11-rc after c062db039f40

Lu Baolu <baolu.lu@xxxxxxxxxxxxxxx> · Tue, 19 Jan 2021 09:22:02 +0800

Hi Chuck,

On 1/19/21 4:09 AM, Chuck Lever wrote:

On Jan 18, 2021, at 1:00 PM, Robin Murphy <robin.murphy@xxxxxxx> wrote:

On 2021-01-18 16:18, Chuck Lever wrote:
On Jan 12, 2021, at 9:38 AM, Will Deacon <will@xxxxxxxxxx> wrote:

[Expanding cc list to include DMA-IOMMU and intel IOMMU folks]

On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
Hi-

[ Please cc: me on replies, I'm not currently subscribed to
iommu@lists ].

I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:

/home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I

For those not familiar with the way storage protocols use RDMA, The
initiator/client sets up memory regions and the target/server uses
RDMA Read and Write to move data out of and into those regions. The
initiator/client uses only RDMA memory registration and invalidation
operations, and the target/server uses RDMA Read and Write.

My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
enabled using the kernel command line options "intel_iommu=on
iommu=strict".

Recently I've noticed a significant (25-30%) loss in NFS throughput.
I was able to bisect on my client to the following commits.

Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
map_sg"). This is about normal for this test.

	Children see throughput for 12 initial writers 	= 4732581.09 kB/sec
	Parent sees throughput for 12 initial writers 	= 4646810.21 kB/sec
	Min throughput per process 			=  387764.34 kB/sec
	Max throughput per process 			=  399655.47 kB/sec
	Avg throughput per process 			=  394381.76 kB/sec
	Min xfer 					= 1017344.00 kB
	CPU Utilization: Wall time    2.671    CPU time    1.974    CPU utilization  73.89 %
	Children see throughput for 12 rewriters 	= 4837741.94 kB/sec
	Parent sees throughput for 12 rewriters 	= 4833509.35 kB/sec
	Min throughput per process 			=  398983.72 kB/sec
	Max throughput per process 			=  406199.66 kB/sec
	Avg throughput per process 			=  403145.16 kB/sec
	Min xfer 					= 1030656.00 kB
	CPU utilization: Wall time    2.584    CPU time    1.959    CPU utilization  75.82 %
	Children see throughput for 12 readers 		= 5921370.94 kB/sec
	Parent sees throughput for 12 readers 		= 5914106.69 kB/sec
	Min throughput per process 			=  491812.38 kB/sec
	Max throughput per process 			=  494777.28 kB/sec
	Avg throughput per process 			=  493447.58 kB/sec
	Min xfer 					= 1042688.00 kB
	CPU utilization: Wall time    2.122    CPU time    1.968    CPU utilization  92.75 %
	Children see throughput for 12 re-readers 	= 5947985.69 kB/sec
	Parent sees throughput for 12 re-readers 	= 5941348.51 kB/sec
	Min throughput per process 			=  492805.81 kB/sec
	Max throughput per process 			=  497280.19 kB/sec
	Avg throughput per process 			=  495665.47 kB/sec
	Min xfer 					= 1039360.00 kB
	CPU utilization: Wall time    2.111    CPU time    1.968    CPU utilization  93.22 %

Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
iommu_ops.at(de)tach_dev"). It's losing some steam here.

	Children see throughput for 12 initial writers 	= 4342419.12 kB/sec
	Parent sees throughput for 12 initial writers 	= 4310612.79 kB/sec
	Min throughput per process 			=  359299.06 kB/sec
	Max throughput per process 			=  363866.16 kB/sec
	Avg throughput per process 			=  361868.26 kB/sec
	Min xfer 					= 1035520.00 kB
	CPU Utilization: Wall time    2.902    CPU time    1.951    CPU utilization  67.22 %
	Children see throughput for 12 rewriters 	= 4408576.66 kB/sec
	Parent sees throughput for 12 rewriters 	= 4404280.87 kB/sec
	Min throughput per process 			=  364553.88 kB/sec
	Max throughput per process 			=  370029.28 kB/sec
	Avg throughput per process 			=  367381.39 kB/sec
	Min xfer 					= 1033216.00 kB
	CPU utilization: Wall time    2.836    CPU time    1.956    CPU utilization  68.97 %
	Children see throughput for 12 readers 		= 5406879.47 kB/sec
	Parent sees throughput for 12 readers 		= 5401862.78 kB/sec
	Min throughput per process 			=  449583.03 kB/sec
	Max throughput per process 			=  451761.69 kB/sec
	Avg throughput per process 			=  450573.29 kB/sec
	Min xfer 					= 1044224.00 kB
	CPU utilization: Wall time    2.323    CPU time    1.977    CPU utilization  85.12 %
	Children see throughput for 12 re-readers 	= 5410601.12 kB/sec
	Parent sees throughput for 12 re-readers 	= 5403504.40 kB/sec
	Min throughput per process 			=  449918.12 kB/sec
	Max throughput per process 			=  452489.28 kB/sec
	Avg throughput per process 			=  450883.43 kB/sec
	Min xfer 					= 1043456.00 kB
	CPU utilization: Wall time    2.321    CPU time    1.978    CPU utilization  85.21 %

And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
the iommu ops"). Significant throughput loss.

	Children see throughput for 12 initial writers 	= 3812036.91 kB/sec
	Parent sees throughput for 12 initial writers 	= 3753683.40 kB/sec
	Min throughput per process 			=  313672.25 kB/sec
	Max throughput per process 			=  321719.44 kB/sec
	Avg throughput per process 			=  317669.74 kB/sec
	Min xfer 					= 1022464.00 kB
	CPU Utilization: Wall time    3.309    CPU time    1.986    CPU utilization  60.02 %
	Children see throughput for 12 rewriters 	= 3786831.94 kB/sec
	Parent sees throughput for 12 rewriters 	= 3783205.58 kB/sec
	Min throughput per process 			=  313654.44 kB/sec
	Max throughput per process 			=  317844.50 kB/sec
	Avg throughput per process 			=  315569.33 kB/sec
	Min xfer 					= 1035520.00 kB
	CPU utilization: Wall time    3.302    CPU time    1.945    CPU utilization  58.90 %
	Children see throughput for 12 readers 		= 4265828.28 kB/sec
	Parent sees throughput for 12 readers 		= 4261844.88 kB/sec
	Min throughput per process 			=  352305.00 kB/sec
	Max throughput per process 			=  357726.22 kB/sec
	Avg throughput per process 			=  355485.69 kB/sec
	Min xfer 					= 1032960.00 kB
	CPU utilization: Wall time    2.934    CPU time    1.942    CPU utilization  66.20 %
	Children see throughput for 12 re-readers 	= 4220651.19 kB/sec
	Parent sees throughput for 12 re-readers 	= 4216096.04 kB/sec
	Min throughput per process 			=  348677.16 kB/sec
	Max throughput per process 			=  353467.44 kB/sec
	Avg throughput per process 			=  351720.93 kB/sec
	Min xfer 					= 1035264.00 kB
	CPU utilization: Wall time    2.969    CPU time    1.952    CPU utilization  65.74 %

The regression appears to be 100% reproducible.
Any thoughts?
How about some tools to try or debugging advice? I don't know where to start.

I'm not familiar enough with VT-D internals or Infiniband to have a clue why the middle commit makes any difference (the calculation itself is not on a fast path, so AFAICS the worst it could do is change your maximum DMA address size from 48/57 bits to 47/56, and that seems relatively benign).

Thanks for your response. Understood that you are not responding
about the middle commit (c062db039f40).

However, that's a pretty small and straightforward change, so I've
experimented a bit with that. Commenting out the new code there, I
get some relief:

	Children see throughput for 12 initial writers 	= 4266621.62 kB/sec
	Parent sees throughput for 12 initial writers 	= 4254756.31 kB/sec
	Min throughput per process 			=  354847.75 kB/sec
	Max throughput per process 			=  356167.59 kB/sec
	Avg throughput per process 			=  355551.80 kB/sec
	Min xfer 					= 1044736.00 kB
	CPU Utilization: Wall time    2.951    CPU time    1.981    CPU utilization  67.11 %

	Children see throughput for 12 rewriters 	= 4314827.34 kB/sec
	Parent sees throughput for 12 rewriters 	= 4310347.32 kB/sec
	Min throughput per process 			=  358599.72 kB/sec
	Max throughput per process 			=  360319.06 kB/sec
	Avg throughput per process 			=  359568.95 kB/sec
	Min xfer 					= 1043968.00 kB
	CPU utilization: Wall time    2.912    CPU time    2.057    CPU utilization  70.62 %

	Children see throughput for 12 readers 		= 4614004.47 kB/sec
	Parent sees throughput for 12 readers 		= 4609014.68 kB/sec
	Min throughput per process 			=  382414.81 kB/sec
	Max throughput per process 			=  388519.50 kB/sec
	Avg throughput per process 			=  384500.37 kB/sec
	Min xfer 					= 1032192.00 kB
	CPU utilization: Wall time    2.701    CPU time    1.900    CPU utilization  70.35 %

	Children see throughput for 12 re-readers 	= 4653743.81 kB/sec
	Parent sees throughput for 12 re-readers 	= 4647155.31 kB/sec
	Min throughput per process 			=  384995.69 kB/sec
	Max throughput per process 			=  390874.09 kB/sec
	Avg throughput per process 			=  387811.98 kB/sec
	Min xfer 					= 1032960.00 kB
	CPU utilization: Wall time    2.684    CPU time    1.907    CPU utilization  71.06 %

I instrumented the code to show the "before" and "after" values.

The value of domain->domain.geometry.aperture_end on my system
before this commit (and before the c062db039f40 code) is:

144,115,188,075,855,871 = 2^57

domain->domain.geometry.aperture_end makes no sense before c062db039f40.

The c062db039f40 code sets domain->domain.geometry.aperture_end to:

281,474,976,710,655 = 2^48

Do you mind posting the cap and ecap of the iommu used by your device?

You can get it via sysfs, for example:

/sys/bus/pci/devices/0000:00:14.0/iommu/intel-iommu# ls
address  cap  domains_supported  domains_used  ecap  version

Fwiw, this system uses the Intel C612 chipset with Intel(R) Xeon(R)
E5-2603 v3 @ 1.60GHz CPUs.

Can you please also hack a line of code to check the return value of
iommu_dma_map_sg()?

My sense is that "CPU time" remains about the same because the problem
isn't manifesting as an increase in instruction path length. Wall time
goes up, CPU time stays the same, the ratio of those (ie, utilization)
drops.

Best regards,
baolu