On Mon, 2016-05-16 at 13:39 +0300, Nikolay Borisov wrote:
> I've observed a strange performance pathology with it when running ipoib
> and using a naive iperf test. My setup has multiple machines with a mix
> of qlogic/mellanox cards, connected via a QLogic 12300 switch. All of
> the nodes are running at 4x 10Gbps. When I run a performance test and
> the mellanox card is the server, i.e. it is receiving data, I get very bad
> performance. By this I mean I cannot get more than 4 gigabits per
> second - very low. 'perf top' clearly shows that the culprit is
> intel_map_page, which is being called from the receive path
> of the mellanox adapter:
>
> 84.26%  0.04%  ksoftirqd/0  [kernel.kallsyms]  [k] intel_map_page
>         |
>         ---intel_map_page
>            |
>            |--98.38%-- ipoib_cm_alloc_rx_skb

Are you *sure* it's disabled? Can you be more specific about where the
time is spent? intel_map_page() doesn't really do much except call into
__intel_map_single()... which should return more or less immediately.

I'm working on improving the per-device DMA ops so that for passthrough
devices you don't end up in the IOMMU code at all, but it really
shouldn't be taking *that* long... unless you really are doing
translation.

Note that even in the case where you are doing translation, there's code
which I'm about to ask Linus to pull for 4.7 that will kill most of the
performance hit of using the IOMMU.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@xxxxxxxxx                              Intel Corporation
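For context on why a DMA map function can dominate a receive-side profile: a
connected-mode receive path allocates and DMA-maps a fresh buffer for every
incoming packet, so whatever the device's dma_map_ops resolve to (here the
Intel IOMMU's intel_map_page()) runs at least once per packet in softirq
context. The sketch below shows that per-packet pattern; it is not the actual
ipoib_cm_alloc_rx_skb() code, and rx_buf_len, the rx-ring bookkeeping and the
function name are illustrative placeholders.

/*
 * Minimal sketch of per-packet receive-buffer setup, loosely modelled on
 * what a CM receive path does for each incoming packet.  Not the real
 * ipoib_cm_alloc_rx_skb(); names are placeholders.
 */
#include <linux/skbuff.h>
#include <linux/dma-mapping.h>

static struct sk_buff *alloc_rx_skb_sketch(struct device *dma_dev,
					   unsigned int rx_buf_len,
					   dma_addr_t *mapping)
{
	struct sk_buff *skb;

	skb = dev_alloc_skb(rx_buf_len);
	if (!skb)
		return NULL;

	/*
	 * This is the call that ends up in intel_map_page() when the
	 * Intel IOMMU driver provides the dma_map_ops for the device.
	 * With a passthrough/identity mapping it should be nearly free;
	 * with real translation it has to set up IOVA page-table entries
	 * for every single receive buffer, which is what shows up at the
	 * top of 'perf top' under heavy receive load.
	 */
	*mapping = dma_map_single(dma_dev, skb->data, rx_buf_len,
				  DMA_FROM_DEVICE);
	if (dma_mapping_error(dma_dev, *mapping)) {
		dev_kfree_skb_any(skb);
		return NULL;
	}

	return skb;
}

One way to answer the "are you sure it's disabled?" question above is to check
the boot log for the DMAR/Intel-IOMMU messages and the kernel command line for
intel_iommu= / iommu=pt, which indicate whether the device is identity-mapped
or actually being translated.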