>>> The PCIe spec does not explicitly mention this issue; all we can learn from it is that
>>> the root port/root complex can split a TLP into smaller packets. For instance,
>>> one 256B packet can be split into two 128B packets.
>>>
>>> I confirmed this on my x86 machine and my IA64 machine.
>>> 1. Unload the NIC driver so that it is safe to change the NIC MPS.
>>> 2. Use setpci to change the NIC MPS to the max value it supports.
>>> 3. Reload the NIC driver.
>>> 4. Ping and use scp to copy a large file between machines. Result is ok.
>
> Just as a way to confirm that the MPS change is actually doing
> something, I assume you observe a performance difference between
> MPS=128 and MPS=512 on the NIC (and the root port MPS=128 in both
> cases)?  Or maybe you can confirm with an analyzer that there are
> actually 512-byte TLPs on the link?

Hi Bjorn,
   I didn't observe a performance difference between MPS=128 and MPS=512.
I used "ping $dest_ip -s 65500" (large packet size) to test the different situations.

1. root port MPS = 128, EP MPS = 256.

root port -------- Endpoint device
00:01.0            01:00.1

In this case, I ran ping on the local machine, and the result is ok.

linux:~ # ping 128.5.160.28 -s 65500
PING 128.5.160.28 (128.5.160.28) 65500(65528) bytes of data.
65508 bytes from 128.5.160.28: icmp_seq=1 ttl=64 time=1.43 ms
65508 bytes from 128.5.160.28: icmp_seq=2 ttl=64 time=1.42 ms
65508 bytes from 128.5.160.28: icmp_seq=3 ttl=64 time=1.41 ms
65508 bytes from 128.5.160.28: icmp_seq=4 ttl=64 time=1.37 ms
65508 bytes from 128.5.160.28: icmp_seq=5 ttl=64 time=1.43 ms
..........

\-[0000:00]-+-00.0  Intel Corporation 5500 I/O Hub to ESI Port
            +-01.0-[01]--+-00.0  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
            |            \-00.1  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet

linux:~ # lspci -vvv -s 01:00.1
01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
............[snip].............
        Capabilities: [ac] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes

linux:~ # lspci -vvv -s 00:01.0
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22) (prog-if 00 [Normal decode])
...........[snip]..............
        Capabilities: [90] Express (v2) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes

2. root port MPS = 256, EP MPS = 128.

In this case, I used "ping $dest_ip -s 65500" to test, but the ping failed.
So I guess the TLP payload during ping is larger than 128 bytes, and the EP
device discards these TLPs.

I have no analyzer to capture the TLPs, so I cannot guarantee the
conclusion that an EP MPS larger than the root port MPS is 100% safe.

>
> I assume there are no AER or other errors logged by the root port?

Yes, AER is not supported on the local machine.

> The test you showed was a copy *to* the local machine, so the NIC
> would have been doing DMA writes to memory.  I assume it works equally
> well doing a copy *from* the local machine to another machine across
> the network, where the NIC is doing DMA reads from memory?

Yes, I tested both copy directions, and the result is ok.

> The only mention I can find in the spec is sec 1.3.1, where it says "a
> Root Complex is generally permitted to split a packet into smaller
> packets when routing transactions peer-to-peer between hierarchy
> domains ..."
>
> I'm not a hardware guy (I often wish I were :)), but here's how I
> interpret that statement.
> Let's take the following example:
>
>   00:01.0 Root port bridge to [bus 01] MPS=128
>   01:00.1 Endpoint MPS=512
>
>   00:02.0 Root Port bridge to [bus 02] MPS=256
>   00:03.0 Root Port bridge to [bus 03] MPS=128
>   02:00.0 Endpoint MPS=256
>   03:00.0 Endpoint MPS=128
>
> If 02:00.0 (MPS=256) generates a DMA write destined for 03:00.0, it
> may transmit a TLP with a data payload of 256 bytes, and 00:02.0
> (MPS=256 also) will accept it.  The root complex may route the packet
> to 00:03.0 (MPS=128), and here it would need to be split into two
> 128-byte TLPs before being transmitted by 00:03.0 to 03:00.0
> (MPS=128).
>
> Your situation is basically 01:00.1 (MPS=512) doing a DMA write
> destined for memory and sending a 512-byte TLP to 00:01.0 (MPS=128).
> In this case, the root complex isn't doing any peer-to-peer routing
> between hierarchy domains, so I don't think the statement in sec 1.3.1
> applies.  So I don't understand why the root port would accept that
> TLP.  I would think it would report a malformed TLP error.

Hmmm, the PCIe spec does not say much about MPS settings, so maybe
different platforms use different strategies.

Conservatively, as an improvement to the MPS setting after hotplug, I think
setting the hotplug device's MPS equal to its parent's makes sense. This does
no harm to other devices, because we only modify the MPS register of the
hotplug device itself.

So if you agree, I will update my patch to modify only the hotplug device's
MPS, making it equal to its parent's.

Thanks!
Yijing.
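
P.S. To make the proposal concrete, here is a small Python sketch of the two ideas discussed above: splitting a payload to fit a smaller MPS, and choosing the MPS for a hot-added device so that it equals its parent's. This is only a model, not the patch and not kernel code. The function names are hypothetical; the only hardware facts assumed are that the Max_Payload_Size field sits in bits 7:5 of the Device Control register and that encoding n means 128 << n bytes.

```python
from typing import List, Optional


def mps_bytes(encoding: int) -> int:
    """Convert the 3-bit MPS encoding to bytes (0 -> 128, ..., 5 -> 4096)."""
    if not 0 <= encoding <= 5:
        raise ValueError("encodings 6 and 7 are reserved")
    return 128 << encoding


def devctl_mps(devctl: int) -> int:
    """Extract MaxPayload in bytes from a 16-bit Device Control value."""
    return mps_bytes((devctl >> 5) & 0x7)


def split_payload(length: int, mps: int) -> List[int]:
    """Payload sizes needed to carry `length` bytes with TLPs of at most `mps`."""
    return [min(mps, length - off) for off in range(0, length, mps)]


def hotplug_mps(parent_mps: int, dev_mps_supported: int) -> Optional[int]:
    """MPS to program into a hot-added device so it matches its parent.

    Returns None when the device cannot support the parent's MPS; handling
    that case would mean touching other devices, which this sketch avoids.
    """
    if dev_mps_supported < parent_mps:
        return None
    return parent_mps


# Bjorn's peer-to-peer example: a 256-byte write routed to an MPS=128 port.
print(split_payload(256, 128))
# The case from this thread: parent MPS=128, endpoint DevCap MaxPayload 512.
print(hotplug_mps(128, 512))
```

The point of hotplug_mps() is that it never raises the device above its parent and never modifies the parent, so devices that are already running are left alone.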