Hi, I would like to provide more information for anyone who might be interested. We modified the SSD firmware to simulate a device with a 128B MPS capability and experimented with the different pcie_bus modes on a Dell R730XD, so we could get a better look at the whole picture. In the tables below, "C" is the device's MPS capability and all MPS/MRRS values are in bytes.

First, cold boot with a single SSD device:

          Slot (C 256B)    SSD (C 128B)     |   Slot (C 256B)    SSD (C 256B)
          MPS    MRRS      MPS    MRRS      |   MPS    MRRS      MPS    MRRS
  Normal  128    128       128    4096      |   256    128       256    4096
  Perf    256    128       128    128       |   256    128       256    256
  Safe    128    128       128    4096      |   256    128       256    4096

Then cold boot with two devices of different MPS capability, both slots connected directly to the CPU:

          Slot 1 (C 256B)  SSD 1 (C 128B)   Slot 2 (C 256B)  SSD 2 (C 256B)
          MPS    MRRS      MPS    MRRS      MPS    MRRS      MPS    MRRS
  Normal  128    128       128    4096      256    128       256    4096
  Perf    256    128       128    128       256    128       256    256
  Safe    128    128       128    4096      256    128       256    4096

Finally, to match Sinan's example, we put the two U.2 SSDs behind a PCIe switch:

  -[0000:00]-+-00.0
             +-01.0-[03]----00.0
             +-02.0-[04]--
             +-03.0-[02]--+-00.0
             |            \-00.1
             +-03.1-[01]--+-00.0
             |            \-00.1
             +-03.2-[05-0a]----00.0-[06-0a]--+-04.0-[07]--
             |                               +-05.0-[08]----00.0   -> 256B-capable SSD
             |                               +-06.0-[09]----00.0   -> 128B-capable SSD

Here 00:03.2 is the root port, 05:00.0 is the switch upstream port, and 06:05.0/06:06.0 are the switch downstream ports leading to the SSDs at 08:00.0 and 09:00.0:

          00:03.2   05:00.0   06:05.0   08:00.0   06:06.0   09:00.0
          (C 256)   (C 512)   (C 512)   SSD(C256) (C 512)   SSD(C128)
          MPS MRRS  MPS MRRS  MPS MRRS  MPS MRRS  MPS MRRS  MPS MRRS
  Normal  128  128  128  128  128  128  128 4096  128  128  128 4096
  Perf    256  128  256  128  256  128  256  256  256  128  128  128
  Safe    128  128  128  128  128  128  128 4096  128  128  128 4096

I think from the examples above:

1. perf mode moves devices to 256B MPS wherever it can.
2. safe mode sets MPS to 128B.
3. setting MRRS=MPS in perf mode is a CORRECT call for a device whose MPS capability is lower than its parent's.
4. is setting MRRS=MPS in perf mode really necessary for a device with the SAME MPS capability as its parent?
5. an interesting point to me is that the slot/switch/root MRRS is always set to 128B; I have not found out why.

Again, thanks for everyone's time on this subject. We have learned a lot.

Ron

-----Original Message-----
From: Bjorn Helgaas [mailto:helgaas@xxxxxxxxxx]
Sent: January 25, 2018 2:01
To: Radjendirane Codandaramane <radjendirane.codanda@xxxxxxxxxxxxx>
Cc: Ron Yuan <ron.yuan@xxxxxxxxxxxx>; Sinan Kaya <okaya@xxxxxxxxxxxxxx>; Bjorn Helgaas <bhelgaas@xxxxxxxxxx>; Bo Chen <bo.chen@xxxxxxxxxxxx>; William Huang <william.huang@xxxxxxxxxxxx>; Fengming Wu <fengming.wu@xxxxxxxxxxxx>; Jason Jiang <jason.jiang@xxxxxxxxxxxxx>; Ramyakanth Edupuganti <Ramyakanth.Edupuganti@xxxxxxxxxxxxx>; William Cheng <william.cheng@xxxxxxxxxxxxx>; Kim Helper (khelper) <khelper@xxxxxxxxxx>; Linux PCI <linux-pci@xxxxxxxxxxxxxxx>
Subject: Re: One Question About PCIe BUS Config Type with pcie_bus_safe or pcie_bus_perf On NVMe Device

On Tue, Jan 23, 2018 at 11:50:30PM +0000, Radjendirane Codandaramane wrote:
> Hi Bjorne,

s/Bjorne/Bjorn/

Also, the convention on Linux mailing lists is to use interleaved reply style [1] because it makes it easier to follow the conversation.

> Ceiling the MRRS to the MPS value in order to guarantee the
> interoperability in pcie_bus_perf mode does not make sense.

I disagree with this part. The Linux pcie_bus_perf mode isn't my favorite approach, but given the assumptions it makes, it does make sense. I laid out the explanation in [2]; if you disagree with something there, please be specific about what's wrong.
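To make that assumption concrete, here is a rough user-space sketch of the perf-mode rule as I understand it (the struct and helper names are made up for illustration; this is not the drivers/pci code): walk the hierarchy top-down, give each device the smaller of its own MPS capability and its parent's MPS, then clamp its MRRS to its own MPS so the completions it receives never exceed what it can handle, even when devices upstream run a larger MPS.

  /*
   * User-space sketch only: models the top-down MPS/MRRS assignment of
   * pcie_bus_perf as I understand it.  Names are made up for
   * illustration; this is not the drivers/pci implementation.
   */
  #include <stdio.h>

  struct pcie_node {
          const char *name;
          int mpss;                       /* MPS capability, in bytes */
          int mps, mrrs;                  /* values the model would program */
          const struct pcie_node *parent; /* upstream device, NULL at root port */
  };

  static int min_int(int a, int b)
  {
          return a < b ? a : b;
  }

  static void perf_mode_configure(struct pcie_node *dev)
  {
          /* Top-down: a child never runs a larger MPS than its parent. */
          dev->mps = dev->parent ? min_int(dev->mpss, dev->parent->mps)
                                 : dev->mpss;

          /*
           * Clamp MRRS to the device's own MPS so every completion it
           * receives fits within its MPS, even if upstream devices were
           * left at a larger MPS.
           */
          dev->mrrs = dev->mps;
  }

  int main(void)
  {
          /* Example chain: 256B root port, 512B switch, 128B endpoint. */
          struct pcie_node root = { "root port",         256, 0, 0, NULL };
          struct pcie_node swup = { "switch upstream",   512, 0, 0, &root };
          struct pcie_node swdn = { "switch downstream", 512, 0, 0, &swup };
          struct pcie_node ssd  = { "endpoint (SSD)",    128, 0, 0, &swdn };
          struct pcie_node *order[] = { &root, &swup, &swdn, &ssd };

          for (unsigned int i = 0; i < sizeof(order) / sizeof(order[0]); i++) {
                  perf_mode_configure(order[i]);
                  printf("%-18s MPS=%-3d MRRS=%d\n",
                         order[i]->name, order[i]->mps, order[i]->mrrs);
          }
          return 0;
  }

For a 128B-capable endpoint below 256B-capable upstream devices, this gives MPS=256 on the path and MPS=MRRS=128 on the endpoint, which is the combination perf mode is aiming for.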
> A device can make a MemRd request according to the MRRS setting
> (which can be higher than its MPS), but the completer has to respect
> the MPS and send completions accordingly.  As an example, the system
> can configure MPS=128B and MRRS=4K, where an endpoint can make a 4K
> MemRd request, but the completer has to send completions as 128B
> TLPs, respecting the MPS setting.  MRRS does not force a device to
> use a higher MPS value than it is configured to.

This part is certainly true. I used a very similar example in [2]:

  Consider the case where a function has MPS=256 and MRRS=1024: if the
  function generates a 1024-byte read request, it should receive 4
  completions, each with a 256-byte data payload.

> Another factor that needs to be considered for storage devices is
> support of T10 Protection Information (DIF).  For every 512B or 4KB,
> an 8B PI is computed and inserted or verified, which requires the
> 512B of data to arrive in sequence.

MPS and MRRS are parts of the PCIe TLP protocol. I'm not aware of any connection between the PCIe protocol and the T10 Protection Information/DIF stuff. I assume the T10/DIF stuff is content being carried by PCIe, and PCIe doesn't care at all about the content it carries. Can you clarify what the connection is here?

> If the MRRS is < 512B, this might pose out-of-order completions to
> the storage device, if the EP has to submit multiple outstanding read
> requests in order to achieve higher performance.

If I understand correctly, this is exactly the situation described in the implementation note in PCIe r4.0, sec 2.4.1: completions for a single memory read request are always returned in order (Table 2-39, entry D5b), but completions for multiple read requests may be returned in any order because completions may pass each other (entry D5a). The note doesn't mention this, but I think even the multiple outstanding read requests themselves can pass each other (entry B3).

So I agree that if the function generates multiple outstanding read requests, the completions for those requests may come back out of order. For example, if the function generates req-A and req-B, the completions for req-A will be in order and the completions for req-B will be in order, but it might receive cmpl-B1, cmpl-A1, cmpl-B2, cmpl-B3, cmpl-A2, cmpl-A3, cmpl-A4, cmpl-B4.

This reordering may happen no matter what the MRRS setting is. The only way to avoid it is to avoid multiple outstanding read requests.

> This would be a challenge for the storage endpoints that process the
> T10 PI inline with the transfer; now they have to store and process
> the 512B sectors once they receive all the TLPs for that sector.

Are you saying these endpoints depend on MRRS >= 512 for correct operation? I don't think that would be legal, per sec 7.5.3.4, which says "the Function must not generate Read Requests with a size exceeding the set [MRRS] value." That doesn't leave the function any wiggle room -- if the OS programs MRRS=128, the function has to deal with it. There's no provision for the function to say "I can only support MRRS > X".

But you probably mean that the endpoints will function correctly even with MRRS=128, just with poor performance. That's certainly a concern for all devices; reducing MRRS increases overhead, so we should use the largest possible MRRS, subject to the constraints of any MPS/MRRS design (like pcie_bus_perf) and concerns about bandwidth allocation (see the implementation note in sec 7.5.3.4).
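To put a rough number on the overhead point, here is a small stand-alone sketch. The 24-byte per-TLP cost is a nominal header+framing figure I picked purely for illustration, and the request/completion split assumes aligned power-of-two transfers (real completions also split at the Read Completion Boundary), so treat the percentages as ballpark only.

  /*
   * Illustration only: count the TLPs needed for one DMA read of
   * 'xfer' bytes under a few MPS/MRRS combinations, using a nominal
   * 24-byte per-TLP header/framing cost.
   */
  #include <stdio.h>

  #define TLP_OVERHEAD 24        /* assumed bytes of header/framing per TLP */

  static void count_tlps(int xfer, int mrrs, int mps)
  {
          int nreq = (xfer + mrrs - 1) / mrrs;    /* MemRd requests issued     */
          int ncpl = (xfer + mps - 1) / mps;      /* completions carrying data */
          double payload_fraction =
                  (double)xfer / (xfer + TLP_OVERHEAD * (nreq + ncpl));

          printf("xfer=%4dB MRRS=%4d MPS=%3d -> %2d requests, %2d completions, "
                 "~%2.0f%% payload on the link\n",
                 xfer, mrrs, mps, nreq, ncpl, 100.0 * payload_fraction);
  }

  int main(void)
  {
          count_tlps(4096, 4096, 128);    /* large MRRS: a single request     */
          count_tlps(4096, 1024, 256);    /* the MPS=256/MRRS=1024 case above */
          count_tlps(4096,  128, 128);    /* MRRS clamped down to a 128B MPS  */
          return 0;
  }

The middle case is the MPS=256/MRRS=1024 example from earlier in this mail (4 completions per request); the last one shows the extra request TLPs you pay for when MRRS is clamped down to a 128B MPS.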
> So, it is better to decouple the MRRS and MPS in pcie_bus_perf mode.
> Like stated earlier in the thread, provide an option to configure
> MRRS separately in pcie_bus_perf mode.

If you read [2] again, it will be obvious why the Linux pcie_bus_perf mode requires the use of MRRS in addition to MPS. At the very end of that email and in [3], I pointed out that the current Linux implementation is unnecessarily conservative in some cases, and there is room for fixing that, which should improve performance. But given the design of pcie_bus_perf, it's impossible to completely decouple MRRS and MPS in that mode.

[1] https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
[2] https://lkml.kernel.org/r/20180119205153.GB160618@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
[3] https://lkml.kernel.org/r/20180123174821.GF5317@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx