Hi, I would like to provide more information for anyone who might be interested. We modified the SSD firmware to simulate a device with a 128B MPS capability and experimented with the different pcie_bus modes on a Dell R730XD, so we could get a better look at the whole picture. In the tables below, "C" is the device's MPS capability and all MPS/MRRS values are in bytes.

First, cold boot with a single SSD device:

          Slot (C 256B)    SSD (C 128B)     |   Slot (C 256B)    SSD (C 256B)
          MPS    MRRS      MPS    MRRS      |   MPS    MRRS      MPS    MRRS
  Normal  128    128       128    4096      |   256    128       256    4096
  Perf    256    128       128    128       |   256    128       256    256
  Safe    128    128       128    4096      |   256    128       256    4096

Then cold boot with two devices of different MPS capability, both slots connected directly to the CPU:

          Slot 1 (C 256B)  SSD 1 (C 128B)   Slot 2 (C 256B)  SSD 2 (C 256B)
          MPS    MRRS      MPS    MRRS      MPS    MRRS      MPS    MRRS
  Normal  128    128       128    4096      256    128       256    4096
  Perf    256    128       128    128       256    128       256    256
  Safe    128    128       128    4096      256    128       256    4096

Finally, to match Sinan's example, we put the two U.2 SSDs behind a PCIe switch:

  -[0000:00]-+-00.0
             +-01.0-[03]----00.0
             +-02.0-[04]--
             +-03.0-[02]--+-00.0
             |            \-00.1
             +-03.1-[01]--+-00.0
             |            \-00.1
             +-03.2-[05-0a]----00.0-[06-0a]--+-04.0-[07]--
             |                               +-05.0-[08]----00.0   -> 256B-capable SSD
             |                               +-06.0-[09]----00.0   -> 128B-capable SSD

Here 00:03.2 is the root port, 05:00.0 is the switch upstream port, and 06:05.0/06:06.0 are the switch downstream ports leading to the SSDs at 08:00.0 and 09:00.0:

          00:03.2   05:00.0   06:05.0   08:00.0   06:06.0   09:00.0
          (C 256)   (C 512)   (C 512)   SSD(C256) (C 512)   SSD(C128)
          MPS MRRS  MPS MRRS  MPS MRRS  MPS MRRS  MPS MRRS  MPS MRRS
  Normal  128  128  128  128  128  128  128 4096  128  128  128 4096
  Perf    256  128  256  128  256  128  256  256  256  128  128  128
  Safe    128  128  128  128  128  128  128 4096  128  128  128 4096

I think from the examples above:

1. perf mode moves devices to 256B MPS wherever it can.
2. safe mode sets MPS to 128B.
3. setting MRRS=MPS in perf mode is a CORRECT call for a device whose MPS capability is lower than its parent's.
4. is setting MRRS=MPS in perf mode really necessary for a device with the SAME MPS capability as its parent?
5. an interesting point to me is that the slot/switch/root MRRS is always set to 128B; I have not found out why.

Again, thanks for everyone's time on this subject. We have learned a lot.

Ron

-----Original Message-----
From: Bjorn Helgaas [mailto:helgaas@xxxxxxxxxx]
Sent: January 25, 2018 2:01
To: Radjendirane Codandaramane <radjendirane.codanda@xxxxxxxxxxxxx>
Cc: Ron Yuan <ron.yuan@xxxxxxxxxxxx>; Sinan Kaya <okaya@xxxxxxxxxxxxxx>; Bjorn Helgaas <bhelgaas@xxxxxxxxxx>; Bo Chen <bo.chen@xxxxxxxxxxxx>; William Huang <william.huang@xxxxxxxxxxxx>; Fengming Wu <fengming.wu@xxxxxxxxxxxx>; Jason Jiang <jason.jiang@xxxxxxxxxxxxx>; Ramyakanth Edupuganti <Ramyakanth.Edupuganti@xxxxxxxxxxxxx>; William Cheng <william.cheng@xxxxxxxxxxxxx>; Kim Helper (khelper) <khelper@xxxxxxxxxx>; Linux PCI <linux-pci@xxxxxxxxxxxxxxx>
Subject: Re: One Question About PCIe BUS Config Type with pcie_bus_safe or pcie_bus_perf On NVMe Device

On Tue, Jan 23, 2018 at 11:50:30PM +0000, Radjendirane Codandaramane wrote:
> Hi Bjorne,

s/Bjorne/Bjorn/

Also, the convention on Linux mailing lists is to use interleaved reply style [1] because it makes it easier to follow the conversation.

> Ceiling the MRRS to the MPS value in order to guarantee the
> interoperability in pcie_bus_perf mode does not make sense.

I disagree with this part. The Linux pcie_bus_perf mode isn't my favorite approach, but given the assumptions it makes, it does make sense. I laid out the explanation in [2]; if you disagree with something there, please be specific about what's wrong.
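To make that assumption concrete, here is a rough user-space sketch of the perf-mode rule as I understand it (the struct and helper names are made up for illustration; this is not the drivers/pci code): walk the hierarchy top-down, give each device the smaller of its own MPS capability and its parent's MPS, then clamp its MRRS to its own MPS so the completions it receives never exceed what it can handle, even when devices upstream run a larger MPS.

  /*
   * User-space sketch only: models the top-down MPS/MRRS assignment of
   * pcie_bus_perf as I understand it.  Names are made up for
   * illustration; this is not the drivers/pci implementation.
   */
  #include <stdio.h>

  struct pcie_node {
          const char *name;
          int mpss;                       /* MPS capability, in bytes */
          int mps, mrrs;                  /* values the model would program */
          const struct pcie_node *parent; /* upstream device, NULL at root port */
  };

  static int min_int(int a, int b)
  {
          return a < b ? a : b;
  }

  static void perf_mode_configure(struct pcie_node *dev)
  {
          /* Top-down: a child never runs a larger MPS than its parent. */
          dev->mps = dev->parent ? min_int(dev->mpss, dev->parent->mps)
                                 : dev->mpss;

          /*
           * Clamp MRRS to the device's own MPS so every completion it
           * receives fits within its MPS, even if upstream devices were
           * left at a larger MPS.
           */
          dev->mrrs = dev->mps;
  }

  int main(void)
  {
          /* Example chain: 256B root port, 512B switch, 128B endpoint. */
          struct pcie_node root = { "root port",         256, 0, 0, NULL };
          struct pcie_node swup = { "switch upstream",   512, 0, 0, &root };
          struct pcie_node swdn = { "switch downstream", 512, 0, 0, &swup };
          struct pcie_node ssd  = { "endpoint (SSD)",    128, 0, 0, &swdn };
          struct pcie_node *order[] = { &root, &swup, &swdn, &ssd };

          for (unsigned int i = 0; i < sizeof(order) / sizeof(order[0]); i++) {
                  perf_mode_configure(order[i]);
                  printf("%-18s MPS=%-3d MRRS=%d\n",
                         order[i]->name, order[i]->mps, order[i]->mrrs);
          }
          return 0;
  }

For a 128B-capable endpoint below 256B-capable upstream devices, this gives MPS=256 on the path and MPS=MRRS=128 on the endpoint, which is the combination perf mode is aiming for.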
> A device can make a MemRd request according to the MRRS setting
> (which can be higher than its MPS), but the completer has to respect
> the MPS and send completions accordingly.  As an example, the system
> can configure MPS=128B and MRRS=4K, where an endpoint can make a 4K
> MemRd request, but the completer has to send completions as 128B
> TLPs, respecting the MPS setting.  MRRS does not force a device to
> use a higher MPS value than it is configured to.

This part is certainly true. I used a very similar example in [2]:

  Consider the case where a function has MPS=256 and MRRS=1024: if the
  function generates a 1024-byte read request, it should receive 4
  completions, each with a 256-byte data payload.

> Another factor that needs to be considered for storage devices is
> support of T10 Protection Information (DIF).  For every 512B or 4KB,
> an 8B PI is computed and inserted or verified, which requires the
> 512B of data to arrive in sequence.

MPS and MRRS are parts of the PCIe TLP protocol. I'm not aware of any connection between the PCIe protocol and the T10 Protection Information/DIF stuff. I assume the T10/DIF stuff is content being carried by PCIe, and PCIe doesn't care at all about the content it carries. Can you clarify what the connection is here?

> If the MRRS is < 512B, this might pose out-of-order completions to
> the storage device, if the EP has to submit multiple outstanding read
> requests in order to achieve higher performance.

If I understand correctly, this is exactly the situation described in the implementation note in PCIe r4.0, sec 2.4.1: completions for a single memory read request are always returned in order (Table 2-39, entry D5b), but completions for multiple read requests may be returned in any order because completions may pass each other (entry D5a). The note doesn't mention this, but I think even the multiple outstanding read requests themselves can pass each other (entry B3).

So I agree that if the function generates multiple outstanding read requests, the completions for those requests may come back out of order. For example, if the function generates req-A and req-B, the completions for req-A will be in order and the completions for req-B will be in order, but it might receive cmpl-B1, cmpl-A1, cmpl-B2, cmpl-B3, cmpl-A2, cmpl-A3, cmpl-A4, cmpl-B4.

This reordering may happen no matter what the MRRS setting is. The only way to avoid it is to avoid multiple outstanding read requests.

> This would be a challenge for the storage endpoints that process the
> T10 PI inline with the transfer; now they have to store and process
> the 512B sectors once they receive all the TLPs for that sector.

Are you saying these endpoints depend on MRRS >= 512 for correct operation? I don't think that would be legal, per sec 7.5.3.4, which says "the Function must not generate Read Requests with a size exceeding the set [MRRS] value." That doesn't leave the function any wiggle room -- if the OS programs MRRS=128, the function has to deal with it. There's no provision for the function to say "I can only support MRRS > X".

But you probably mean that the endpoints will function correctly even with MRRS=128, just with poor performance. That's certainly a concern for all devices; reducing MRRS increases overhead, so we should use the largest possible MRRS, subject to the constraints of any MPS/MRRS design (like pcie_bus_perf) and concerns about bandwidth allocation (see the implementation note in sec 7.5.3.4).
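To put a rough number on the overhead point, here is a small stand-alone sketch. The 24-byte per-TLP cost is a nominal header+framing figure I picked purely for illustration, and the request/completion split assumes aligned power-of-two transfers (real completions also split at the Read Completion Boundary), so treat the percentages as ballpark only.

  /*
   * Illustration only: count the TLPs needed for one DMA read of
   * 'xfer' bytes under a few MPS/MRRS combinations, using a nominal
   * 24-byte per-TLP header/framing cost.
   */
  #include <stdio.h>

  #define TLP_OVERHEAD 24        /* assumed bytes of header/framing per TLP */

  static void count_tlps(int xfer, int mrrs, int mps)
  {
          int nreq = (xfer + mrrs - 1) / mrrs;    /* MemRd requests issued     */
          int ncpl = (xfer + mps - 1) / mps;      /* completions carrying data */
          double payload_fraction =
                  (double)xfer / (xfer + TLP_OVERHEAD * (nreq + ncpl));

          printf("xfer=%4dB MRRS=%4d MPS=%3d -> %2d requests, %2d completions, "
                 "~%2.0f%% payload on the link\n",
                 xfer, mrrs, mps, nreq, ncpl, 100.0 * payload_fraction);
  }

  int main(void)
  {
          count_tlps(4096, 4096, 128);    /* large MRRS: a single request     */
          count_tlps(4096, 1024, 256);    /* the MPS=256/MRRS=1024 case above */
          count_tlps(4096,  128, 128);    /* MRRS clamped down to a 128B MPS  */
          return 0;
  }

The middle case is the MPS=256/MRRS=1024 example from earlier in this mail (4 completions per request); the last one shows the extra request TLPs you pay for when MRRS is clamped down to a 128B MPS.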
> So, it is better to decouple the MRRS and MPS in pcie_bus_perf mode.
> Like stated earlier in the thread, provide an option to configure
> MRRS separately in pcie_bus_perf mode.

If you read [2] again, it will be obvious why the Linux pcie_bus_perf mode requires the use of MRRS in addition to MPS. At the very end of that email and in [3], I pointed out that the current Linux implementation is unnecessarily conservative in some cases, and there is room for fixing that, which should improve performance. But given the design of pcie_bus_perf, it's impossible to completely decouple MRRS and MPS in that mode.

[1] https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
[2] https://lkml.kernel.org/r/20180119205153.GB160618@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
[3] https://lkml.kernel.org/r/20180123174821.GF5317@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx