Hi Kishon,

Thank you so much for the reply!

On Mon, Aug 26, 2019 at 6:51 AM Kishon Vijay Abraham I <kishon@xxxxxx> wrote:
>
> > This function driver is tested on the following pair of systems. The PCI
> > endpoint is a Xilinx VCU118 board programmed with a SiFive Linux-capable
> > core running Linux 5.2. The PCI host is an x86_64 Intel(R) Core(TM)
> > i3-6100 running unmodified Linux 5.2. The virtual link achieved a
> > stable throughput of ~180KB/s during scp sessions of a 50M file. The
>
> I assume this is not using DMA as below you mentioned you got worse throughput
> with DMA. What's the throughput using DMA?

From host to endpoint, scp speed was 180KB/s without DMA and 130KB/s with
DMA. From endpoint to host, scp speed was 220KB/s without DMA and 150KB/s
with DMA.

My guess is that there are two major reasons for the lower throughput when
DMA is used. Firstly, the platform DMA implementation of the hardware I used
was pretty new and had many inefficient algorithms. Secondly, the DMA
transfer function of pci-epf seems to make blocking calls. pci_epf_tx()
calls pci_epf_data_transfer(), which waits for the DMA transfer to complete.
Since pci_epf_data_transfer() runs on the same kernel thread as the main
function that handles endpoint-to-host transfer in pci_epf_virtio, every
packet that goes through DMA still blocks the thread for the duration of the
transfer. This sort of defeats the purpose of DMA.
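For clarity, the blocking pattern I am describing looks roughly like the
sketch below. This is illustrative only, assuming a plain dmaengine memcpy
channel; pci_epf_dma_copy_blocking() and pci_epf_dma_done() are made-up
names, not the actual code in the ti-linux tree:

/* needs <linux/dmaengine.h> and <linux/completion.h> */
static void pci_epf_dma_done(void *arg)
{
        complete(arg);
}

/* Illustrative only: submit one memcpy and sleep until it finishes. */
static int pci_epf_dma_copy_blocking(struct dma_chan *chan,
                                     dma_addr_t dst, dma_addr_t src,
                                     size_t len)
{
        struct dma_async_tx_descriptor *tx;
        DECLARE_COMPLETION_ONSTACK(done);
        dma_cookie_t cookie;

        tx = dmaengine_prep_dma_memcpy(chan, dst, src, len, DMA_CTRL_ACK);
        if (!tx)
                return -EIO;

        tx->callback = pci_epf_dma_done;
        tx->callback_param = &done;
        cookie = dmaengine_submit(tx);
        if (dma_submit_error(cookie))
                return -EIO;

        dma_async_issue_pending(chan);
        /*
         * The TX kthread sleeps here for every packet, so DMA is fully
         * serialized with the rest of the transmit path.
         */
        wait_for_completion(&done);
        return 0;
}

Completing the descriptors asynchronously from the DMA callback would avoid
sleeping once per packet, but the right way to do that depends on the
unmerged DMA patch mentioned below.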
There was actually a critical error I made about DMA in the patch. The DMA
patch to the endpoint framework,
http://git.ti.com/cgit/cgit.cgi/ti-linux-kernel/ti-linux-kernel.git/tree/drivers/pci/endpoint/pci-epf-core.c?h=ti-linux-4.19.y,
has not been merged into the upstream kernel yet, so DMA-related code should
not appear in this version of the patch. I apologize for this mistake.

> At a high level, you need more layering as it'll help to add more virtio based
> devices over PCIe. Ideally all the vring/virtqueue part should be added as a
> library.
>
> You've modeled the endpoint side as virtio_device. However I would have
> expected this to be vhost_dev and would have tried re-using some parts of
> vhost (ignoring the userspace part). Was this considered during your design?

Thank you for the suggestion about more layering. I have thought about that.
virtio has done a very good job of separating out function-specific drivers
(virtio_net etc.), vring/virtqueue setup (virtio_pci) and actual data
transfer (virtio_ring). Thus in this endpoint function I can easily set up a
remote virtio_device representing the PCI host and a local virtio_device
representing the endpoint itself. The difficulty lies in the actual transfer
of data. For example, virtio_net and virtio_blk use very different transfer
mechanisms. The best I can do now is probably to abstract out the setup
phase as a library of some sort.

I haven't taken a close look at vhost. I used virtio_device mainly because I
did not change the code on the PCI host side, so using the same structs as
virtio_pci and virtio_ring made it easy to access data structures on the PCI
host from the endpoint. Another reason is that in this endpoint function the
use case of virtio_device is not entirely the same as that of kvm/qemu.
Instead, it is probably closer to what veth does, in that it establishes a
connection between a pair of virtio_devices. So far virtio_device has served
the purpose well and I could reuse a lot of code from virtio.

> Please add the Documentation as a separate patch.

Should I submit that as a different patch in the same patch series or as a
completely separate patch? Thanks!

> > + CONFIG_VIRTIO
> > + CONFIG_VIRTIO
> > ^^redundant line.

Will fix.

> > +CONFIG_PCI_HOST_LITTLE_ENDIAN must be set at COMPILE TIME. Toggle it on to build
> > +the module with the PCI host being in little endianness.
>
> It would be better if we could get the endianness of the host at runtime. That
> way irrespective of the host endianness we could use the same kernel image in
> endpoint.

I can imagine two ways of achieving this. The first is to convert the whole
endpoint function to the modern virtio interfaces, because those specify
little-endian encoding for __virtio16, __virtio32 etc. I didn't take that
path because the development platform did not allow me to access some PCI
configuration space registers, such as the vendor-specific capabilities,
which are required to configure a virtio_device representing the PCI host.
The second way is to add a module parameter for the host endianness. The
user has to make sure that the module parameter is set up correctly before
this endpoint function calls linkup(), though.

> > +Enable PCI_ENDPOINT_DMAENGINE if your endpoint controller has an implementation
>
> Presence of dma engine could come from epc_features. Or try to get dma channel
> always and use mem_copy if that fails. config option for dmaengine looks
> un-necessary.

This ties back to the previous point about the unmerged DMA patch. The
correct way to implement DMA here depends on that patch.

> > +config PCI_EPF_VIRTIO_SUPPRESS_NOTIFICATION
> > +        bool "PCI Virtio Endpoint Function Notification Suppression"
> > +        default n
> > +        depends on PCI_EPF_VIRTIO
> > +        help
> > +          Enable this configuration option to allow virtio queues to suppress
> > +          some notifications and interrupts. Normally the host and the endpoint
> > +          send a notification/interrupt to each other after each packet has been
> > +          provided/consumed. Notifications/Interrupts can be generally expensive
> > +          across the PCI bus. If this config is enabled, both sides will only
> > +          signal the other end after a batch of packets has been consumed/
> > +          provided. However, in reality, this option does not offer significant
> > +          performance gain so far.
>
> Would be good to profile and document the bottle-neck so that this could be
> improved upon.

I have a theory for this. The only real "interrupt" is the one from the
endpoint to the host. The "notification" from the host to the endpoint is
actually implemented by the endpoint continuously polling a value in BAR 0.
When the host wants to notify the endpoint, it writes the index of the
virtqueue where an event just occurred to an offset in BAR 0, and the
endpoint has a dedicated loop that monitors that value. Because of this
setup, making the host send fewer notifications does not help: the
bottleneck is probably the expensive polling on the endpoint. As a
consequence, suppressing notifications and interrupts does not seem to
offer a performance gain.
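To make that concrete, the notify path on the endpoint looks roughly like
the loop below. This is an illustrative sketch only; notify_addr,
EPF_VIRTIO_NO_NOTIFY and epf_virtio_handle_notify() are made-up names
standing in for the corresponding pieces of the patch:

/* Illustrative sketch of the dedicated notify-polling loop. */
static int epf_virtio_notify_poll(void *data)
{
        struct pci_epf_virtio *epf_virtio = data;
        /* notify_addr points at the notify offset inside BAR 0 */
        u16 __iomem *notify = epf_virtio->notify_addr;
        u16 qidx;

        while (!kthread_should_stop()) {
                qidx = ioread16(notify);
                if (qidx != EPF_VIRTIO_NO_NOTIFY) {
                        iowrite16(EPF_VIRTIO_NO_NOTIFY, notify);
                        /* process the virtqueue the host just touched */
                        epf_virtio_handle_notify(epf_virtio, qidx);
                }
                /* The cost is in this loop, not in the host-side writes. */
                cond_resched();
        }
        return 0;
}

So any gain from batching the host-side notifications is likely hidden
behind the cost of this polling loop.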
> > +/* Default bar sizes */
> > +static size_t bar_size[] = { 512, 512, 1024, 16384, 131072, 1048576 };
>
> Only use the BARs actually required by the function.

Will do.

> > +/*
> > + * Clear mapped memory of a map. If there is memory allocated using the
> > + * pci-ep framework, that memory will be released.
> > + *
> > + * @map: a map struct pointer that will be unmapped
> > + */
> > +static void pci_epf_unmap(struct pci_epf_map *map)
> > +{
> > +        if (map->iobase) {
>
> how about this instead..
>
>         if (!map->iobase)
>                 return;

Sure.

> > +        align = map->align;
> > +        iosize = (align > PAGE_SIZE && size < align) ? align : size;
>
> The align parameter should already be configured correctly by epc_features and
> the size should be already handled by pci_epc_mem_alloc_addr().

This "align" is exactly the same as the align from epc_features, and this
line of code proved necessary on my development platform. The epc mem
allocator only makes sure the memory it allocates is aligned; it cannot cope
with PCI host memory that is not properly aligned. The endpoint device I
developed on had a disastrous 64K page size: when reading from a physical
memory address on the PCI host, the lower 16 bits of the address were zeroed
out. For example, when the endpoint tried to read the byte at 0x12345 (a
phys_addr_t) on the PCI host, what it actually read was the byte at 0x10000.
Because of this, I had to potentially allocate a much larger space than
asked for. If I wanted to access 0x12345, then after mapping,
map->phys_iobase would be 0x10000, map->phys_ioaddr would be 0x12345, and a
whole 64K memory region would be allocated.
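In code, the mapping math is roughly the following. This is a simplified
sketch; the helper name pci_epf_map_sketch() and the return value are made
up, and only align, phys_iobase and phys_ioaddr correspond to fields of the
pci_epf_map struct in the patch:

/* Simplified sketch of mapping a possibly-unaligned host address. */
static size_t pci_epf_map_sketch(struct pci_epf_map *map,
                                 phys_addr_t host_addr, size_t size)
{
        size_t align = map->align;      /* from epc_features; 64K for me */
        phys_addr_t base = host_addr & ~(phys_addr_t)(align - 1);
        size_t iosize = (align > PAGE_SIZE && size < align) ? align : size;

        /*
         * Example from above with align = 0x10000:
         *   host_addr = 0x12345  ->  base = 0x10000, iosize = 0x10000
         */
        map->phys_iobase = base;        /* aligned base the controller can hit */
        map->phys_ioaddr = host_addr;   /* the address we actually want */

        /* iosize is the size to allocate and map so that the target
         * address falls inside the mapped window. */
        return iosize;
}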
> This looks unnecessary.

See above.

> > +/*
> > + * Get value from the virtio network config of the local virtio device.
> > + *
> > + * @vdev: local virtio device
> > + * @offset: offset of starting memory address from the start of local
> > + * virtio network config in bytes
> > + * @buf: virtual memory address to store the value
> > + * @len: size of requested data in bytes
> > + */
> > +static inline void epf_virtio_local_get(struct virtio_device *vdev,
> > +                                        unsigned int offset,
> > +                                        void *buf,
> > +                                        unsigned int len)
> > +{
> > +        memcpy(buf,
> > +               (void *)&vdev_to_epf_vdev(vdev)->local_net_cfg + offset,
> > +               len);
> > +}
>
> Have all this network specific parts in a separate file. Use the layering
> structure similar to vhost.

Will try to do.

> > +/*
> > + * Initializes the virtio_pci and virtio_net config space that will be exposed
> > + * to the remote virtio_pci and virtio_net modules on the PCI host. This
> > + * includes setting up feature negotiation and default config setup etc.
> > + *
> > + * @epf_virtio: epf_virtio handler
> > + */
> > +static void pci_epf_virtio_init_cfg_legacy(struct pci_epf_virtio *epf_virtio)
> > +{
> > +        const u32 dev_feature =
> > +                generate_dev_feature32(features, ARRAY_SIZE(features));
> > +        struct virtio_legacy_cfg *const legacy_cfg = epf_virtio->reg[BAR_0];
>
> virtio_reg_bar instead of BAR_0

The dilemma is that virtio_pci on the PCI host will only write to BAR 0. I
may need to throw an error if the first free BAR is not BAR 0.

> > +        pci_epc_stop(epc);
>
> You should never have pci_epc_stop() in function driver as that will break
> multi-function endpoint devices. I'll fix this in pci-epf-test.c.

Look forward to your progress on this.

> > +        for (bar = BAR_0; bar <= BAR_5; bar += add) {
>
> Are you using all these BARs? It's best to allocate and initialize the BARs we
> use.

Will only use BAR 0 instead.

> Please add a description for each of these structures.

Unfortunately I had to copy these structures exactly as they are from
virtio_ring.c, because they are not exposed via any header file. If
virtio_ring.c changes these structs, this endpoint function will have to
change accordingly.

Thank you so much for taking the time to review this patch. Now that I am
back at university continuing my undergraduate study, my kernel development
work will probably slow down a lot. The heavy-lifting work, such as creating
more layers to allow more virtio devices, will take a much longer time.

Best,
Haotian