On 2020/7/20 7:16 PM, Eugenio Pérez wrote:
> On Mon, Jul 20, 2020 at 11:27 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>> On Thu, Jul 16, 2020 at 07:16:27PM +0200, Eugenio Perez Martin wrote:
>>> On Fri, Jul 10, 2020 at 7:58 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>>>> On Fri, Jul 10, 2020 at 07:39:26AM +0200, Eugenio Perez Martin wrote:
>>>>>>> How about playing with the batch size? Make it a mod parameter instead
>>>>>>> of the hard coded 64, and measure for all values 1 to 64 ...
>>>>>> Right, according to the test result, 64 seems to be too aggressive in
>>>>>> the case of TX.
>>>>> Got it, thanks both!
>>>> In particular I wonder whether with batch size 1
>>>> we get same performance as without batching
>>>> (would indicate 64 is too aggressive)
>>>> or not (would indicate one of the code changes
>>>> affects performance in an unexpected way).
>>>>
>>>> --
>>>> MST
>>> Hi!
>>>
>>> Varying batch_size as drivers/vhost/net.c:VHOST_NET_BATCH,
>> sorry this is not what I meant.
>>
>> I mean something like this:
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 0b509be8d7b1..b94680e5721d 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -1279,6 +1279,10 @@ static void handle_rx_net(struct vhost_work *work)
>>  	handle_rx(net);
>>  }
>>
>> +MODULE_PARM_DESC(batch_num, "Number of batched descriptors. (offset from 64)");
>> +module_param(batch_num, int, 0644);
>> +static int batch_num = 0;
>> +
>>  static int vhost_net_open(struct inode *inode, struct file *f)
>>  {
>>  	struct vhost_net *n;
>> @@ -1333,7 +1337,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>>  		vhost_net_buf_init(&n->vqs[i].rxq);
>>  	}
>>  	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
>> -		       UIO_MAXIOV + VHOST_NET_BATCH,
>> +		       UIO_MAXIOV + VHOST_NET_BATCH + batch_num,
>>  		       VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
>>  		       NULL);
>>
>> then you can try tweaking batching and playing with mod parameter without
>> recompiling.
>>
>> VHOST_NET_BATCH affects lots of other things.
> Ok, got it. Since they were aligned from the start, I thought it was a good
> idea to maintain them in-sync.
>
>>> and testing the pps as previous mail says.
>>> This means that we have either only vhost_net batching (in base testing,
>>> as before applying this patch) or both batching sizes the same.
>>>
>>> I've checked that the vhost process (and pktgen) goes to 100% cpu also.
>>>
>>> For tx: Batching always decreases the performance, in all cases. Not
>>> sure why bufapi made things better the last time.
>>>
>>> Batching makes improvements until 64 bufs; I see increments of pps, but
>>> only like 1%.
>>>
>>> For rx: Batching always improves performance. It seems that if we
>>> batch little, bufapi decreases performance, but beyond 64, bufapi is
>>> much better. The bufapi version keeps improving until I set a batching
>>> of 1024. So I guess it is super good to have a bunch of buffers to
>>> receive.
>>>
>>> Since with this test I cannot disable event_idx or things like that,
>>> what would be the next step for testing?
>>>
>>> Thanks!
>>>
>>> --
>>> Results:
>>> # Buf size: 1,16,32,64,128,256,512
>>>
>>> # Tx
>>> # ===
>>> # Base
>>> 2293304.308,3396057.769,3540860.615,3636056.077,3332950.846,3694276.154,3689820
>>> # Batch
>>> 2286723.857,3307191.643,3400346.571,3452527.786,3460766.857,3431042.5,3440722.286
>>> # Batch + Bufapi
>>> 2257970.769,3151268.385,3260150.538,3379383.846,3424028.846,3433384.308,3385635.231,3406554.538
>>>
>>> # Rx
>>> # ==
>>> # pktgen results (pps)
>>> 1223275,1668868,1728794,1769261,1808574,1837252,1846436
>>> 1456924,1797901,1831234,1868746,1877508,1931598,1936402
>>> 1368923,1719716,1794373,1865170,1884803,1916021,1975160
>>> # Testpmd pps results
>>> 1222698.143,1670604,1731040.6,1769218,1811206,1839308.75,1848478.75
>>> 1450140.5,1799985.75,1834089.75,1871290,1880005.5,1934147.25,1939034
>>> 1370621,1721858,1796287.75,1866618.5,1885466.5,1918670.75,1976173.5,1988760.75,1978316
>>>
>>> pktgen was run again for rx with 1024 and 2048 buf size, giving
>>> 1988760.75 and 1978316 pps. Testpmd goes the same way.
>> Don't really understand what does this data mean.
>> Which number of descs is batched for each run?
> Sorry, I should have explained better. I will expand here, but feel free to
> skip it since we are going to discard the data anyway. Or to propose a
> better way to tell them.
>
> It is a CSV with the values I've obtained, in pps, from pktgen and testpmd.
> This way it is easy to plot them.
>
> Maybe it is easier as tables, if mail readers/gmail do not misalign them.
>
>>> # Tx
>>> # ===
> Base: With the previous code, not integrating any patch. testpmd is in
> txonly mode, and the tap interface does XDP_DROP on everything.
>
> We vary VHOST_NET_BATCH (1, 16, 32, ...). As Jason put in a previous mail:
>
> TX: testpmd(txonly) -> virtio-user -> vhost_net -> XDP_DROP on TAP
>
> 1           | 16          | 32          | 64          | 128         | 256         | 512
> 2293304.308 | 3396057.769 | 3540860.615 | 3636056.077 | 3332950.846 | 3694276.154 | 3689820
>
> If we add the batching part of the series, but not the bufapi:
>
> 1           | 16          | 32          | 64          | 128         | 256         | 512
> 2286723.857 | 3307191.643 | 3400346.571 | 3452527.786 | 3460766.857 | 3431042.5   | 3440722.286
>
> And if we add the bufapi part, i.e., all the series:
>
> 1           | 16          | 32          | 64          | 128         | 256         | 512         | 1024
> 2257970.769 | 3151268.385 | 3260150.538 | 3379383.846 | 3424028.846 | 3433384.308 | 3385635.231 | 3406554.538
>
> For easier treatment, all in the same table:
>
> 1           | 16          | 32          | 64          | 128         | 256         | 512         | 1024
> ------------+-------------+-------------+-------------+-------------+-------------+-------------+------------
> 2293304.308 | 3396057.769 | 3540860.615 | 3636056.077 | 3332950.846 | 3694276.154 | 3689820     |
> 2286723.857 | 3307191.643 | 3400346.571 | 3452527.786 | 3460766.857 | 3431042.5   | 3440722.286 |
> 2257970.769 | 3151268.385 | 3260150.538 | 3379383.846 | 3424028.846 | 3433384.308 | 3385635.231 | 3406554.538
>
>>> # Rx
>>> # ==
> The rx tests are done with pktgen injecting packets in tap interface, and
> testpmd in rxonly forward mode.
>
> Again, each column is a different value of VHOST_NET_BATCH, and each row is
> base, +batching, and +buf_api:
>
>>> # pktgen results (pps)
> (Didn't record extreme cases like >512 bufs batching)
>
> 1       | 16      | 32      | 64      | 128     | 256     | 512
> --------+---------+---------+---------+---------+---------+--------
> 1223275 | 1668868 | 1728794 | 1769261 | 1808574 | 1837252 | 1846436
> 1456924 | 1797901 | 1831234 | 1868746 | 1877508 | 1931598 | 1936402
> 1368923 | 1719716 | 1794373 | 1865170 | 1884803 | 1916021 | 1975160
>
>>> # Testpmd pps results
> 1           | 16          | 32          | 64          | 128         | 256         | 512         | 1024        | 2048
> ------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+------------
> 1222698.143 | 1670604     | 1731040.6   | 1769218     | 1811206     | 1839308.75  | 1848478.75  |             |
> 1450140.5   | 1799985.75  | 1834089.75  | 1871290     | 1880005.5   | 1934147.25  | 1939034     |             |
> 1370621     | 1721858     | 1796287.75  | 1866618.5   | 1885466.5   | 1918670.75  | 1976173.5   | 1988760.75  | 1978316
>
> The last extreme cases (>512 bufs batched) were recorded just for the
> bufapi case.
>
> Does that make sense now?
>
> Thanks!
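
As a reading aid for the tables above, the relative change of each series against the base row can be computed per batch size. What follows is only a minimal C sketch and not part of the original thread: the helper names (pct_change, count_below, print_deltas, report) are made up for illustration, and the arrays simply copy the Tx and pktgen Rx numbers for batch sizes 1 to 512 from the tables.

```c
#include <stddef.h>
#include <stdio.h>

#define N_SIZES 7

/* VHOST_NET_BATCH values used as columns. 1024 and 2048 are left out
 * because only the bufapi runs recorded them. */
static const int batch_sizes[N_SIZES] = {1, 16, 32, 64, 128, 256, 512};

/* Tx pps, copied from the thread. */
static const double tx_base[N_SIZES] = {
    2293304.308, 3396057.769, 3540860.615, 3636056.077,
    3332950.846, 3694276.154, 3689820.0};
static const double tx_batch[N_SIZES] = {
    2286723.857, 3307191.643, 3400346.571, 3452527.786,
    3460766.857, 3431042.5, 3440722.286};
static const double tx_bufapi[N_SIZES] = {
    2257970.769, 3151268.385, 3260150.538, 3379383.846,
    3424028.846, 3433384.308, 3385635.231};

/* Rx pps as reported by pktgen, copied from the thread. */
static const double rx_base[N_SIZES] = {
    1223275, 1668868, 1728794, 1769261, 1808574, 1837252, 1846436};
static const double rx_batch[N_SIZES] = {
    1456924, 1797901, 1831234, 1868746, 1877508, 1931598, 1936402};
static const double rx_bufapi[N_SIZES] = {
    1368923, 1719716, 1794373, 1865170, 1884803, 1916021, 1975160};

/* Relative change of value against base, in percent. */
double pct_change(double base, double value)
{
    return (value - base) / base * 100.0;
}

/* Number of batch sizes for which series is strictly below base. */
size_t count_below(const double *base, const double *series, size_t n)
{
    size_t i, below = 0;

    for (i = 0; i < n; i++)
        if (series[i] < base[i])
            below++;
    return below;
}

/* Print the relative change of one series against base, one line per
 * batch size. n must not exceed N_SIZES. */
void print_deltas(FILE *out, const char *label,
                  const double *base, const double *series, size_t n)
{
    size_t i;

    fprintf(out, "%s\n", label);
    for (i = 0; i < n; i++)
        fprintf(out, "  batch %4d: %+7.2f%%\n",
                batch_sizes[i], pct_change(base[i], series[i]));
}

/* Compare every patched series with its base row. */
void report(FILE *out)
{
    print_deltas(out, "tx: batching vs base", tx_base, tx_batch, N_SIZES);
    print_deltas(out, "tx: batching + bufapi vs base", tx_base, tx_bufapi, N_SIZES);
    print_deltas(out, "rx: batching vs base", rx_base, rx_batch, N_SIZES);
    print_deltas(out, "rx: batching + bufapi vs base", rx_base, rx_bufapi, N_SIZES);
}
```

Calling report(stdout) from a small main() prints one line per batch size for each comparison, and count_below(tx_base, tx_batch, N_SIZES) gives the number of batch sizes at which the batching-only Tx run is slower than base.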

I wonder why we saw such a huge difference between TX and RX pps. Have you used samples/pktgen/XXX for doing the test? Maybe you can paste the perf record result for the pktgen thread + vhost thread.

Thanks