Re: [PATCH v3 1/1] virtio-blk: avoid preallocating big SGL for data

Max Gurtovoy <mgurtovoy@xxxxxxxxxx> · Thu, 23 Sep 2021 16:40:56 +0300

Hi MST/Jens,

Do we need more review here, or are we OK with the code and the test matrix?

If we're OK, we need to decide whether this goes through the virtio PR or the block PR.

Cheers,

-Max.

On 9/14/2021 3:22 PM, Stefan Hajnoczi wrote:
On Mon, Sep 13, 2021 at 05:50:21PM +0300, Max Gurtovoy wrote:
On 9/6/2021 6:09 PM, Stefan Hajnoczi wrote:
On Wed, Sep 01, 2021 at 04:14:34PM +0300, Max Gurtovoy wrote:
No need to pre-allocate a big buffer for the IO SGL anymore. If a device
has lots of deep queues, preallocation for the sg list can consume
substantial amounts of memory. For a HW virtio-blk device, nr_hw_queues
can be 64 or 128 and each queue's depth might be 128, so the resulting
preallocation for the data SGLs is big.
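
To put rough numbers on "big", here is a minimal sketch of the arithmetic. It assumes a hypothetical 128 SG entries per request (the real count follows the device's seg_max) and 32 bytes per struct scatterlist (typical for x86_64 without CONFIG_DEBUG_SG; also an assumption). The helper name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: sizeof(struct scatterlist) on x86_64 without CONFIG_DEBUG_SG. */
#define ASSUMED_SG_ENTRY_BYTES 32u

/*
 * Hypothetical helper: bytes preallocated for data SGLs when every request
 * of every hardware queue carries its own fixed-size scatterlist array.
 */
static size_t sgl_prealloc_bytes(size_t nr_hw_queues, size_t queue_depth,
                                 size_t sg_entries_per_req)
{
    assert(nr_hw_queues > 0 && queue_depth > 0);
    return nr_hw_queues * queue_depth * sg_entries_per_req *
           ASSUMED_SG_ENTRY_BYTES;
}
```

Under these assumptions, sgl_prealloc_bytes(64, 128, 128) is 32 MiB and sgl_prealloc_bytes(128, 128, 128) is 64 MiB, while keeping only 2 entries per request, sgl_prealloc_bytes(128, 128, 2), is 1 MiB; longer lists are then allocated only while the I/O is in flight.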

Switch to runtime allocation of the SGL for lists longer than 2 entries.
This is the approach used by NVMe drivers so it should be reasonable for
virtio block as well. Runtime SGL allocation has always been the case
for the legacy I/O path so this is nothing new.
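
The inline-or-runtime decision can be modeled in userspace roughly as follows. This is a simplified sketch, not the driver code: all names are hypothetical, and for lists longer than the inline chunk the model allocates the whole list, whereas the kernel's sg_alloc_table_chained() can instead keep the preallocated entries as the first link of a chained list.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a scatterlist entry: only the segment length. */
struct model_seg {
    size_t length;
};

/* Entries embedded in every preallocated request ("2 entries" above). */
#define MODEL_INLINE_SG_CNT 2

/* Hypothetical per-request context, preallocated once per tag. */
struct model_req {
    struct model_seg *sgl;                           /* list in use */
    struct model_seg inline_sg[MODEL_INLINE_SG_CNT]; /* small fixed chunk */
};

/*
 * Give req->sgl room for nents entries: reuse the inline chunk for short
 * lists, allocate at I/O time for longer ones. Returns 0, or -1 on ENOMEM.
 */
static int model_req_alloc_sgl(struct model_req *req, size_t nents)
{
    assert(nents > 0);
    if (nents <= MODEL_INLINE_SG_CNT) {
        memset(req->inline_sg, 0, sizeof(req->inline_sg));
        req->sgl = req->inline_sg;
        return 0;
    }
    req->sgl = calloc(nents, sizeof(*req->sgl));
    return req->sgl ? 0 : -1;
}

/* Undo model_req_alloc_sgl(); the inline chunk itself is never freed. */
static void model_req_free_sgl(struct model_req *req)
{
    if (req->sgl != req->inline_sg)
        free(req->sgl);
    req->sgl = NULL;
}
```

The point of the design is that the common small I/O never touches the allocator, and only requests with more segments pay for a runtime allocation.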

The preallocated small SGL depends on SG_CHAIN, so if the arch doesn't
support SG_CHAIN, use only runtime allocation for the SGL.
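
Why the inline SGL depends on SG_CHAIN, as a minimal userspace model (all names hypothetical; MODEL_ARCH_NO_SG_CHAIN merely stands in for the kernel's CONFIG_ARCH_NO_SG_CHAIN). When a request has more segments than the inline chunk holds, the last inline slot becomes a link to a runtime-allocated chunk and the walker follows it. In the kernel the link is encoded in the scatterlist entry itself via sg_chain() and followed by sg_next(). Without chaining the inline chunk cannot be extended, so the model's inline count drops to 0 and every list comes from the runtime allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#ifdef MODEL_ARCH_NO_SG_CHAIN
#define MODEL_INLINE_SG_CNT 0 /* no links: always allocate at runtime */
#else
#define MODEL_INLINE_SG_CNT 2 /* 1 data slot + 1 slot that may become a link */
#endif

/* Hypothetical entry: a data segment, or a link to the next chunk. */
struct model_sg {
    struct model_sg *chain; /* non-NULL: this slot is a link, not data */
    size_t length;          /* segment length when chain == NULL */
};

/*
 * Lay out nents segments of seg_len bytes, starting in the first chunk
 * (first_cnt slots). If they do not fit, the last slot of the first chunk
 * links to a heap chunk with the rest. Returns the list head (NULL on
 * ENOMEM); *heap receives the runtime chunk, which the caller frees.
 */
static struct model_sg *model_sg_build(struct model_sg *first, size_t first_cnt,
                                       size_t nents, size_t seg_len,
                                       struct model_sg **heap)
{
    *heap = NULL;
    assert(nents > 0);
    if (first_cnt > 0 && nents <= first_cnt) {
        /* Everything fits in the preallocated chunk. */
        for (size_t i = 0; i < nents; i++)
            first[i] = (struct model_sg){ .chain = NULL, .length = seg_len };
        return first;
    }
    /* Data slots usable in the first chunk: all but the link slot. */
    size_t in_first = first_cnt > 1 ? first_cnt - 1 : 0;
    size_t rest = nents - in_first;
    *heap = malloc(rest * sizeof(**heap));
    if (!*heap)
        return NULL;
    for (size_t i = 0; i < rest; i++)
        (*heap)[i] = (struct model_sg){ .chain = NULL, .length = seg_len };
    if (in_first == 0)
        return *heap; /* no usable first chunk: pure runtime allocation */
    for (size_t i = 0; i < in_first; i++)
        first[i] = (struct model_sg){ .chain = NULL, .length = seg_len };
    first[in_first] = (struct model_sg){ .chain = *heap, .length = 0 };
    return first;
}

/* Sum segment lengths over nents data entries, following link slots. */
static size_t model_sg_total(struct model_sg *sg, size_t nents)
{
    size_t total = 0;
    for (size_t i = 0; i < nents; i++) {
        if (sg->chain) /* link slot: jump to the next chunk */
            sg = sg->chain;
        total += sg->length;
        sg++;
    }
    return total;
}
```

With the default inline count of 2, a 5-segment list uses one inline data slot, turns the second inline slot into a link, and puts the remaining 4 segments in the runtime chunk.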

Re-organize the setup of the IO request to fit the new sg chain
mechanism.

No performance degradation was seen (fio libaio engine, 16 jobs,
iodepth 128):

IO size    IOPS Rand Read (before/after)    IOPS Rand Write (before/after)
-------    -----------------------------    ------------------------------
512B       318K/316K                        329K/325K
4KB        323K/321K                        353K/349K
16KB       199K/208K                        250K/275K
128KB      36K/36.1K                        39.2K/41.7K
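
To quantify "no performance degradation" from the table, a tiny helper (hypothetical, not part of the patch or of fio) that computes the relative change of each before/after pair:

```c
#include <assert.h>

/* Relative change from before to after, in percent (positive = more IOPS). */
static double pct_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

Applied to the rows above, the drops stay within about 1.3% (the largest is 512B random write, 329K to 325K, about -1.2%), while 16KB and 128KB go up (16KB random write, 250K to 275K, is +10%).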

I ran fio randread benchmarks with 4k, 16k, 64k, and 128k at iodepth 1,
8, and 64 on two vCPUs. The results look fine; there is no significant
regression.

iodepth=1 and iodepth=64 are very consistent. For some reason the
iodepth=8 runs have significant variance, but I don't think that's the
fault of this patch.

Fio results and the Jupyter notebook export are available here (check
out benchmark.html to see the graphs):

https://gitlab.com/stefanha/virt-playbooks/-/tree/virtio-blk-sgl-allocation-benchmark/notebook

Guest:
- Fedora 34
- Linux v5.14
- 2 vCPUs (pinned), 4 GB RAM (single host NUMA node)
- 1 IOThread (pinned)
- virtio-blk aio=native,cache=none,format=raw
- QEMU 6.1.0

Host:
- RHEL 8.3
- Linux 4.18.0-240.22.1.el8_3.x86_64
- Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
- Intel Optane DC P4800X

Stefan

Thanks, Stefan.

Would you like me to add some of the results to my commit message, or add
a Tested-by tag?

Thanks, there's no need to change the commit description.

Reviewed-by: Stefan Hajnoczi <stefanha@xxxxxxxxxx>
Tested-by: Stefan Hajnoczi <stefanha@xxxxxxxxxx>