On 5/16/2022 3:34 PM, Christoph Hellwig wrote:
I don't really understand how 'childs' fit in here. The code also
doesn't seem to be usable without patch 2 and a caller of the
new functions added in patch 2, so it is rather impossible to review.
Hi Christoph:
OK. I will merge two patches and add a caller patch. The motivation
is to avoid global spin lock when devices use swiotlb bounce buffer and
this introduces overhead during high throughput cases. In my test
environment, current code can achieve about 24Gb/s network throughput
with SWIOTLB force enabled and it can achieve about 40Gb/s without
SWIOTLB force. Storage also has the same issue.
Per-device IO TLB mem may resolve global spin lock issue among
devices but device still may have multi queues. Multi queues still need
to share one spin lock. This is why introduce child or IO tlb areas in
the previous patches. Each device queues will have separate child IO TLB
mem and single spin lock to manage their IO TLB buffers.
Otherwise, global spin lock still cost cpu usage during high
throughput even when there is performance regression. Each device queues
needs to spin on the different cpus to acquire the global lock. Child IO
TLB mem also may resolve the cpu issue.
Also:
1) why is SEV/TDX so different from other cases that need bounce
buffering to treat it different and we can't work on a general
scalability improvement
Other cases also have global spin lock issue but it depends on
whether hits the bottleneck. The cpu usage issue may be ignored.
2) per previous discussions at how swiotlb itself works, it is
clear that another option is to just make pages we DMA to
shared with the hypervisor. Why don't we try that at least
for larger I/O?
For confidential VM(Both TDX and SEV), we need to use bounce
buffer to copy between private memory that hypervisor can't
access directly and shared memory. For security consideration,
confidential VM should not share IO stack DMA pages with
hypervisor directly to avoid attack from hypervisor when IO
stack handles the DMA data.