Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag

On 2024/7/30 21:11, Christian König wrote:
On 30.07.24 at 13:36, Huan Yang wrote:
Either drop the whole approach or change udmabuf to do what you want to do.
OK, if so, do I need to send a patch to make dma-buf support sendfile?

Well the udmabuf approach doesn't need to use sendfile, so no.

Get it, I'll not send again.

About udmabuf: in my testing it can't support large file reads because of the page array allocation.

I already sent a patch for this, but have not received an answer.

https://lore.kernel.org/all/20240725021349.580574-1-link@xxxxxxxx/

Is there anything wrong with my understanding of it?

No, that patch was totally fine. Not getting a response is usually something good.

In other words, when maintainers see something which won't work at all they react immediately, but when nobody complains it usually means you are on the right track.
Thank you for your answer.

As long as nobody has any good arguments against it I'm happy to take that one upstream through drm-misc-next immediately, since it's clearly a stand-alone improvement on its own.

OK, good to know.

Thank you


Regards,
Christian.






Apart from that I don't see a doable way which can be accepted into the kernel.
Thanks for your suggestion.

Regards,
Christian.




Patch 1 implements it.

Patches 2-5 provide an approach for improving performance.

The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
synchronously read files using direct I/O.

This approach saves CPU copying and avoids a certain degree of
memory thrashing (page cache generation and reclamation).

When dealing with large files, the benefits of this approach become
particularly significant.

Beyond saving system resources, there are also ways to improve
performance:

Because the file is large (for example, an AI 7B model of around 3.4GB),
allocating the DMA-BUF memory takes a relatively long time. Waiting for
the allocation to complete before reading the file adds to the overall
time. The total time for DMA-BUF allocation and file read is therefore:

    T(total) = T(alloc) + T(I/O)

However, we don't necessarily need to wait for the DMA-BUF allocation to
complete before initiating I/O. During the allocation we already hold
some of the pages, so waiting for the remaining page allocations to
complete before starting the file read leaves the already-allocated
pages idle.

Pages are allocated sequentially and the file is read sequentially, and
the buffer matches the file in content and size.
This means that the memory location of each page, which holds the
content of a specific position in the file, can be determined at the
time of allocation.

However, to fully leverage I/O performance, it is best to wait and
gather a certain number of pages before initiating batch processing.

The default gather size is 128MB. Each gathered batch is treated as one
file read work item: its pages are mapped into the vmalloc area to obtain
a contiguous virtual address range, which is used as the buffer for the
corresponding file contents. So, when direct I/O is used to read the
file, the file content is written directly into the dma-buf memory
without any additional copying (compared to the pipe buffer).
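
As a rough illustration of the batching arithmetic only (not the code in the patches), the sketch below splits a file into gather-sized read work items. The names gather_batch, gather_batch_count and gather_batch_at are hypothetical, and the sketch ignores page alignment and the actual vmap step.

```c
#include <stdint.h>

/* 128MB, the default gather size quoted in the cover letter. */
#define GATHER_BYTES_DEFAULT (128ULL << 20)

/* Hypothetical descriptor for one file read work item. */
struct gather_batch {
	uint64_t file_offset; /* where in the file this batch starts */
	uint64_t len;         /* number of bytes this batch reads */
};

/* Number of batches needed to cover file_size bytes (0 if gather_bytes is 0). */
static uint64_t gather_batch_count(uint64_t file_size, uint64_t gather_bytes)
{
	if (gather_bytes == 0)
		return 0;
	return (file_size + gather_bytes - 1) / gather_bytes;
}

/*
 * Fill *out with the idx-th batch. Because allocation and file read are
 * both sequential, the file offset follows directly from the batch index.
 * Returns 0 on success, -1 if idx is out of range.
 */
static int gather_batch_at(uint64_t file_size, uint64_t gather_bytes,
			   uint64_t idx, struct gather_batch *out)
{
	uint64_t remain;

	if (idx >= gather_batch_count(file_size, gather_bytes))
		return -1;

	out->file_offset = idx * gather_bytes;
	remain = file_size - out->file_offset;
	out->len = remain < gather_bytes ? remain : gather_bytes;
	return 0;
}
```

Under these assumptions a 3GB file yields 24 batches of 128MB each, and only the last batch of a file whose size is not a multiple of the gather size is shorter.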

Consider the other ways to read into a dma-buf. If we read after mmap-ing
the dma-buf, the pages of the dma-buf must be mapped into the user
virtual address space. The udmabuf memfd needs the same operation.
Even if we supported sendfile, the file copy would also need a buffer
that must be set up.
So, mapping pages to the vmalloc area does not incur any additional
performance overhead compared to other methods.[6]

Of course, the administrator can also change the gather size through patch 5.

The formula for the time taken for system_heap buffer allocation and
file reading through async_read is as follows:

   T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))

Compared to the synchronous read:
   T(total) = T(alloc) + T(I/O)

Whichever of the remaining allocation and the I/O takes longer determines
the total time; the shorter of the two is hidden behind it.

Therefore, the larger the size of the file that needs to be read, the
greater the corresponding benefits will be.
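
The two formulas can be compared with a small sketch. The function names and the numbers in the usage note are illustrative assumptions, not measurements from this series.

```c
#include <stdint.h>

/* Synchronous read: T(total) = T(alloc) + T(I/O). */
static uint64_t total_sync_ns(uint64_t alloc_ns, uint64_t io_ns)
{
	return alloc_ns + io_ns;
}

/*
 * Async read:
 *   T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
 * After the first gather batch is ready, the remaining allocation and
 * the file I/O run concurrently, so only the longer of the two counts.
 */
static uint64_t total_async_ns(uint64_t first_gather_ns,
			       uint64_t remain_alloc_ns, uint64_t io_ns)
{
	uint64_t overlapped = remain_alloc_ns > io_ns ? remain_alloc_ns : io_ns;

	return first_gather_ns + overlapped;
}
```

For example, with a hypothetical 200 ms allocation (10 ms of which is the first gather batch) and 1000 ms of I/O, the synchronous model gives 1200 ms and the async model gives 10 + max(190, 1000) = 1010 ms.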

How to use
===
Consider the current pathway for loading model files into DMA-BUF:
   1. open dma-heap, get heap fd
   2. open file, get file_fd(can't use O_DIRECT)
   3. use file len to allocate dma-buf, get dma-buf fd
   4. mmap dma-buf fd, get vaddr
   5. read(file_fd, vaddr, file_size) into dma-buf pages
   6. share, attach, whatever you want

Using DMA_HEAP_ALLOC_AND_READ_FILE requires only a small change:
   1. open dma-heap, get heap fd
   2. open file, get file_fd(buffer/direct)
   3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag, set file_fd
      instead of len. get dma-buf fd(contains file content)
   4. share, attach, whatever you want

So, it is easy to test.

How to test
===
The performance comparison will be conducted for the following scenarios:
   1. normal
   2. udmabuf with [3] patch
   3. sendfile
   4. only patch 1
   5. patch1 - patch4.

normal:
   1. open dma-heap, get heap fd
   2. open file, get file_fd(can't use O_DIRECT)
   3. use file len to allocate dma-buf, get dma-buf fd
   4. mmap dma-buf fd, get vaddr
   5. read(file_fd, vaddr, file_size) into dma-buf pages
   6. share, attach, whatever you want

UDMA-BUF step:
   1. memfd_create
   2. open file(buffer/direct)
   3. udmabuf create
   4. mmap memfd
   5. read file into memfd vaddr

sendfile steps (requires dma-buf to support splice_write/write_iter; used only for comparison):
   1. open dma-heap, get heap fd
   2. open file, get file_fd(buffer/direct)
   3. use file len to allocate dma-buf, get dma-buf fd
   4. sendfile file_fd to dma-buf fd
   5. share, attach, whatever you want

patch1/patch1-4:
   1. open dma-heap, get heap fd
   2. open file, get file_fd(buffer/direct)
   3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag, set file_fd
      instead of len. get dma-buf fd(contains file content)
   4. share, attach, whatever you want

You can create a file to test it and compare the performance of the approaches. It is best to compare across file sizes from KB to MB to GB.

The following test data will compare the performance differences between 512KB,
8MB, 1GB, and 3GB under various scenarios.

Performance Test
===
   12G RAM phone
   UFS4.0 (maximum speed 4GB/s),
   f2fs
   kernel 6.1 with patch[7] (otherwise kvec direct I/O read is not supported)
   no memory pressure.
   drop_cache is used for each test.

The average of 5 test results:
| scheme-size         | 512KB(ns)  | 8MB(ns)    | 1GB(ns)       | 3GB(ns)       |
| ------------------- | ---------- | ---------- | ------------- | ------------- |
| normal              | 2,790,861  | 14,535,784 | 1,520,790,492 | 3,332,438,754 |
| udmabuf buffer I/O  | 1,704,046  | 11,313,476 | 821,348,000   | 2,108,419,923 |
| sendfile buffer I/O | 3,261,261  | 12,112,292 | 1,565,939,938 | 3,062,052,984 |
| patch1-4 buffer I/O | 2,064,538  | 10,771,474 | 986,338,800   | 2,187,570,861 |
| sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 9,777,661,077 |
| patch1 direct I/O   | 813,215    | 6,962,092  | 2,364,211,877 | 5,648,897,554 |
| udmabuf direct I/O  | 1,289,554  | 8,968,138  | 921,480,784   | 2,158,305,738 |
| patch1-4 direct I/O | 1,957,661  | 6,581,999  | 520,003,538   | 1,400,006,107 |

In this test, sendfile does not help much because it goes through a pipe buffer.

udmabuf performs well, but I don't think our OEM driver can adopt it. (Also, AOSP does not enable this feature.)


Anyway, I am sending this patchset in the hope of further discussion.

Thanks.


So, based on the test results:

When the file is large, the patchset has the highest performance.
Compared to normal, the full patchset is a 50% improvement;
compared to normal, patch1 alone shows a degradation of 41%.
A typical performance breakdown for patch1 is as follows:
   1. alloc cost 188,802,693 ns
   2. vmap cost 42,491,385 ns
   3. file read cost 4,180,876,702 ns
Therefore, performing a single direct I/O read on a large file
may not be the best approach for performance.

The direct I/O performance of the sendfile method is the worst.

When the file size is small, the difference in performance is not
significant. This is consistent with expectations.



Suggested use cases
===
   1. When there is a need to read large files and system resources are
      scarce, especially when memory is limited (GB level). In this
      scenario, using direct I/O for file reading can even bring performance
      improvements (may need patch2-3).
   2. For embedded devices with limited RAM, using direct I/O can save system
      resources and avoid unnecessary data copying. Therefore, even if the
      performance is lower when reading small files, it can still be used
      effectively.
   3. If there is sufficient memory, pinning the page cache of the model files
      in memory and placing the files in an EROFS file system for read-only
      access may be better (EROFS does not support direct I/O).


Changelog
===
  v1 [8]
  v1->v2:
    Uses the heap flag method for alloc and read instead of adding a new
    DMA-buf ioctl command. [9]
    Split the patchset to facilitate review and test.
      patch 1 implements alloc and read, and adds the heap flag for it.
      patch 2-4 offer async read.
      patch 5 makes the gather limit configurable.

Reference
===
[1] https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@xxxxxxx/
[2] https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
[3] https://lore.kernel.org/all/20240725021349.580574-1-link@xxxxxxxx/
[4] https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@xxxxxxxxxxxxx/
[5] https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@xxxxxxxxxxxxx/
[6] https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@xxxxxxx/
[7] https://patchew.org/linux/20230209102954.528942-1-dhowells@xxxxxxxxxx/20230209102954.528942-7-dhowells@xxxxxxxxxx/
[8] https://lore.kernel.org/all/20240711074221.459589-1-link@xxxxxxxx/
[9] https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@xxxxxxx/

Huan Yang (5):
   dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
   dma-buf: heaps: Introduce async alloc read ops
   dma-buf: heaps: support alloc async read file
   dma-buf: heaps: system_heap alloc support async read
   dma-buf: heaps: configurable async read gather limit

  drivers/dma-buf/dma-heap.c          | 552 +++++++++++++++++++++++++++-
  drivers/dma-buf/heaps/system_heap.c |  70 +++-
  include/linux/dma-heap.h            |  53 ++-
  include/uapi/linux/dma-heap.h       |  11 +-
  4 files changed, 673 insertions(+), 13 deletions(-)


base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890






