Patch 1 implements it.
Patches 2-5 provide an approach for performance improvement.
The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
synchronously read files using direct I/O. This saves CPU copies and
avoids a certain amount of memory thrashing (page cache generation
and reclamation). The larger the file, the more significant these
benefits become.
However, there are also ways to improve performance, not just save
system resources:
Due to the large file size, for example an AI 7B model of around
3.4GB, allocating the DMA-BUF memory takes a relatively long time.
Waiting for the allocation to complete before reading the file adds
to the overall time. The total time for DMA-BUF allocation plus file
read can therefore be expressed as:
T(total) = T(alloc) + T(I/O)
However, if we change the approach, we do not need to wait for the
whole DMA-BUF allocation to complete before initiating I/O. During
allocation we already hold a portion of the pages, so waiting for the
remaining page allocations to finish before reading the file wastes
the pages that have already been allocated.
Pages are allocated sequentially, and the file is read sequentially,
with the content and size corresponding to the file. This means that
the memory location of each page, and which part of the file it will
hold, can be determined at allocation time.
However, to make full use of I/O bandwidth, it is best to gather a
certain number of pages before initiating the read. The default
gather size is 128MB. Each gathered batch can be treated as one
file-read work item: the gathered pages are mapped into the vmalloc
area to obtain a contiguous virtual address, which is used as the
buffer for the corresponding portion of the file. With direct I/O,
the file content is written directly into the dma-buf pages without
any additional copy (compared to, for example, a pipe buffer).
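The core of this idea can be sketched as follows. This is illustrative
only, not the actual patch code: the function name and parameters are
made up, and it assumes kernel context (<linux/vmalloc.h>, <linux/fs.h>).
The already-allocated pages of one gather batch are mapped with vmap()
and a single read is issued into the resulting contiguous buffer:

/* Illustrative sketch: read one gathered batch of dma-buf pages. */
static int heap_read_gathered_pages(struct file *file, loff_t pos,
				    struct page **pages,
				    unsigned int nr_pages, size_t len)
{
	void *vaddr;
	ssize_t ret;

	/* Map the already-allocated pages to a contiguous kernel address. */
	vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
	if (!vaddr)
		return -ENOMEM;

	/* With an O_DIRECT file, this lands straight in the dma-buf pages. */
	ret = kernel_read(file, vaddr, len, &pos);

	vunmap(vaddr);
	return ret < 0 ? ret : 0;
}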
Consider the other ways of reading into a dma-buf. If we read after
mmap'ing the dma-buf, the dma-buf pages must be mapped into the user
virtual address space, and the udmabuf memfd path needs the same kind
of mapping. Even if we supported sendfile, the copy would still need
a buffer that has to be set up.
So, mapping the pages into the vmalloc area does not incur any
additional performance overhead compared to the other methods.[6]
The administrator can also tune the gather size through patch 5.
With asynchronous reads, the time taken for system_heap buffer
allocation and file reading becomes:
T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
compared to the synchronous read:
T(total) = T(alloc) + T(I/O)
Whichever of allocation and I/O takes longer dominates the total, and
the shorter of the two is hidden behind it.
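For illustration only (these numbers are invented, not measured): if
T(alloc) = 200ms and T(I/O) = 300ms, the synchronous path takes about
200 + 300 = 500ms, while the overlapped path takes about
T(first gather page) + Max(T(remain alloc), T(I/O)) ~= 10 + Max(190, 300)
= 310ms, because most of the allocation time is hidden behind the I/O.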
Therefore, the larger the file to be read, the greater the benefit.
How to use
===
Consider the current pathway for loading model files into DMA-BUF
(a C sketch follows the list):
1. open dma-heap, get heap fd
2. open file, get file_fd(can't use O_DIRECT)
3. use file len to allocate dma-buf, get dma-buf fd
4. mmap dma-buf fd, get vaddr
5. read(file_fd, vaddr, file_size) into dma-buf pages
6. share, attach, whatever you want
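A minimal userspace sketch of this pathway, with error handling
trimmed; the "system" heap name is just an example:

#define _GNU_SOURCE		/* for O_DIRECT in the later sketch */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/dma-heap.h>

/* Current pathway: allocate by length, then mmap and read(). */
int load_file_into_dmabuf(const char *path)
{
	int heap_fd = open("/dev/dma_heap/system", O_RDWR);
	int file_fd = open(path, O_RDONLY);	/* buffered I/O only */
	struct stat st;

	fstat(file_fd, &st);

	struct dma_heap_allocation_data data = {
		.len = st.st_size,
		.fd_flags = O_RDWR | O_CLOEXEC,
	};
	ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);

	void *vaddr = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
			   MAP_SHARED, data.fd, 0);
	read(file_fd, vaddr, st.st_size);	/* extra CPU copy happens here */

	munmap(vaddr, st.st_size);
	close(file_fd);
	close(heap_fd);
	return data.fd;		/* dma-buf fd, ready to share/attach */
}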
Using DMA_HEAP_ALLOC_AND_READ_FILE requires just a small change
(a sketch follows the list):
1. open dma-heap, get heap fd
2. open file, get file_fd (buffered/direct)
3. allocate dma-buf with the DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
   passing file_fd instead of len; get dma-buf fd (already contains
   the file content)
4. share, attach, whatever you want
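A sketch of the new pathway, reusing the headers from the previous
example. The flag name and the "file_fd instead of len" convention
follow the description in this cover letter (patch 1), so treat this
as an assumption about the uapi rather than its final form:

/* New pathway: allocate and read in one ioctl via the heap flag. */
int load_file_into_dmabuf_read(const char *path)
{
	int heap_fd = open("/dev/dma_heap/system", O_RDWR);
	int file_fd = open(path, O_RDONLY | O_DIRECT);	/* buffered also works */

	struct dma_heap_allocation_data data = {
		.len = file_fd,		/* file_fd passed instead of len */
		.fd_flags = O_RDWR | O_CLOEXEC,
		.heap_flags = DMA_HEAP_ALLOC_AND_READ_FILE,
	};
	ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);

	close(file_fd);
	close(heap_fd);
	return data.fd;		/* dma-buf fd already holds the file content */
}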
So it is easy to test.
How to test
===
The performance comparison covers the following scenarios:
1. normal
2. udmabuf with patch [3]
3. sendfile
4. patch 1 only
5. patches 1-4
normal:
1. open dma-heap, get heap fd
2. open file, get file_fd(can't use O_DIRECT)
3. use file len to allocate dma-buf, get dma-buf fd
4. mmap dma-buf fd, get vaddr
5. read(file_fd, vaddr, file_size) into dma-buf pages
6. share, attach, whatever you want
udmabuf steps (a sketch follows the list):
1. memfd_create
2. open file(buffer/direct)
3. udmabuf create
4. mmap memfd
5. read file into memfd vaddr
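A rough sketch of this comparison path; the sealing and alignment
details are assumptions about a generic udmabuf setup, not the exact
test tool used here:

#define _GNU_SOURCE		/* memfd_create, F_ADD_SEALS */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

/* udmabuf path: memfd + UDMABUF_CREATE + mmap + read. */
int load_file_into_udmabuf(const char *path, size_t size)
{
	/* udmabuf expects a sealed memfd and page-aligned offset/size. */
	int memfd = memfd_create("model", MFD_ALLOW_SEALING);
	ftruncate(memfd, size);
	fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

	int dev_fd = open("/dev/udmabuf", O_RDWR);
	struct udmabuf_create create = {
		.memfd = memfd,
		.offset = 0,
		.size = size,
	};
	int dmabuf_fd = ioctl(dev_fd, UDMABUF_CREATE, &create);

	int file_fd = open(path, O_RDONLY);	/* or O_RDONLY | O_DIRECT */
	void *vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_SHARED, memfd, 0);
	read(file_fd, vaddr, size);

	munmap(vaddr, size);
	close(file_fd);
	close(dev_fd);
	close(memfd);
	return dmabuf_fd;
}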
sendfile steps (needs suitable splice_write/write_iter support; used
only for comparison):
1. open dma-heap, get heap fd
2. open file, get file_fd (buffered/direct)
3. use file len to allocate dma-buf, get dma-buf fd
4. sendfile file_fd to dma-buf fd
5. share, attach, whatever you want
patch 1 / patches 1-4:
1. open dma-heap, get heap fd
2. open file, get file_fd (buffered/direct)
3. allocate dma-buf with the DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
   passing file_fd instead of len; get dma-buf fd (already contains
   the file content)
4. share, attach, whatever you want
You can create a file to test it and compare the performance gap
between the schemes. It is best to compare file sizes ranging from
KB through MB to GB. The test data below compares 512KB, 8MB, 1GB
and 3GB files under the scenarios above.
Performance Test
===
12GB RAM phone
UFS 4.0 (maximum speed 4GB/s)
f2fs
kernel 6.1 with patch [7] (otherwise kvec direct I/O reads are not
supported)
no memory pressure
drop_caches is used before each test
The average of 5 test results:
| scheme-size         | 512KB(ns)  | 8MB(ns)    | 1GB(ns)       | 3GB(ns)       |
| ------------------- | ---------- | ---------- | ------------- | ------------- |
| normal              | 2,790,861  | 14,535,784 | 1,520,790,492 | 3,332,438,754 |
| udmabuf buffer I/O  | 1,704,046  | 11,313,476 | 821,348,000   | 2,108,419,923 |
| sendfile buffer I/O | 3,261,261  | 12,112,292 | 1,565,939,938 | 3,062,052,984 |
| patch1-4 buffer I/O | 2,064,538  | 10,771,474 | 986,338,800   | 2,187,570,861 |
| sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 9,777,661,077 |
| patch1 direct I/O   | 813,215    | 6,962,092  | 2,364,211,877 | 5,648,897,554 |
| udmabuf direct I/O  | 1,289,554  | 8,968,138  | 921,480,784   | 2,158,305,738 |
| patch1-4 direct I/O | 1,957,661  | 6,581,999  | 520,003,538   | 1,400,006,107 |