Background
====
We are currently facing some challenges when loading model files into
DMA-BUF:
 1. Our camera application's algorithm model has reached the 1GB level.
 2. Our AI application's 3B model has reached the 1GB level, and the 7B
    model has reached the 3GB level.

All of these internal applications need to read the model file into a
dma-buf so that it can be shared between the CPU and DMA devices.

Consider the current pathway for loading a model file into DMA-BUF:
 1. open dma-heap, get heap fd
 2. open file, get fd
 3. allocate dma-buf, get dma-buf fd
 4. mmap dma-buf fd, get vaddr
 5. read(file_fd, vaddr, file_size) into dma-buf pages
 6. share, attach, whatever you want

IMO, the above process involves two inefficient behaviors:
 1. We have to wait for the dma-buf allocation to succeed before we can
    start loading the file into it.
 2. Loading the file into the dma-buf has to go through the page cache.

As mentioned above, we currently have scenarios where files of gigabyte
size must be loaded into DMA-BUF. That means:
 1. The dma-buf also needs to be gigabytes in size, so if available memory
    is not enough, allocation enters the slow path and waits. If the file
    is read into memory that is already allocated while the rest is still
    being allocated, the two steps run in parallel and save time.
 2. At gigabyte size, the page cache does nothing to speed up the file
    load (it is reclaimed quickly), and loading into the dma-buf needs a
    double copy:
    a) load file into page cache
    b) memcpy from page cache to dma-buf

DMA_HEAP_IOCTL_ALLOC_AND_READ
===
This patchset implements a new ioctl, DMA_HEAP_IOCTL_ALLOC_AND_READ, which
allocates a dma-buf and reads a file into it in a single operation.

This ioctl is similar to DMA_HEAP_IOCTL_ALLOC, but it also reads the file
into the dma-buf. Unlike DMA_HEAP_IOCTL_ALLOC, the user does not pass the
size of the dma-buf, but rather the file descriptor of the opened file.
The user can also supply a `batch`: once the allocated memory reaches this
size, I/O is triggered. The default is 128MB.
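To make the `batch` rule concrete, here is a minimal userspace sketch of how reads could be planned as memory is gathered. It is not the kernel code from this patchset: `struct read_work`, `plan_batches` and `DEFAULT_BATCH` are hypothetical names used only for illustration, and flushing the final partial batch as one shorter read is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for one queued file read (not a kernel type). */
struct read_work {
	uint64_t offset;	/* file offset where this read starts */
	uint64_t len;		/* number of bytes to read */
};

/* 128MB, the default batch size named in this cover letter. */
#define DEFAULT_BATCH (128ULL << 20)

/*
 * Walk a file of file_size bytes as if memory were being allocated in
 * chunk-sized pieces. Each time the bytes gathered since the last read
 * reach batch, record one read covering them. A tail that never reaches
 * batch is flushed as a final, shorter read (an assumption of this sketch).
 *
 * Returns the number of reads planned; at most max_works entries are
 * written to works.
 */
size_t plan_batches(uint64_t file_size, uint64_t chunk, uint64_t batch,
		    struct read_work *works, size_t max_works)
{
	uint64_t done = 0;	/* bytes already covered by a planned read */
	uint64_t gathered = 0;	/* bytes "allocated" but not yet planned */
	size_t n = 0;

	assert(chunk > 0 && batch > 0);

	while (done + gathered < file_size) {
		uint64_t left = file_size - done - gathered;

		/* "Allocate" one more chunk, clamped to the end of the file. */
		gathered += left < chunk ? left : chunk;

		if (gathered >= batch || done + gathered == file_size) {
			if (n < max_works) {
				works[n].offset = done;
				works[n].len = gathered;
			}
			n++;
			done += gathered;
			gathered = 0;
		}
	}
	return n;
}
```

Under these assumptions, a 3GB file gathered in 4KB chunks with the default batch plans 24 reads of 128MB each, so the first read can start after 128MB is allocated instead of after the full 3GB.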
Both buffered I/O and direct I/O (O_DIRECT) are accepted, but if the file
size reaches the GB level, a warning is issued when buffered I/O is used.

In kernel space, a heap_fwork_t kthread consumes all of the produced file
read work. Reads are done by this single thread (with reads this large,
multiple threads may not help).

Reference
===
Currently, there are many patches that aim to make dma-buf support direct
I/O in userspace. A recent example is Liu's work:
  https://lore.kernel.org/all/20240710140948.25870-1-liulei.rjpt@xxxxxxxx/

However, this patchset is not focused on enabling dma-buf to perform
direct I/O in userspace. The main goal is to ensure that the file has been
loaded into dma-buf memory by the time the allocation completes. Buffered
I/O and direct I/O are both just ways to carry out the file read.

Performance
===
Here are some self-test results.

A 3GB file created with dd, 12G RAM phone, UFS4.0, no memory pressure:
  MemTotal:       11583824 kB
  MemFree:         2307972 kB
  MemAvailable:    7287640 kB

Note: mtk_mm-uncached is our phone's heap; you can use system_heap or
another heap to test (the heap needs to support
DMA_HEAP_IOCTL_ALLOC_AND_READ).

1. original
```shell
# create a model file
dd if=/dev/zero of=./model.txt bs=1M count=3072
# drop page cache
echo 3 > /proc/sys/vm/drop_caches
./dmabuf-heap-file-read mtk_mm-uncached normal

# result: total cost 2370513769ns
```

2. DMA_HEAP_IOCTL_ALLOC_AND_READ O_DIRECT
```shell
# create a model file
dd if=/dev/zero of=./model.txt bs=1M count=3072
# drop page cache
echo 3 > /proc/sys/vm/drop_caches
./dmabuf-heap-file-read mtk_mm-uncached direct_io

# result: total cost 1269239770ns
# use direct_io_check to verify that the content matches the file
```

3. DMA_HEAP_IOCTL_ALLOC_AND_READ BUFFER I/O
```shell
# create a model file
dd if=/dev/zero of=./model.txt bs=1M count=3072
# drop page cache
echo 3 > /proc/sys/vm/drop_caches
./dmabuf-heap-file-read mtk_mm-uncached normal_io

# result: total cost 2268621769ns
```

------------------
A 3GB file created with dd, 12G RAM phone, UFS4.0, stressapptest 4G memory
pressure:

1. original
```shell
# create a model file
dd if=/dev/zero of=./model.txt bs=1M count=3072
# drop page cache
echo 3 > /proc/sys/vm/drop_caches
./dmabuf-heap-file-read mtk_mm-uncached normal

# result: total cost 13087213847ns
```

2. DMA_HEAP_IOCTL_ALLOC_AND_READ O_DIRECT
```shell
# create a model file
dd if=/dev/zero of=./model.txt bs=1M count=3072
# drop page cache
echo 3 > /proc/sys/vm/drop_caches
./dmabuf-heap-file-read mtk_mm-uncached direct_io

# result: total cost 2902386846ns
# use direct_io_check to verify that the content matches the file
```

3. DMA_HEAP_IOCTL_ALLOC_AND_READ BUFFER I/O
```shell
# create a model file
dd if=/dev/zero of=./model.txt bs=1M count=3072
# drop page cache
echo 3 > /proc/sys/vm/drop_caches
./dmabuf-heap-file-read mtk_mm-uncached normal_io

# result: total cost 5735579385ns
```

As the results show, O_DIRECT cuts the load time by about 46% without
memory pressure (2.37s to 1.27s). Even buffered I/O improves a little
(2.37s to 2.27s). Under memory pressure the gap becomes more significant:
13.09s drops to 2.90s with O_DIRECT and to 5.74s with buffered I/O.

Here is the test program, which you can build yourself:
```c
/* Build against uapi headers from a kernel with this patchset applied. */
#define _GNU_SOURCE /* for O_DIRECT */
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

#include <linux/dma-buf.h>
#include <linux/dma-heap.h>

#define HEAP_DEVPATH "/dev/dma_heap"
#define TEST_FILE "./model.txt"

enum {
	NORMAL_DMABUF_TEST,
	NORMAL_IO_DMABUF_TEST,
	DIRECT_IO_DMABUF_TEST,
	DIRECT_IO_DMABUF_CHECK_TEST,
};

/* Local assert: print the failed expression and exit. */
#define assert(as)					\
	do {						\
		if (!(as)) {				\
			printf("%s is failed\n", #as);	\
			exit(-1);			\
		}					\
	} while (0)

/* Open /dev/dma_heap/<name> and return the heap fd (negative on error). */
int dmabuf_heap_open(char *name)
{
	int ret, fd;
	char buf[256];

	ret = snprintf(buf, sizeof(buf), "%s/%s", HEAP_DEVPATH, name);
	if (ret < 0 || (size_t)ret >= sizeof(buf)) {
		printf("snprintf failed!\n");
		return -1;
	}

	fd = open(buf, O_RDWR);
	if (fd < 0)
		printf("open %s failed!\n", buf);
	return fd;
}

/* Allocate a dma-buf sized to file_fd and read the file into it. */
int dmabuf_heap_alloc_read_file(int heap_fd, int file_fd, unsigned int flags,
				int *dmabuf_fd)
{
	struct dma_heap_allocation_file_data data = {
		.file_fd = file_fd,
		.fd_flags = O_RDWR | O_CLOEXEC,
		.heap_flags = flags,
	};
	int ret;

	if (dmabuf_fd == NULL)
		return -EINVAL;

	ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC_AND_READ, &data);
	if (ret < 0)
		return ret;
	*dmabuf_fd = (int)data.fd;
	return ret;
}

/* Plain DMA_HEAP_IOCTL_ALLOC of len bytes. */
int dmabuf_heap_alloc(int fd, size_t len, unsigned int flags, int *dmabuf_fd)
{
	struct dma_heap_allocation_data data = {
		.len = len,
		.fd_flags = O_RDWR | O_CLOEXEC,
		.heap_flags = flags,
	};
	int ret;

	if (dmabuf_fd == NULL)
		return -EINVAL;

	ret = ioctl(fd, DMA_HEAP_IOCTL_ALLOC, &data);
	if (ret < 0)
		return ret;
	*dmabuf_fd = (int)data.fd;
	return ret;
}

void dmabuf_heap_test(int type, char *heap_name)
{
	int heapfd = dmabuf_heap_open(heap_name);
	assert(heapfd >= 0);

	if (type == NORMAL_DMABUF_TEST) {
		/* Baseline: allocate first, then copy the file in from userspace. */
		int file_fd = open(TEST_FILE, O_RDONLY);
		unsigned long fsize;
		int dma_buf_fd = -1;
		struct stat ftat;

		assert(file_fd >= 0);
		assert(fstat(file_fd, &ftat) == 0);
		fsize = ftat.st_size;

		dmabuf_heap_alloc(heapfd, fsize, 0, &dma_buf_fd);
		assert(dma_buf_fd >= 0);

		void *file_addr = mmap(NULL, fsize, PROT_READ, MAP_SHARED,
				       file_fd, 0);
		assert(file_addr != MAP_FAILED);
		void *dma_buf_addr = mmap(NULL, fsize, PROT_WRITE, MAP_SHARED,
					  dma_buf_fd, 0);
		assert(dma_buf_addr != MAP_FAILED);

		memcpy(dma_buf_addr, file_addr, fsize);

		munmap(file_addr, fsize);
		munmap(dma_buf_addr, fsize);
		close(file_fd);
		close(dma_buf_fd);
	} else {
		/* New ioctl: the kernel allocates and reads in one call. */
		int file_fd;
		int dma_buf_fd = -1;

		if (type == NORMAL_IO_DMABUF_TEST)
			file_fd = open(TEST_FILE, O_RDONLY);
		else
			file_fd = open(TEST_FILE, O_RDONLY | O_DIRECT);
		assert(file_fd >= 0);

		dmabuf_heap_alloc_read_file(heapfd, file_fd, 0, &dma_buf_fd);
		assert(dma_buf_fd >= 0);

		if (type == DIRECT_IO_DMABUF_CHECK_TEST) {
			/* Compare the dma-buf with the file, 4KB at a time. */
			struct stat ftat;
			assert(fstat(file_fd, &ftat) == 0);
			unsigned long size = ftat.st_size;

			char *dmabuf_addr = (char *)mmap(NULL, size, PROT_READ,
							 MAP_SHARED, dma_buf_fd, 0);
			assert(dmabuf_addr != MAP_FAILED);
			char *file_addr = (char *)mmap(NULL, size, PROT_READ,
						       MAP_SHARED, file_fd, 0);
			assert(file_addr != MAP_FAILED);

			unsigned long i = 0;
			for (; i < size; i += 4096) {
				/* Do not compare past the end of the file. */
				unsigned long n = size - i < 4096 ? size - i : 4096;
				int diff = memcmp(&dmabuf_addr[i], &file_addr[i], n);

				if (diff != 0)
					printf("cur %lu comp size %lu\n", i, size);
				assert(diff == 0);
			}
			munmap(dmabuf_addr, size);
			munmap(file_addr, size);
		}
		close(file_fd);
		close(dma_buf_fd);
	}
	close(heapfd);
}

int main(int argc, char *argv[])
{
	char *dmabuf_heap_name;
	char *type_name;
	int type;
	struct timespec ts_start, ts_end;
	long long start, end;

	if (argc < 3) {
		printf("input heap name, then normal, normal_io, direct_io "
		       "or direct_io_check\n");
		return -1;
	}

	dmabuf_heap_name = argv[1];
	type_name = argv[2];

	if (strcmp(type_name, "normal") == 0)
		type = NORMAL_DMABUF_TEST;
	else if (strcmp(type_name, "direct_io") == 0)
		type = DIRECT_IO_DMABUF_TEST;
	else if (strcmp(type_name, "direct_io_check") == 0)
		type = DIRECT_IO_DMABUF_CHECK_TEST;
	else if (strcmp(type_name, "normal_io") == 0)
		type = NORMAL_IO_DMABUF_TEST;
	else
		exit(-1);

	printf("Testing dmabuf %s", dmabuf_heap_name);
	printf("\n---------------------------------------------\n");

	clock_gettime(CLOCK_MONOTONIC, &ts_start);
	dmabuf_heap_test(type, dmabuf_heap_name);
	clock_gettime(CLOCK_MONOTONIC, &ts_end);

	/* Use 64-bit arithmetic so the conversion cannot overflow. */
	start = ts_start.tv_sec * 1000000000LL + ts_start.tv_nsec;
	end = ts_end.tv_sec * 1000000000LL + ts_end.tv_nsec;

	printf("total cost %lldns\n", end - start);
	return 0;
}
```

Huan Yang (2):
  dma-buf: heaps: DMA_HEAP_IOCTL_ALLOC_READ_FILE framework
  dma-buf: heaps: system_heap support DMA_HEAP_IOCTL_ALLOC_AND_READ

 drivers/dma-buf/dma-heap.c          | 525 +++++++++++++++++++++++++++-
 drivers/dma-buf/heaps/system_heap.c |  53 ++-
 include/linux/dma-heap.h            |  57 ++-
 include/uapi/linux/dma-heap.h       |  32 ++
 4 files changed, 660 insertions(+), 7 deletions(-)

base-commit: 523b23f0bee3014a7a752c9bb9f5c54f0eddae88
--
2.45.2