Re: [PATCH 1/2] dma-buf: heaps: DMA_HEAP_IOCTL_ALLOC_READ_FILE framework

On 2024/7/16 20:07, Christian König wrote:
Am 16.07.24 um 11:31 schrieb Daniel Vetter:
On Tue, Jul 16, 2024 at 10:48:40AM +0800, Huan Yang wrote:
I just researched udmabuf. Please correct me if I'm wrong.

On 2024/7/15 20:32, Christian König wrote:
Am 15.07.24 um 11:11 schrieb Daniel Vetter:
On Thu, Jul 11, 2024 at 11:00:02AM +0200, Christian König wrote:
Am 11.07.24 um 09:42 schrieb Huan Yang:
Some users may need to load a file into a dma-buf. The current way is:
     1. allocate a dma-buf, get a dma-buf fd
     2. mmap the dma-buf fd into a vaddr
     3. read(file_fd, vaddr, fsz)
This is too heavy if fsz reaches the GB range.
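
For reference, a minimal sketch of that current path against the dma-heap uapi; the helper name, heap node path and error handling below are only illustrative:

```c
// Minimal sketch of the current "allocate + mmap + read" path via the
// dma-heap uapi. Helper name, heap node path and error handling are
// illustrative only.
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/dma-heap.h>

static int load_file_into_dmabuf(const char *path, size_t fsz)
{
    struct dma_heap_allocation_data alloc = {
        .len = fsz,
        .fd_flags = O_RDWR | O_CLOEXEC,
    };
    int heap_fd = open("/dev/dma_heap/system", O_RDWR);
    int file_fd = open(path, O_RDONLY);

    /* 1. allocate a dma-buf, get a dma-buf fd */
    if (ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc) < 0)
        return -1;

    /* 2. mmap the dma-buf fd into a vaddr */
    void *vaddr = mmap(NULL, fsz, PROT_READ | PROT_WRITE, MAP_SHARED,
                       alloc.fd, 0);

    /* 3. read the whole file through the CPU mapping (extra CPU copy) */
    read(file_fd, vaddr, fsz);

    munmap(vaddr, fsz);
    close(file_fd);
    close(heap_fd);
    return alloc.fd; /* the dma-buf now holds the file content */
}
```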
You need to describe a bit more why that is too heavy. I can only assume you need to save memory bandwidth and avoid the extra copy with the CPU.

This patch implements a feature called DMA_HEAP_IOCTL_ALLOC_READ_FILE. The user offers a file_fd for the file they want loaded into the dma-buf; the promise is that if you get back a dma-buf fd, it will contain the file content.
Interesting idea, that has at least more potential than trying to enable direct I/O on mmap()ed DMA-bufs.

The approach with the new IOCTL might not work because it is a very
specialized use case.

But IIRC there was a copy_file_range callback in the file_operations structure you could use for that. I'm just not sure when and how that's used with the copy_file_range() system call.
I'm not sure any of those help, because internally they're all still based on struct page (or maybe in the future on folios). And that's the thing dma-buf can't give you, at least without peeking behind the curtain.

I think an entirely different option would be malloc+udmabuf. That essentially handles the impedance mismatch between direct I/O and dma-buf on the dma-buf side. The downside is that it'll make the permanently pinned memory accounting and tracking issues even more apparent, but I guess eventually we do need to sort that one out.
Oh, very good idea!
Just one minor correction: it's not malloc+udmabuf, but rather memfd_create()+udmabuf.
Hm right, it's memfd_create() + mmap(memfd) + udmabuf

And you need to complete your direct I/O before creating the udmabuf
since that reference will prevent direct I/O from working.
udmabuf pins all pages, so once the fd is returned, direct I/O can't be triggered on them (same as dma-buf). So the read must be completed before pinning.
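
A minimal sketch of that ordering, with a made-up helper and illustrative file path (error handling elided):

```c
// Sketch of the suggested ordering: do the O_DIRECT read into the memfd
// pages first, only then pin them by wrapping the memfd into a udmabuf.
// The helper name is made up and error handling is elided.
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/memfd.h>
#include <linux/udmabuf.h>

static int file_to_udmabuf(const char *path, size_t size /* page aligned */)
{
    struct udmabuf_create create = { 0 };
    int memfd = memfd_create("file-backing", MFD_ALLOW_SEALING);
    int filefd = open(path, O_RDONLY | O_DIRECT);
    int devfd = open("/dev/udmabuf", O_RDWR);
    void *vaddr;

    ftruncate(memfd, size);
    fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

    /* 1. direct I/O read into the memfd pages while they are still unpinned */
    vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);
    read(filefd, vaddr, size);
    munmap(vaddr, size);

    /* 2. only now pin the pages by turning the memfd into a dma-buf */
    create.memfd  = memfd;
    create.offset = 0;
    create.size   = size;
    return ioctl(devfd, UDMABUF_CREATE, &create); /* returns a dma-buf fd */
}
```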
Why does pinning prevent direct I/O? I haven't tested, but I'd expect the
rdma folks would be really annoyed if that's the case ...

I used to believe that a pinned page cannot be re-pinned, so performing direct I/O on it would fail. But I was mistaken; it doesn't have any impact.

A dma-buf's mmap vaddr can't trigger direct I/O, because kernel pages (PFN mappings) can't be pinned, so it is not the same case.



Pinning (or rather taking another page reference) prevents writes from using direct I/O because writes try to find all references and make them read only so that nobody modifies the content while the write is done.

As far as I know the same approach is used for NUMA migration and replacing small pages with big ones in THP. But for the read case here it should still work.

Hmm, in my udmabuf direct I/O test I found this does not affect it. The test code is at the tail of this email. Maybe pinning only prevents the page from being reclaimed, rather than preventing the write?



In my test, a udmabuf direct I/O read of a 3GB file costs 2.2s on average (I used ftrace on f2fs_direct_IO to make sure direct I/O actually triggered), which is the same as my normal cached file read.

My patchset averages 1.2s, so the difference between the two is obvious.


But the current udmabuf implementation uses `memfd_pin_folios` to speed up allocation and pinning, so maybe this needs to be adapted to suit that.


I currently doubt that the udmabuf solution is suitable for our
gigabyte-level read operations.

1. The current mmap operation uses faulting, so frequent page faults will be
triggered during reads, resulting in a lot of context switching overhead.

2. The current udmabuf size limit is 64MB; even if it can be changed, it may not be a good fit for large sizes.
Yeah that's just a figleaf so we don't have to bother about the accounting
issue.

3. The migration and adaptation of the driver is also a challenge, and
currently, we are unable to control it.
Why does a udmabuf fd not work instead of any other dmabuf fd? That
shouldn't matter for the consuming driver ...

Perhaps implementing `copy_file_range` would be more suitable for us.
See my other mail, fundamentally these all rely on struct page being
present, and dma-buf doesn't give you that. Which means you need to go
below the dma-buf abstraction. And udmabuf is pretty much the thing for
that, because it wraps normal struct page memory into a dmabuf.

And copy_file_range on the underlying memfd might already work, I haven't
checked though.
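
For illustration, a userspace sketch of that (untested, helper name made up):

```c
// Sketch: copy the file into the memfd that backs the udmabuf using
// copy_file_range(2) instead of mmap+read. Untested; whether the shmem/memfd
// side actually supports an accelerated path is the open question above.
#define _GNU_SOURCE
#include <sys/types.h>
#include <unistd.h>

static int copy_into_memfd(int file_fd, int memfd, size_t size)
{
    off_t off_in = 0, off_out = 0;

    while ((size_t)off_out < size) {
        ssize_t n = copy_file_range(file_fd, &off_in, memfd, &off_out,
                                    size - (size_t)off_out, 0);
        if (n <= 0)
            return -1; /* error, or EOF before the requested size */
    }
    return 0;
}
```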

Yeah completely agree.

Regards,
Christian.

Cheers, Sima

Test code below; testing above 2GB needs this patch:

https://lore.kernel.org/all/20240717065444.369876-1-link@xxxxxxxx/

```c

// SPDX-License-Identifier: GPL-2.0
#define _GNU_SOURCE
#define __EXPORTED_HEADERS__

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <dirent.h>
#include <malloc.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/mman.h>
#include <time.h>
#include <linux/memfd.h>
#include <linux/udmabuf.h>

#define TEST_PREFIX    "drivers/dma-buf/udmabuf"

// static int memfd_create(const char *name, unsigned int flags)
// {
//     return syscall(__NR_memfd_create, name, flags);
// }

int main(void)
{
    struct udmabuf_create create;
    int devfd, memfd, buf, ret;
    unsigned long size;
        int filefd;
        struct timespec ts_start, ts_end;
    long long start, end;

        clock_gettime(CLOCK_MONOTONIC, &ts_start);

    devfd = open("/dev/udmabuf", O_RDWR);
    if (devfd < 0) {
        printf("%s: [skip,no-udmabuf: Unable to access DMA buffer device file]\n",
               TEST_PREFIX);
        exit(77);
    }

    memfd = memfd_create("udmabuf-test", MFD_ALLOW_SEALING);
    if (memfd < 0) {
        printf("%s: [skip,no-memfd]\n", TEST_PREFIX);
        exit(77);
    }

    ret = fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);
    if (ret < 0) {
        printf("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX);
        exit(77);
    }

        filefd = open("/data/model.txt", O_RDONLY | O_DIRECT);
        if (filefd < 0) {
                printf("%s: [failed to open model.txt]\n", TEST_PREFIX);
                exit(77);
        }

        struct stat ftat;
        fstat(filefd, &ftat);
        /* round the file size up to a whole number of pages */
        size = (ftat.st_size + getpagesize() - 1) & ~((unsigned long)getpagesize() - 1);

    ret = ftruncate(memfd, size);
    if (ret == -1) {
        printf("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX);
        exit(1);
    }

    memset(&create, 0, sizeof(create));

    /* should work */
    create.memfd  = memfd;
    create.offset = 0;
    create.size   = size;
    buf = ioctl(devfd, UDMABUF_CREATE, &create);
    if (buf < 0) {
        printf("%s: [FAIL,test-4]\n", TEST_PREFIX);
        exit(1);
    }

    // fprintf(stderr, "%s: ok\n", TEST_PREFIX);

        void *vaddr = mmap(NULL, size, PROT_WRITE | PROT_READ,
                         MAP_SHARED, memfd, 0);
        if (vaddr == MAP_FAILED) {
                printf("%s: [FAIL, mmap]\n", TEST_PREFIX);
                exit(77);
        }

        unsigned long rsize = size;
        unsigned long bytes = 0;
        while (bytes != size) {
                /* read into the memfd mapping, advancing past what was read */
                ssize_t rb = read(filefd, (char *)vaddr + bytes, rsize);
                if (rb < 0) {
                        printf("%s: [FAIL, read]\n", TEST_PREFIX);
                        exit(77);
                }
                if (rb == 0)
                        break; /* EOF: file size may not be page aligned */
                rsize -= rb;
                bytes += rb;
        }
        munmap(vaddr, size);
        clock_gettime(CLOCK_MONOTONIC, &ts_end);

#define NSEC_PER_SEC 1000000000LL
        start = ts_start.tv_sec * NSEC_PER_SEC + ts_start.tv_nsec;
        end = ts_end.tv_sec * NSEC_PER_SEC + ts_end.tv_nsec;

        printf("total cost %lld ns\n", end - start);

        printf("going to check content\n");
        void *fvaddr = mmap(NULL, size, PROT_READ, MAP_SHARED, filefd, 0);
        if (fvaddr == MAP_FAILED) {
                printf("%s: [FAIL, mmap file]\n", TEST_PREFIX);
                exit(77);
        }
        vaddr = mmap(NULL, size, PROT_READ, MAP_SHARED, buf, 0);
        if (vaddr == MAP_FAILED) {
                printf("%s: [FAIL, mmap dmabuf]\n", TEST_PREFIX);
                exit(77);
        }

        if (memcmp(fvaddr, vaddr, size) != 0) {
                printf("%s: [FAIL, content is not same]\n", TEST_PREFIX);
                exit(77);
        }

        printf("%s: [SUCCESS, content is same]\n", TEST_PREFIX);
        munmap(vaddr, size);
        munmap(fvaddr, size);
        close(filefd);
    close(buf);
    close(memfd);
    close(devfd);
    return 0;
}

```



