io_uring: worker thread NULL dereference during openat op

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Below is a test program that causes multiple io_uring worker threads to hit a NULL dereference while executing openat ops.

The test program hangs forever in a D state. The test program can be run again after the NULL dereferences. However, there are long delays at reboot time because the io_uring_cancel() during do_exit() attempts to wake the worker threads.

The test program is single threaded but it queues multiple openat and close ops with IOSQE_ASYNC set before waiting for completions.

I collected trace with /sys/kernel/tracing/events/io_uring/enable enabled if that is helpful.

The test program reproduces similar problems in the following releases.

mainline v6.9-rc3
stable 6.8.5
Ubuntu 6.5.0-1018-azure

The test program does not reproduce the problem in Ubuntu 5.15.0-1052-azure, which does not have the io_uring audit changes.

The following is the first io_uring worker thread backtrace in the repro against v6.9-rc3.

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 4628 Comm: iou-wrk-4605 Not tainted 6.9.0-rc3 #2
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 11/28/2023
RIP: 0010:strlen (lib/string.c:402)
Call Trace:
 <TASK>
? show_regs (arch/x86/kernel/dumpstack.c:479)
? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434)
? page_fault_oops (arch/x86/mm/fault.c:713)
? do_user_addr_fault (arch/x86/mm/fault.c:1261)
? exc_page_fault (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:72 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1563)
? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)
? __pfx_strlen (lib/string.c:402)
? parent_len (kernel/auditfilter.c:1284)
__audit_inode (kernel/auditsc.c:2381 (discriminator 4))
? link_path_walk.part.0.constprop.0 (fs/namei.c:2324)
path_openat (fs/namei.c:3550 fs/namei.c:3796)
do_filp_open (fs/namei.c:3826)
? alloc_fd (./arch/x86/include/asm/paravirt.h:589 (discriminator 10) ./arch/x86/include/asm/qspinlock.h:57 (discriminator 10) ./include/linux/spinlock.h:204 (discriminator 10) ./include/linux/spinlock_api_smp.h:142 (discriminator 10) ./include/linux/spinlock.h:391 (discriminator 10) fs/file.c:553 (discriminator 10))
io_openat2 (io_uring/openclose.c:140)
io_openat (io_uring/openclose.c:178)
io_issue_sqe (io_uring/io_uring.c:1897)
io_wq_submit_work (io_uring/io_uring.c:2006)
io_worker_handle_work (io_uring/io-wq.c:540 io_uring/io-wq.c:597)
io_wq_worker (io_uring/io-wq.c:258 io_uring/io-wq.c:648)
? __pfx_io_wq_worker (io_uring/io-wq.c:627)
? raw_spin_rq_unlock (./arch/x86/include/asm/paravirt.h:589 ./arch/x86/include/asm/qspinlock.h:57 ./include/linux/spinlock.h:204 ./include/linux/spinlock_api_smp.h:142 kernel/sched/core.c:603) ? finish_task_switch.isra.0 (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:77 kernel/sched/sched.h:1397 kernel/sched/core.c:5163 kernel/sched/core.c:5281)
? __pfx_io_wq_worker (io_uring/io-wq.c:627)
ret_from_fork (arch/x86/kernel/process.c:156)
? __pfx_io_wq_worker (io_uring/io-wq.c:627)
ret_from_fork_asm (arch/x86/entry/entry_64.S:256)
RIP: 0033:0x0

Test program:

#include <cassert>
#include <fcntl.h>
#include <filesystem>
#include <getopt.h>
#include <iostream>
#include <liburing.h>

// This single threaded test program triggers a kernel BUG in an
// io_uring worker thread.
//
// The --count parameter sets the number of times that the following
// workload is executed:
//
// 1. io_uring_prep_openat(same_filename_each_time)
// 2. io_uring_submit
// 3. io_uring_wait_cqe
// 4. io_uring_prep_close(fd_from_openat_in_step_1)
// 5. io_uring_submit
// 6. io_uring_wait_cqe
//
// The SQEs in steps 1 and 4 have the IOSQE_ASYNC flag set.
//
// The --inflight parameter sets the number of the workloads that are
// executing at the same time.
// Steps 1 and 2 are executed, before step 3, until the inflight limit
// is reached.
//
// The sqe user_data is set to the io_uring op to keep the event_loop()
// simple.
// io_uring_sqe_set_data64() was not available during development.
//
// Warning: This test program may cause system instability and reboot
// may be affected.

// Usage:
//
// sudo apt install -y g++
// sudo apt install -y liburing-dev
// g++ ./io_uring_open_close_inflight.cc -o io_uring_open_close_inflight -luring
// sudo dmesg --clear
// ./io_uring_open_close_inflight --directory /tmp/deleteme --count 1 --inflight 1
// sudo dmesg
//
// A good starting point to reproduce the problem is:
// --count 1000000 --inflight 100

void
submit1(struct io_uring& ring, const std::string& fileName)
{
    int ret;
    struct io_uring_sqe* sqe {};
    uintptr_t user_data {IORING_OP_OPENAT};

    int flags {O_RDWR | O_CREAT};
    mode_t mode {0666};

    sqe = io_uring_get_sqe(&ring);
    assert(sqe != nullptr);

    io_uring_prep_openat(sqe, AT_FDCWD, fileName.data(), flags, mode);
    io_uring_sqe_set_data(sqe, reinterpret_cast<void*>(user_data));
    io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);

    ret = io_uring_submit(&ring);
    assert(ret == 1);
}

void
submit2(struct io_uring& ring, const int fd)
{
    int ret;
    struct io_uring_sqe* sqe {};
    uintptr_t user_data {IORING_OP_CLOSE};

    sqe = io_uring_get_sqe(&ring);
    assert(sqe != nullptr);

    io_uring_prep_close(sqe, fd);
    io_uring_sqe_set_data(sqe, reinterpret_cast<void*>(user_data));
    io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);

    ret = io_uring_submit(&ring);
    assert(ret == 1);
}

void
event_loop(struct io_uring& ring,
           const std::string& filePath,
           const uint32_t count_max,
           const uint32_t inflight_max)
{
    uint32_t inflight {0};
    uint32_t count_phase1 {0};
    uint32_t count_phase2 {0};

    int i {0};
    std::string fileName(filePath + "/file" + std::to_string(i));

    while (count_phase2 < count_max) {
        int ret {0};
        int res {0};
        uintptr_t user_data {};


        while ((count_phase1 < count_max) and (inflight < inflight_max)) {
            if (0 == (count_phase1 % 100)) {
                std::cout << "count=" << std::to_string(count_phase1)
			  << std::endl;
            }
            count_phase1 += 1;
            inflight += 1;
            submit1(ring, fileName);
        }

        {
            struct io_uring_cqe* cqe {};

            ret = io_uring_wait_cqe(&ring, &cqe);
            assert(ret == 0);

            res = cqe->res;
user_data = reinterpret_cast<uintptr_t>(io_uring_cqe_get_data(cqe));

            io_uring_cqe_seen(&ring, cqe);
        }

        switch (user_data) {
        case IORING_OP_OPENAT:
            // res is a file descriptor or -errno.
            assert(res >= 0);
            submit2(ring, res);
            break;
        case IORING_OP_CLOSE:
            // assert that the close succeeded.
            assert(res == 0);
            count_phase2 += 1;
            inflight -= 1;
            break;
        default:
            assert(0);
        }
    }

    assert(count_phase1 == count_max);
    assert(count_phase2 == count_max);
    assert(inflight == 0);
}

void
workload(const std::string& filePath, const uint32_t count_max, const uint32_t inflight_max)
{
    int ret {0};
    struct io_uring ring;

    ret = io_uring_queue_init(2 * inflight_max, &ring, 0);
    assert(0 == ret);

    std::filesystem::create_directory(filePath);

    event_loop(ring, filePath, count_max, inflight_max);

    std::filesystem::remove_all(filePath);

    io_uring_queue_exit(&ring);
}

int
main(int argc, char** argv)
{
    std::string executableName {argv[0]};
    std::string filePath {};
    int count {};
    int inflight {};

    struct option options[]
    {
        {"help", no_argument, 0, 'h'},
        {"directory", required_argument, 0, 'd'},
        {"count", required_argument, 0, 'c'},
        {"inflight", required_argument, 0, 'i'},
        { 0, 0, 0, 0 }
    };
    bool printUsage {false};
    int val {};

while ((val = getopt_long_only(argc, argv, "", options, nullptr)) != -1) {
        if (val == 'h') {
            printUsage = true;
        } else if (val == 'd') {
            filePath = optarg;
            if (std::filesystem::exists(filePath)) {
                printUsage = true;
                std::cerr << "directory must not exist" << std::endl;
            }
        } else if (val == 'c') {
            count = atoi(optarg);
            if (0 == count) {
                printUsage = true;
            }
        } else if (val == 'i') {
            inflight = atoi(optarg);
            if (0 == inflight) {
                printUsage = true;
            }
        } else {
            printUsage = true;
        }
    }

    if ((0 == count) || (0 == inflight) || (filePath.empty())) {
        printUsage = true;
    }

    if (printUsage || (optind < argc)) {
        std::cerr << executableName
            << " --directory DIR --count COUNT --inflight INFLIGHT"
            << std::endl;
        exit(1);
    }

    workload(filePath, count, inflight);
    return 0;
}


Details:

The test program has no io_uring worker threads associated after the NULL dereferences.

root@openat-hang-3:~# date
Mon Apr 15 06:03:22 PM UTC 2024

root@openat-hang-3:~# ps -o pid,state,command -C io_uring_open_close_inflight
    PID S COMMAND
4605 D ./io_uring_open_close_inflight --directory /tmp/deleteme --count 1000000 --inflight 100

root@openat-hang-3:~# ps -L -p 4605
    PID     LWP TTY          TIME CMD
   4605    4605 pts/3    00:00:07 io_uring_open_c

root@openat-hang-3:/# ps -e -L | grep -Ee 'iou[-]wrk[-]' --count
0

10 io_uring worker threads are present in dmesg.

root@openat-hang-3:~# nproc --all
8

root@openat-hang-3:/# grep /tmp/io_uring_dmesg_decode.txt -Ee 'CPU.*iou-wrk-' --count
10

root@openat-hang-3:/# grep /tmp/io_uring_dmesg_decode.txt -Ee 'CPU.*iou-wrk-' [Mon Apr 15 18:02:55 2024] CPU: 0 PID: 4628 Comm: iou-wrk-4605 Not tainted 6.9.0-rc3 #2 [Mon Apr 15 18:02:55 2024] CPU: 1 PID: 4635 Comm: iou-wrk-4605 Tainted: G D 6.9.0-rc3 #2 [Mon Apr 15 18:02:56 2024] CPU: 2 PID: 4636 Comm: iou-wrk-4605 Tainted: G D 6.9.0-rc3 #2 [Mon Apr 15 18:02:57 2024] CPU: 3 PID: 4632 Comm: iou-wrk-4605 Tainted: G D 6.9.0-rc3 #2 [Mon Apr 15 18:02:57 2024] CPU: 2 PID: 4634 Comm: iou-wrk-4605 Tainted: G D 6.9.0-rc3 #2 [Mon Apr 15 18:02:58 2024] CPU: 5 PID: 4626 Comm: iou-wrk-4605 Tainted: G D 6.9.0-rc3 #2 [Mon Apr 15 18:02:58 2024] CPU: 7 PID: 4630 Comm: iou-wrk-4605 Tainted: G D 6.9.0-rc3 #2 [Mon Apr 15 18:02:58 2024] CPU: 4 PID: 4707 Comm: iou-wrk-4605 Tainted: G D 6.9.0-rc3 #2 [Mon Apr 15 18:02:58 2024] CPU: 1 PID: 4705 Comm: iou-wrk-4605 Tainted: G D 6.9.0-rc3 #2 [Mon Apr 15 18:02:58 2024] CPU: 0 PID: 4711 Comm: iou-wrk-4605 Tainted: G D 6.9.0-rc3 #2

The alloc_fd lines in the next section were changed to the following.

 ? alloc_fd (... fs/file.c:553 (discriminator 10))

The 10 backtraces are very similar.

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-1.txt
16a17
>  ? __send_ipi_one (arch/x86/hyperv/hv_apic.c:252 (discriminator 3))

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-2.txt
root@openat-hang-3:/tmp/call-traces#

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-3.txt
root@openat-hang-3:/tmp/call-traces#

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-4.txt
14a15
>  ? update_cfs_group (kernel/sched/fair.c:3983)

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-5.txt
root@openat-hang-3:/tmp/call-traces#

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-6.txt
16a17
>  ? __io_req_complete_post (io_uring/io_uring.c:1016)

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-7.txt
root@openat-hang-3:/tmp/call-traces#

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-8.txt
16a17
>  ? __send_ipi_one (arch/x86/hyperv/hv_apic.c:252 (discriminator 3))

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-9.txt
14a15
>  ? update_cfs_group (kernel/sched/fair.c:3983)




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux