https://bugzilla.samba.org/show_bug.cgi?id=15051

            Bug ID: 15051
           Summary: EBADF/EIO errors in rename/open caused by race
                    condition in smb2_compound_op
           Product: CifsVFS
           Version: 5.x
          Hardware: x64
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: kernel fs
          Assignee: sfrench@xxxxxxxxx
          Reporter: ohubsch@xxxxxxxxxxxxxxx
        QA Contact: cifs-qa@xxxxxxxxx
  Target Milestone: ---

There are two different scenarios where we encountered the EBADF/EIO errors:

- Write test: multiple threads write to a file with a "_temp" suffix, and
  after the write completes a rename removes the "_temp" suffix. Each
  process/thread writes to a unique file; there are no overlaps.

- Read test: multiple threads repeatedly pick a random file out of a certain
  set and read it, so several threads can access the same file at the same
  time. The error happens most often when the set of files is small; for
  example, repeated reads of 5 different 20M files by 128 threads triggered
  this for us.

We believe the root cause for both of them is the same: there seems to be a
race condition in smb2_compound_op:

---
 after_close:
	num_rqst++;

	if (cfile) {
		cifsFileInfo_put(cfile); // sends SMB2_CLOSE to the server
		cfile = NULL;
---

This code is triggered by the smb2_query_path_info operation that happens
during revalidate_dentry. In smb2_query_path_info, get_readable_path is called
to load the cfile, increasing the reference counter. If in the meantime this
reference becomes the very last one, the call to cifsFileInfo_put(cfile)
triggers an SMB2_CLOSE request sent to the server just before the compound
request is sent – and the compound request then fails with EBADF or EIO,
depending on the timing at the server, because the handle is already closed.

In the first scenario, the race seems to happen between smb2_query_path_info
(triggered by the rename operation) and the "cleanup" of asynchronous writes –
while fsync(fd) likely waits for the asynchronous writes to complete, releasing
the writeback structures can happen after the close(fd) call. So the EBADF/EIO
errors pop up if the timing is such that:

1) there are still outstanding references in the writeback structures after
   close(fd),
2) smb2_query_path_info successfully fetches the cfile, increasing the
   refcounter by 1,
3) all writeback structures release the same cfile, reducing the refcounter
   to 1,
4) smb2_compound_op is called with that cfile.

In the second scenario, the race seems to be similar – here open triggers the
smb2_query_path_info operation, and if all other threads decrease the
refcounter to 1 in the meantime, similarly to the first scenario, SMB2_CLOSE
is again sent to the server just before the compound request is issued. This
case is harder to reproduce across different machines and the timing is
trickier, so we attach pseudocode for that case.

Proposed patch:

It seems to us that it should be sufficient and harmless to just remove the
two lines:

	cifsFileInfo_put(cfile);
	cfile = NULL;

The cleanup happens anyway just below the call to compound_send_recv, under
the finished: label. Unless something in compound_send_recv can interfere with
this cleanup, it should be safe to remove them. In our testing we didn't
encounter any issues after removing them.
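In diff form, the proposed change is just the removal of those two lines (the
context shown is only the snippet quoted above; in 5.x trees this code lives
in fs/cifs/smb2inode.c, and the exact surrounding lines may differ between
kernel versions):

 after_close:
 	num_rqst++;

 	if (cfile) {
-		cifsFileInfo_put(cfile); // sends SMB2_CLOSE to the server
-		cfile = NULL;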
Observed stack traces (obtained by manually inserting dump_stack() into
_cifsFileInfo_put/smb2_compound_op):

Write test, from _cifsFileInfo_put after hitting the last reference:

[Tue Apr 19 05:38:30 2022] Call Trace:
[Tue Apr 19 05:38:30 2022]  dump_stack_lvl+0x33/0x42
[Tue Apr 19 05:38:30 2022]  _cifsFileInfo_put+0x124/0x3b1 [cifs]
[Tue Apr 19 05:38:30 2022]  ? smb2_plain_req_init+0x43/0x45 [cifs]
[Tue Apr 19 05:38:30 2022]  ? SMB2_query_info_init+0x5c/0xf5 [cifs]
[Tue Apr 19 05:38:30 2022]  cifsFileInfo_put+0x14/0x15 [cifs]
[Tue Apr 19 05:38:30 2022]  smb2_compound_op+0xd78/0xfc0 [cifs]
[Tue Apr 19 05:38:30 2022]  smb2_query_path_info+0x14e/0x238 [cifs]
[Tue Apr 19 05:38:30 2022]  cifs_get_inode_info+0x1f5/0x6fb [cifs]
[Tue Apr 19 05:38:30 2022]  cifs_revalidate_dentry_attr+0x249/0x2e7 [cifs]
[Tue Apr 19 05:38:30 2022]  cifs_revalidate_dentry+0x1a/0x2b [cifs]
[Tue Apr 19 05:38:30 2022]  cifs_d_revalidate+0x6b/0x15c [cifs]
[Tue Apr 19 05:38:30 2022]  lookup_dcache+0x3b/0x60
[Tue Apr 19 05:38:30 2022]  __lookup_hash+0x1f/0xa0
[Tue Apr 19 05:38:30 2022]  ? down_write+0xe/0x40
[Tue Apr 19 05:38:30 2022]  do_renameat2+0x279/0x510
[Tue Apr 19 05:38:30 2022]  ? strncpy_from_user+0x41/0x1a0
[Tue Apr 19 05:38:30 2022]  __x64_sys_rename+0x3c/0x50
[Tue Apr 19 05:38:30 2022]  do_syscall_64+0x3a/0x80
[Tue Apr 19 05:38:30 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Read test, from smb2_compound_op after hitting rc = -9:

Call Trace:
[Wed Apr 20 02:31:10 2022]  dump_stack_lvl+0x33/0x42
[Wed Apr 20 02:31:10 2022]  smb2_compound_op+0x902/0x103b [cifs]
[Wed Apr 20 02:31:10 2022]  ? build_path_from_dentry_optional_prefix+0xa0/0x298 [cifs]
[Wed Apr 20 02:31:10 2022]  ? cifsFileInfo_get+0x29/0x2f [cifs]
[Wed Apr 20 02:31:10 2022]  ? ___ratelimit+0x6f/0xd0
[Wed Apr 20 02:31:10 2022]  smb2_query_path_info+0x14c/0x248 [cifs]
[Wed Apr 20 02:31:10 2022]  cifs_get_inode_info+0x1f5/0x6fb [cifs]
[Wed Apr 20 02:31:10 2022]  cifs_revalidate_dentry_attr+0x249/0x2e7 [cifs]
[Wed Apr 20 02:31:10 2022]  cifs_revalidate_dentry+0x1a/0x2b [cifs]
[Wed Apr 20 02:31:10 2022]  cifs_d_revalidate+0x6b/0x15c [cifs]
[Wed Apr 20 02:31:10 2022]  lookup_fast+0xcd/0x150
[Wed Apr 20 02:31:10 2022]  path_openat+0x114/0x1050
[Wed Apr 20 02:31:10 2022]  ? cifsFileInfo_put_final+0xc0/0xc9 [cifs]
[Wed Apr 20 02:31:10 2022]  ? _cifsFileInfo_put+0x391/0x39b [cifs]
[Wed Apr 20 02:31:10 2022]  do_filp_open+0xb4/0x120
[Wed Apr 20 02:31:10 2022]  ? __check_object_size+0x15f/0x170
[Wed Apr 20 02:31:10 2022]  do_sys_openat2+0x242/0x300
[Wed Apr 20 02:31:10 2022]  do_sys_open+0x4b/0x80
[Wed Apr 20 02:31:10 2022]  do_syscall_64+0x3a/0x80
[Wed Apr 20 02:31:10 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae
Write-case small repro:

#include <iostream>
#include <memory>
#include <string>
#include <thread>
#include <vector>
#include <cerrno>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

std::string path1 = "/mnt/mountpoint/dir";
const int FILES_PER_WORKER = 1000;
const int BUF_SIZE = 1048576 * 20;   // 20M per write
const int BUF_NUMBER_OF_WRITES = 1;
const int WORKERS = 4;

int main() {
    std::vector<std::thread> ts;
    for (int workerNum = 0; workerNum < WORKERS; workerNum++) {
        ts.emplace_back([workerNum]() {
            std::unique_ptr<char[]> buf(new char[BUF_SIZE]);
            for (int i = 0; i < BUF_SIZE; i++)
                buf[i] = 'A';

            // Each worker writes its own disjoint range of files: write to
            // "<path>_temp", fsync, close, then rename to the final name.
            for (int i = 0; i < FILES_PER_WORKER; i++) {
                std::string finalPath =
                    path1 + std::to_string(FILES_PER_WORKER * workerNum + i);
                std::string tempPath = finalPath + "_temp";

                int fd = open(tempPath.c_str(), O_RDWR | O_CREAT | O_TRUNC, 0644);
                if (fd < 0) {
                    std::cerr << "Open failed!" << std::endl;
                    return;
                }

                for (int j = 0; j < BUF_NUMBER_OF_WRITES; j++) {
                    int res = write(fd, buf.get(), BUF_SIZE);
                    if (res != BUF_SIZE) {
                        std::cerr << "Write error!" << std::endl;
                        return;
                    }
                }

                int res = fsync(fd);
                if (res != 0) {
                    std::cerr << "Fsync error!" << std::endl;
                    return;
                }

                res = close(fd);
                if (res != 0) {
                    std::cerr << "Close error!" << std::endl;
                    return;
                }

                res = rename(tempPath.c_str(), finalPath.c_str());
                if (res != 0) {
                    std::cerr << "Rename error! (" << errno << ")" << std::endl;
                    return;
                }
            }
        });
    }

    for (auto& t : ts)
        t.join();
    return 0;
}

For the read test, the following happens on each thread instead (on the files
created by the write test); a self-contained version is sketched below:

for (int j = 0; j < TOTAL_READS; j++) {
    usleep(random time);
    string path = basePath + to_string(random number from 0 to 4 inclusive);
    int fd = open(path, O_RDONLY);
    read(fd, buf, 20M);
    close(fd);
}
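A possible self-contained version of the read test follows. It is a sketch
only: the TOTAL_READS value and the sleep range are illustrative, while the
128 threads and the 5 different 20M files match the setup that triggered the
errors for us (the files created by the write repro above).

#include <iostream>
#include <memory>
#include <random>
#include <string>
#include <thread>
#include <vector>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

std::string basePath = "/mnt/mountpoint/dir";
const int BUF_SIZE = 1048576 * 20;   // 20M, same size as in the write test
const int TOTAL_READS = 10000;       // illustrative value
const int WORKERS = 128;             // 128 threads triggered the issue for us

int main() {
    std::vector<std::thread> ts;
    for (int workerNum = 0; workerNum < WORKERS; workerNum++) {
        ts.emplace_back([]() {
            std::unique_ptr<char[]> buf(new char[BUF_SIZE]);
            std::mt19937 rng(std::random_device{}());
            std::uniform_int_distribution<int> fileDist(0, 4);      // 5 files: dir0..dir4
            std::uniform_int_distribution<int> sleepDist(0, 10000); // up to 10 ms

            for (int j = 0; j < TOTAL_READS; j++) {
                usleep(sleepDist(rng));
                std::string path = basePath + std::to_string(fileDist(rng));

                int fd = open(path.c_str(), O_RDONLY);
                if (fd < 0) {
                    std::cerr << "Open failed! (" << errno << ")" << std::endl;
                    continue;
                }

                // The pseudocode's single read(fd, buf, 20M) is split into a
                // loop because read() may return less than requested.
                ssize_t total = 0;
                while (total < BUF_SIZE) {
                    ssize_t res = read(fd, buf.get() + total, BUF_SIZE - total);
                    if (res <= 0)
                        break;
                    total += res;
                }

                close(fd);
            }
        });
    }

    for (auto& t : ts)
        t.join();
    return 0;
}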