Intermittent data corruption probably in Linux 6.6.52 SMB client

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I have found an intermittent but reliably reproducible data corruption
bug, most probably in the Linux SMB/CIFS kernel client.  It is triggered
by repeating this sequence of operations:

    int fd = open("/smb/server/testfile",
        O_WRONLY|O_CREAT|O_EXCL|O_CLOEXEC, 0666);
    write(fd, buffer1, len1);
    write(fd, buffer2, len2);
    write(fd, buffer3, len3);
    fdatasync(fd);
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);

where `len1`, `len2`, and `len3` are each at least tens of kilobytes
and none of them is a multiple of 512.  When the bug happens, a few
kilobytes of data from the *beginning* of the second write will be
replaced with binary zeroes.  The run of zeroes consistently *ends*
at a multiple of 512.

The bug happens on order of 10% of iterations of the above sequence.
If the writes are too large *or* too small, it happens less often; the
sweet spot for maximum reproducibility seems to be a total file size
of about 1,000,000 bytes.  Omitting the calls to fdatasync and/or
posix_fadvise also makes the corruption happen less often, but it
does still happen.

The reproducer sequence may seem a little odd, but it’s cut down
from the actual behavior of a real application, namely version
1.4.0 of the “borg” backup utility (<https://www.borgbackup.org/>).

The SMB server I’m testing with runs FreeBSD 13 and Samba 4.15.13.
The backing filesystem for the share is ZFS.  Server and client
machines are on the same network segment.  I have not ruled out
the possibility of a server-side bug, but the symptoms seem more
likely to indicate a client bug to me.

I tried enabling debug mode in the SMB client
(echo 7 > /proc/fs/cifs/cifsFYI) but this produced nothing useful
in dmesg, only “CIFS: Status code returned 0x80000006 STATUS_NO_MORE_FILES”
from the directory scan done by my reproducer program.

A self-contained C program that reproduces the bug is attached to this
message.  It has options to control the file size, directories used
for testing, and whether to call fdatasync and posix_fadvise.  When it
detects the bug it will record the difference between the expected and
actual contents of each corrupted file. If you want to try it with other
clients besides the Linux kernel implementation, be advised it does use
some GNU library extensions (asprintf, argp, dirent type macros).

I’m happy to do further testing if that would help.  Please note
however that I use strictly monolithic kernels (CONFIG_MODULES=n)
with ftrace configured out.  Please cc me directly on any replies,
I’m not subscribed to linux-cifs.

zw

# System information

`uname -rsm`: Linux 6.6.52-gentoo x86_64

mount.cifs version: 7.0

/proc/mounts options list for the share:

    rw,nosuid,nodev,relatime,vers=3.1.1,cache=strict,username=storage,
    uid=0,noforceuid,gid=0,noforcegid,addr=10.0.0.247,file_mode=0644,
    dir_mode=0755,soft,nounix,setuids,serverino,mapposix,noperm,
    reparse=nfs,rsize=4194304,wsize=4194304,bsize=1048576,retrans=1,
    echo_interval=60,actimeo=1,closetimeo=1

/proc/fs/cifs/DebugData:

CIFS Version 2.45
Features:,STATS,XATTR,ACL
CIFSMaxBufSize: 16384
Active VFS Requests: 0

Servers:
1) ConnectionId: 0x1 Hostname: [redacted]
ClientGUID: [redacted]
Number of credits: 8190,1,1 Dialect 0x311
Server capabilities: 0x300047
TCP status: 1 Instance: 1
Local Users To Server: 1 SecMode: 0x1 Req On Wire: 0 Net namespace: 4026531840
In Send: 0 In MaxReq Wait: 0
Compression: disabled on mount

    Sessions:
    1) Address: [redacted] Uses: 1 Capability: 0x300047 Session Status: 1
    Security type: RawNTLMSSP  SessionId: 0x5aa5ebec
    User: 0 Cred User: 0

    Shares:
    0) IPC: \\[redacted]\IPC$ Mounts: 1 DevInfo: 0x0 Attributes: 0x0
    PathComponentMax: 0 Status: 1 type: 0 Serial Number: 0x0
    Share Capabilities: None    Share Flags: 0x0
    tid: 0x90dfa700 Maximal Access: 0x1f00a9

    1) \\[redacted]\backup Mounts: 1 DevInfo: 0x20 Attributes: 0x5002f
    PathComponentMax: 255 Status: 1 type: DISK Serial Number: 0x5e804e9e
    Share Capabilities: None Aligned, Partition Aligned,    Share Flags: 0x0
    tid: 0x693853c0 Optimal sector size: 0x200  Maximal Access: 0x1f01ff


    MIDs:
--

/proc/fs/cifs/Stats after several test cycles:

Resources in use
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Operations (MIDs): 0

0 session 0 share reconnects
Total vfs operations: 67762 maximum at one time: 31

Max requests in flight: 365
1) \\[redacted]\backup
SMBs: 80060 since 2024-09-26 13:43:48 UTC
Bytes read: 37424899280  Bytes written: 37561813928
Open files: 0 total (local), 0 open on server
TreeConnects: 434 total 0 failed
TreeDisconnects: 0 total 0 failed
Creates: 8704 total 1 failed
Closes: 8703 total 0 failed
Flushes: 1601 total 0 failed
Reads: 10181 total 0 failed
Writes: 47345 total 0 failed
Locks: 0 total 0 failed
IOCTLs: 1 total 1 failed
QueryDirectories: 124 total 0 failed
ChangeNotifies: 0 total 0 failed
QueryInfos: 2966 total 0 failed
SetInfos: 1 total 0 failed
OplockBreaks: 433 sent 0 failed
/* Self-contained reproducer for a SMB/CIFS data corruption bug apparently
   due to a race in the handling of POSIX_FADV_DONTNEED and/or fdatasync.  */

#define _GNU_SOURCE 1

#include <argp.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

/** splitmix64 random number generator.
   This is a fixed-increment version of Java 8's SplittableRandom generator.
   See http://dx.doi.org/10.1145/2714064.2660195 and
   http://docs.oracle.com/javase/8/docs/api/java/util/SplittableRandom.html
   Code taken almost verbatim from https://prng.di.unimi.it/splitmix64.c

   Used here because it's very small, produces acceptable quality
   randomness for this application, and has no constraints on the
   seed value.  */

static uint64_t splitmix64_next(uint64_t *state)
{
    uint64_t z = (*state += 0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
    z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
    return z ^ (z >> 31);
}

/** Fill the buffer BUF (of length LEN) with random bytes, all of which
    are guaranteed to be printable ASCII.  STATE points to a state vector
    for the random number generator.  */
static void fill_buf(char *buf, size_t len, uint64_t *state)
{
    /* not quite the usual base64 alphabet */
    static const char encoding[64] =
        "\n.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    /* 24 bytes of binary data make 32 bytes of base64 data */
    union {
        uint64_t values[3];
        unsigned char bits[3 * sizeof(uint64_t)];
    } data;

    char *wptr = buf;
    char *wlimit = buf + len;
    unsigned char *rptr;
    unsigned char *rlimit = data.bits + sizeof data.bits;
    uint_least32_t v;

    while (wptr < wlimit) {
        data.values[0] = splitmix64_next(state);
        data.values[1] = splitmix64_next(state);
        data.values[2] = splitmix64_next(state);

        rptr = data.bits;
        while (rptr < rlimit) {
            /* on entry to this loop we are guaranteed that there is
               at least one byte remaining to write and rlimit - rptr
               is a multiple of three */
            v = *rptr++;
            v = *rptr++ | (v << 8);
            v = *rptr++ | (v << 8);

            *wptr++ = encoding[v & 0x3F];
            v >>= 6;
            if (wptr == wlimit) break;
            *wptr++ = encoding[v & 0x3F];
            v >>= 6;
            if (wptr == wlimit) break;
            *wptr++ = encoding[v & 0x3F];
            v >>= 6;
            if (wptr == wlimit) break;
            *wptr++ = encoding[v & 0x3F];
            v >>= 6;
            if (wptr == wlimit) break;
        }
    }
}

/** Get a handle to a directory named NAME, creating it if necessary.
    Does not attempt to create missing parent directories.  */
static int open_or_create_dir(const char *name)
{
    int fd = open(name, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd >= 0)
        return fd;

    if (errno != ENOENT)
        goto fail;

    /* try creating it */
    if (mkdir(name, 0777))
        goto fail;

    /* now try opening it again */
    fd = open(name, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd >= 0)
        return fd;

 fail:
    perror(name);
    return -1;
}


/** Write all LEN bytes of BUF to file descriptor FD.  */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t nwritten = write(fd, buf, len);
        if (nwritten == 0)
            abort(); /* should never happen when writing to a file */
        if (nwritten < 0)
            return -1;
        buf += (size_t)nwritten;
        len -= (size_t)nwritten;
    }
    return 0;
}

/** Read exactly LEN bytes from file descriptor FD to BUF.  */
static int read_all(int fd, char *buf, size_t len)
{
    while (len > 0) {
        ssize_t nread = read(fd, buf, len);
        if (nread == 0)
            abort(); /* should never happen when reading from a file */
        if (nread < 0)
            return -1;
        buf += (size_t)nread;
        len -= (size_t)nread;
    }
    return 0;
}

/** Write a hex dump of DATA (of length LEN) to a new file named
    NAME, in directory DFD.  */
static int write_hexdump(int dfd, const char *name, const char *contents,
                         size_t len)
{
    int fd = openat(dfd, name, O_WRONLY | O_CREAT | O_EXCL, 0666);
    if (fd < 0) {
        fprintf(stderr, "creating %s: %s\n", name, strerror(errno));
        return -1;
    }
    FILE *fp = fdopen(fd, "w");
    if (!fp) {
      perror("fdopen");
      return -1;
    }
    for (size_t i = 0; i < len; i++) {
        if (i % 16 == 0) {
            if (i != 0)
                fputc('\n', fp);
            fprintf(fp, "%08zx: ", i);
        }
        if (i % 16 == 8)
            fputc(' ', fp);
        fprintf(fp, " %02x", (unsigned int)(unsigned char)contents[i]);
    }
    fputc('\n', fp);
    if (ferror(fp) || fflush(fp)) {
        fprintf(stderr, "%s: write error: %s\n", name, strerror(errno));
        fclose(fp);
        return -1;
    }
    if (fclose(fp)) {
        fprintf(stderr, "%s: write error: %s\n", name, strerror(errno));
        return -1;
    }
    return 0;
}

/** Create a file of size LEN, in the directory referred to by DFD,
    containing randomly chosen but predictable data.  This is the
    SEQ'th file to be created. The seed for the random number
    generator can be recovered from the file name.  Write the file's
    contents using a sequence of operations that was observed to
    trigger data corruption when the file is stored on a SMB/CIFS file
    system (potentially also elsewhere).  */
static int create_one_file_provoking_bug(int dfd, uint32_t seq, uint32_t len,
                                         bool do_fdatasync, bool do_fadvise)
{
    uint64_t state = ((uint64_t)seq) << 32 | len;

    char *name = 0;
    if (asprintf(&name, "sfab-%08x-%08x", seq, len) < 0) {
        perror("malloc");
        return -1;
    }
    char *contents = malloc(len);
    if (!contents) {
        perror("malloc");
        free(name);
        return -1;
    }

    fill_buf(contents, len, &state);

    /* Here is the troublesome sequence of operations.  We divide the
       file into three chunks, comprising roughly 40%, 58%, and 2% of
       the data, with both split points being slightly before a 512-byte
       boundary.  Each chunk is written separately.  Then we call
       fdatasync(), posix_fadvise(POSIX_FADV_DONTNEED), and close()
       on the file, in that order.  When the bug triggers, all operations
       report success but some of the data from the beginning of the
       second write is replaced with binary zeroes. */

    uint64_t offsetbits = splitmix64_next(&state);
    size_t chunksize_1 = (size_t)(len * 0.40) & ~(size_t)0x1FF;
    size_t chunksize_2 = (size_t)(len * 0.58) & ~(size_t)0x1FF;

    chunksize_1 -= (offsetbits >> 11) & 0x1F;
    chunksize_2 -= (offsetbits >> 17) & 0x1F;

    size_t chunksize_3 = (size_t)len - (chunksize_1 + chunksize_2);

    if (chunksize_1 > (size_t)len
        || chunksize_2 > (size_t)len
        || chunksize_3 > (size_t)len) {
      fprintf(stderr, "file size %u is too small to be split into chunks\n",
              len);
      exit(1);
    }

    int fd = openat(dfd, name, O_WRONLY | O_CREAT | O_EXCL | O_CLOEXEC, 0666);
    if (fd < 0) {
        fprintf(stderr, "creating %s: %s\n", name, strerror(errno));
        free(contents);
        free(name);
        return -1;
    }

    int err = write_all(fd, contents, chunksize_1);
    if (err) goto out;
    err = write_all(fd, contents + chunksize_1, chunksize_2);
    if (err) goto out;
    err = write_all(fd, contents + chunksize_1 + chunksize_2, chunksize_3);
    if (err) goto out;
    if (do_fdatasync) {
        err = fdatasync(fd);
        if (err) goto out;
    }
    if (do_fadvise) {
        err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    }
 out:;
    int errcode = errno;
    if (close(fd)) {
        err = -1;
        if (!errcode)
            errcode = errno;
    }
    if (err == -1)
        fprintf(stderr, "filling %s: %s\n", name, strerror(errcode));
    free(contents);
    free(name);
    return err;
}

/** Verify the contents of a file created by create_one_file_provoking_bug.
    The return value is 0 if the file is correct, 1 if it is the wrong size
    or has the wrong contents, and -1 if a system error occurred; in all
    failure cases an error is printed.  */
static int verify_one_file(int dfd, const char *name, int log_dfd)
{
    int fd = openat(dfd, name, O_RDONLY | O_CLOEXEC);
    if (fd < 0) {
        /* the d_type filter in main isn't guaranteed to work,
           we might have tried to open a directory */
        if (errno == EISDIR) {
            return 0;
        }
        fprintf(stderr, "reopening %s: %s\n", name, strerror(errno));
        return -1;
    }

    struct stat st;
    if (fstat(fd, &st)) {
        perror("fstat");
        close(fd);
        return -1;
    }
    /* the d_type filter in main isn't guaranteed to work, we might
       have opened a non-file */
    if (!S_ISREG(st.st_mode)) {
        close(fd);
        return 0;
    }

    /* by assumption the name was constructed by this program, so we
       can probably get away with using sscanf to parse values from it */
    uint32_t seq, len;
    if (sscanf(name, "sfab-%x-%x", &seq, &len) != 2) {
        fprintf(stderr, "failed to parse filename '%s'\n", name);
        return -1;
    }

    if (st.st_size != (off_t)len) {
        fprintf(stderr, "%s: unexpected size %lu - should be %u\n",
                name, st.st_size, len);
        close(fd);
        return 1; /* corruption detected */
    }

    char *exp_contents = malloc(len);
    char *act_contents = malloc(len);
    if (!exp_contents || !act_contents) {
        perror("malloc");
        return -1;
    }

    if (read_all(fd, act_contents, len)) {
        fprintf(stderr, "reading %s: %s\n", name, strerror(errno));
        close(fd);
        free(act_contents);
        free(exp_contents);
        return -1;
    }
    close(fd);

    uint64_t state = ((uint64_t)seq) << 32 | len;
    fill_buf(exp_contents, len, &state);

    if (memcmp(exp_contents, act_contents, len) == 0) {
        free(act_contents);
        free(exp_contents);
        return 0;
    }

    fprintf(stderr, "%s: mismatch detected\n", name);

    char *act_name;
    if (asprintf(&act_name, "%s.act", name) < 0) {
        perror("malloc");
        free(act_contents);
        free(exp_contents);
        return -1;
    }

    char *exp_name;
    if (asprintf(&exp_name, "%s.exp", name) < 0) {
        perror("malloc");
        free(act_name);
        free(act_contents);
        free(exp_contents);
        return -1;
    }

    char *diff_name;
    if (asprintf(&diff_name, "%s.diff", name) < 0) {
        perror("malloc");
        free(exp_name);
        free(act_name);
        free(act_contents);
        free(exp_contents);
        return -1;
    }

    if (write_hexdump(log_dfd, exp_name, exp_contents, len)
        || write_hexdump(log_dfd, act_name, act_contents, len)) {
        free(diff_name);
        free(act_name);
        free(exp_name);
        free(act_contents);
        free(exp_contents);
        return -1;
    }
    free(act_contents);
    free(exp_contents);

    fflush(0);

    pid_t child = fork();
    if (child == -1) {
        perror("fork");
        free(diff_name);
        free(act_name);
        free(exp_name);
        return -1;
    }
    if (child == 0) {
        if (fchdir(log_dfd)) {
            perror("fchdir");
            fflush(stderr);
            _exit(126);
        }

        int diff_fd = openat(log_dfd, diff_name,
                             O_WRONLY | O_CREAT | O_EXCL, 0666);
        if (diff_fd == -1) {
            fprintf(stderr, "creating %s: %s\n", diff_name, strerror(errno));
            fflush(stderr);
            _exit(126);
        }
        dup2(diff_fd, 1);
        close(diff_fd);
        execlp("diff", "diff", "-u", exp_name, act_name, (char *)0);
        perror("exec diff");
        fflush(stderr);
        _exit(127);
    }

    free(diff_name);
    free(act_name);
    free(exp_name);

    int status;
    if (waitpid(child, &status, 0) != child) {
        perror("waitpid");
        return -1;
    }
    if (WIFSIGNALED(status)) {
        fprintf(stderr, "diff: %s\n", strsignal(WTERMSIG(status)));
        return -1;
    }
    if (!WIFEXITED(status)) {
        fprintf(stderr, "diff: bizarre wait status %04x\n", status);
        return -1;
    }
    /* because the two files are known to be different, diff should've
       exited with status 1 */
    if (WEXITSTATUS(status) != 1) {
        fprintf(stderr, "diff: unexpected exit status %d\n",
                WEXITSTATUS(status));
        return -1;
    }
    return 1; /* corruption detected */
}

struct cmdline_args
{
    const char *test_dir;
    const char *log_dir;
    uint32_t n_bytes;
    uint32_t n_files;
    bool do_fdatasync;
    bool do_fadvise;
};

static uint32_t cmdline_parse_u32(const char *arg, int key,
                                  struct argp_state *state)
{
    unsigned long val;
    char *endp;
    errno = 0;
    val = strtoul(arg, &endp, 10);
    if (endp == arg || *endp != '\0') {
        argp_error(
            state,
            "invalid argument to -%c: '%s' (must be a decimal integer)",
            key,
            arg
        );
        return 0;
    }
    if (errno || val > (unsigned long)UINT32_MAX) {
        argp_error(
            state,
            "invalid argument to -%c: '%s' (too large, 32 bits max)",
            key,
            arg
        );
        return 0;
    }
    if (val == 0) {
        argp_error(
            state,
            "invalid argument to -%c: '%s' (must be nonzero)",
            key,
            arg
        );
        return 0;
    }
    return (uint32_t) val;
}

static error_t cmdline_parser(int key, char *arg, struct argp_state *state)
{
    struct cmdline_args *args = (struct cmdline_args *)state->input;
    switch (key) {
    case 's':
        args->n_bytes = cmdline_parse_u32(arg, key, state);
        return 0;

    case 'c':
        args->n_files = cmdline_parse_u32(arg, key, state);
        return 0;

    case 'D':
        args->do_fdatasync = false;
        return 0;

    case 'A':
        args->do_fadvise = false;
        return 0;

    case ARGP_KEY_ARG:
        if (state->arg_num == 0) {
            args->test_dir = arg;
            return 0;
        } else if (state->arg_num == 1) {
            args->log_dir = arg;
            return 0;
        } else {
            argp_error(state, "too many positional arguments");
            return ARGP_ERR_UNKNOWN;
        }

    case ARGP_KEY_SUCCESS:
        if (!args->test_dir) {
            argp_error(state, "TEST-DIR argument is required");
            return EINVAL;
        }
        return 0;

    default:
        return ARGP_ERR_UNKNOWN;
    }
    /* not reached */
}

static const struct argp_option cmdline_options[] = {
    { "size", 's', "KBYTES", 0,
      "Size of each test file in bytes (default: 1,000,000)", 1 },
    { "count", 'c', "N-FILES", 0,
      "Number of test files to create (default: 100)", 1 },
    { "no-fdatasync", 'D', 0, 0,
      "Don't call fdatasync() on each test file", 2 },
    { "no-fadvise", 'A', 0, 0,
      "Don't call fadvise(DONTNEED) on each test file", 3 },
    { 0, 0, 0, 0, 0, 0 },
};

static const struct argp cmdline_argp = {
    cmdline_options,
    cmdline_parser,
    "test-dir [log-dir]",
    "Attempt to provoke a data corruption bug originally observed with SMB."
    "\v"
    "Test files will be created in TEST-DIR using an unusual sequence of"
    " operations. When the bug is present, some of these files will have"
    " some of their data replaced with binary zeroes. Analysis of each"
    " corrupted file will be written to LOG-DIR, which defaults to the"
    " current directory. TEST-DIR and LOG-DIR should either not exist"
    " or be empty. See the source code for more detail.",
    0, 0, 0,
};

int main(int argc, char **argv)
{
    struct cmdline_args args = {
        NULL,        // no default for test directory
        ".",         // logs written to current directory
        1000 * 1000, // default file size 1 *decimal* MB
        100,         // default create 100 files
        true,        // default do call fdatasync
        true,        // default do call fadvise
    };

    error_t err = argp_parse(&cmdline_argp, argc, argv, 0, 0, &args);
    if (err) {
        fprintf(stderr, "argp_parse: %s\n", strerror(err));
        return 2;
    }

    int test_dfd = open_or_create_dir(args.test_dir);
    if (test_dfd == -1)
        return 2;

    int log_dfd = open_or_create_dir(args.log_dir);
    if (log_dfd == -1)
        return 2;

    for (unsigned int f = 0; f < args.n_files; f++) {
        if (create_one_file_provoking_bug(
            test_dfd,
            f,
            args.n_bytes,
            args.do_fdatasync,
            args.do_fadvise
        ))
            return 2;
    }

    DIR *dp = fdopendir(test_dfd);
    struct dirent *d;
    int any_corrupt = 0;
    while ((d = readdir(dp)) != 0) {
#if defined DT_REG && defined DT_UNKNOWN
        if (d->d_type != DT_REG && d->d_type != DT_UNKNOWN)
            continue;
#endif
        if (strncmp("sfab-", d->d_name, sizeof "sfab-" - 1))
            continue;
        int this_corrupt = verify_one_file(test_dfd, d->d_name, log_dfd);
        if (this_corrupt == -1)
            return 2;
        if (this_corrupt != 0)
            any_corrupt = 1;
    }
    close(log_dfd);
    closedir(dp);
    return any_corrupt;
}

[Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux