NFS file size anomaly?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[NOTE: cross-posted to linux-nfs and linux-fsdevel]

I'm seeing some unexpected behavior with NFS and file sizes.

The test cases are from the LTP (Linux Test Project), tests
ftest01, ftest05, and ftest07.  I'll concentrate on ftest01
to explain what I've found.

ftest01 fires off 5 subprocesses, each of which opens an empty
file and does the following, repeatedly:

        . lseek to some point in the file
        . read 2048 bytes
        . lseek back to the same point
        . write 2048 bytes

The "point in the file" is determined by a pseudo-random
sequence.  All such points are on 2048-byte boundaries.

Occasionally, also driven pseudo-randomly, ftest01 will throw
in a call to ftruncate(), truncate(), sync(), or fstat().

With the fstat() calls, the returned .st_size is compared
with the test's expected size for the file, and an error is
declared if they don't match.

What's happening is that, some way into the test, this fstat()
check is failing.  Specifically, the .st_size reported by
fstat() is greater than the computed size.

The sequence of operations leading up to this is:

        lseek 1034240 0
        read 2048
        lseek 0 1
        write 2048

        lseek 638976 0
        (read, lseek, write)

        lseek 708608 0
        (read, lseek, write)

        lseek 708608 0
        (read, lseek, write)

        lseek 679584 0
        (read, lseek, write)

        truncate 266240

        lseek 960512 0
        (read, lseek, write)

        (a bunch of lseek/read/lseek/write ops that do not
         extend the file)

        fstat

So the expected size of the file is 960512 + 2048 == 960560.
But the fstat reports a size of 1036288.

A look at what's happening on the wire, distilled from the
output of tethereal, is instructive.

        READ Call 638976 4096 (byte offset and size to read)
        READ Reply 4096 995382 (bytes read and current file size)

        SETATTR Call 266240 (this corresponds to the truncate() call)

        WRITE Call 638976 4096 (byte offset and size to write)
        WRITE Call 708608 4096
        WRITE Call 1032192 4096

        SETATTR Reply 266240 (current size of file)

        WRITE Reply 643072 (current size of file after write)
        WRITE Reply 1036288
        WRITE Reply 1036288

        GETATTR (initiated internally by NFS code?)

        READ Call 958464 4096 READ Reply 4096 1036288

        ... (a bunch of READ and WRITE ops that do not extend the file)

        GETATTR Call (this corresponds to the fstat() call)
        GETATTR Reply 1036288

So what appears to have happened here is that three of the
WRITE operations that the program issued before the truncate()
call have "bled past" the SETATTR, extending the file further
than the SETATTR did.  Since none of the operations issued
after SETATTR extends the file further, by the time we get to
the GETATTR, the file is larger than the test expects.

There are two strange things going on here.  The first,
identified above, is that write()s that were initiated before
the truncate() call are being processed after the resulting
SETATTR call.  The second is that WRITE operations are being
initiated while the SETATTR is outstanding.

It seems to me that a size-changing SETATTR operation should
act essentially as an I/O barrier. It should wait for all outstanding
read/write requests to complete, then issue the SETATTR,
wait for the reply, and only then re-enable read/write requests.

In other words, SETATTR should be atomic with respect to other
I/O operations.

A git bisect indicates that this problem first appeared (or
was first uncovered) with this commit:

    4f8ad65 writeback: Refactor writeback_single_inode()

It continues to the most recent mainline kernels.

NFS v3 vs. v4 doesn't seem to matter.

Has anyone else seen this?  Any pointers you can provide?

Thanks,
Dan Duval
Oracle Corp.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux