File Read Returns Non-existent Null Bytes

I'm currently experiencing an issue where reading from a file that is
known to contain only ASCII characters can result in the process
reading that file seeing NULL bytes (0x00) that never existed in the
file.  This happens for NFSv3 clients when the file is being updated
concurrently, either by another NFSv3 client or directly on the NFS
server.

It's possible this issue is already known, as I've found the following
RHEL knowledge base article talking about it (unfortunately you need
to be logged in to view the full content of the following link):

https://access.redhat.com/solutions/22717

If this is already a known issue and there is relevant discussion of
it, I would appreciate it if someone could point me to it, as I
couldn't find anything (but it's possible I just missed it).

If it isn't a known issue, here are the details of how I've been reproducing:

I have reproduced this on RHEL 6.5 (kernel 2.6.32-431.1.2.0.1.el6) as
well as on Fedora 21 (kernel-3.18.7-200.fc21).

In the RHEL 6.5 case, I am using two systems with the aforementioned
kernel and an Isilon as the NFS server.  In the Fedora 21 case, I am
using just one system, acting as both client and server.

The basic idea is to touch something like 10 files, start the attached
python program (tail.py) "tailing" each of them as it grows, and
finally start updating them either from another NFS client or from the
server directly.  Here is the general set of steps for reproducing
using a single system:

# mkdir -p /export/test
# mkdir -p /mnt/test
# echo '/export/test localhost(rw)' > /etc/exports
# exportfs -fa (make sure nfsd is running, etc)
# mount -t nfs -o vers=3 localhost:/export/test /mnt/test

(start the tailers)
# touch /mnt/test/file{00..09}
# for f in /mnt/test/file{00..09}; do python tail.py ${f} & done &&
cat >/dev/null

(start the updating)
# for f in /export/test/file{00..09}; do while true; do echo 'abc' >>
${f}; sleep .1; done & done

After a little while, you'll see output like the following from the
"tailer" processes:

file08: ['61', '62', '63', '0A', '61', '62', '63', '0A', '61', '62',
'63', '0A', '61', '62', '63', '0A', '00', '00', '00', '00']
file04: ['61', '62', '63', '0A', '61', '62', '63', '0A', '61', '62',
'63', '0A', '61', '62', '63', '0A', '00', '00', '00', '00']
file02: ['61', '62', '63', '0A', '61', '62', '63', '0A', '61', '62',
'63', '0A', '61', '62', '63', '0A', '00', '00', '00', '00']

As you can see, there are NULL bytes at the end that never existed.
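For reference, that hex dump decodes back to exactly the writer's
output plus the phantom NULs:

```python
# Decode one of the hex dumps above: '61' '62' '63' '0A' is "abc\n"
# (one append from the writer loop), and the trailing '00' entries are
# the NUL bytes that should never have been seen.
dump = ['61', '62', '63', '0A', '61', '62', '63', '0A',
        '61', '62', '63', '0A', '61', '62', '63', '0A',
        '00', '00', '00', '00']
data = bytes(int(h, 16) for h in dump)
print(repr(data))   # b'abc\nabc\nabc\nabc\n\x00\x00\x00\x00'
```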

I've done some analysis of this using SystemTap and am happy to share
details if necessary.  I believe the issue is due to concurrent
execution of `do_generic_file_read' (by the process making the read
syscall) and `nfs_update_inode' (by rpciod).  rpciod updates the inode
size, but before it can invalidate the page cache, the process making
the read syscall has already found the page and is using the VFS
inode's new size to determine how much data to copy out of the page
back into userspace.
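To make the suspected interleaving concrete, here is a minimal
userspace sketch (plain Python, not kernel code; the page contents,
sizes, and exact step ordering are illustrative assumptions based on
my traces):

```python
PAGE_SIZE = 16

# Page cache page: only the first 8 bytes were ever filled with file
# data; the remainder of the page is zero-filled, as a freshly
# allocated page would be.
page = bytearray(b"abc\nabc\n") + bytearray(PAGE_SIZE - 8)

i_size = 8    # inode size before the concurrent append

# 1. rpciod (`nfs_update_inode') publishes the new size from the
#    attribute update...
i_size = 12   # the writer appended "abc\n" (4 more bytes)

# 2. ...but before it invalidates the page cache, the reader
#    (`do_generic_file_read') has already found the stale page and uses
#    the *new* i_size to decide how much to copy to userspace.
nbytes = min(i_size, PAGE_SIZE)
data = bytes(page[:nbytes])

print(repr(data))   # b'abc\nabc\n\x00\x00\x00\x00' -- NULs that never existed
```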

Since the VFS layer doesn't take any lock on the inode during reads
(or at least `do_generic_file_read' doesn't), the fact that rpciod
updates the inode under the protection of `i_lock' doesn't help.  My
naive thought was to simply mark the page cache invalid before
updating the inode's size in `nfs_update_inode', but I'm unsure
whether that would have other unwanted effects.
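In pseudocode, the naive reordering I have in mind looks like this
(a hypothetical sketch only; the real `nfs_update_inode' does far
more, and the invalidation step name is illustrative):

```
# current (suspected) ordering in nfs_update_inode:
publish new i_size          # readers can now copy past valid page data
mark page cache invalid     # too late for a reader in the window above

# naive proposed ordering:
mark page cache invalid     # readers revalidate before trusting pages
publish new i_size
```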

Any thoughts or insights appreciated.

tail.py:

#!/usr/bin/env python

import os
import sys
import stat
import time

class Tailer(object):
    """Poll a file for growth and read any newly appended data."""

    def __init__(self, path, stat_delay=.5, bufsize=4096):
        self.path = path
        self.stat_delay = stat_delay
        self.bufsize = bufsize
        self.pos = 0

    def size(self):
        # Return the file's current size according to fstat().
        with open(self.path, 'r') as f:
            fd = f.fileno()
            return os.fstat(fd)[stat.ST_SIZE]

    def read(self):
        # Read from our last position through EOF, advancing self.pos.
        with open(self.path, 'r') as f:
            fd = f.fileno()
            os.lseek(fd, self.pos, os.SEEK_SET)
            data = ""
            while True:
                tmp = os.read(fd, self.bufsize)
                if tmp == "":
                    break
                else:
                    data += tmp
                    self.pos += len(tmp)
        return data

    def run(self):
        # Poll the file; whenever it has grown, read the new data and,
        # if it contains a NUL byte, dump it as hex and exit.
        while True:
            file_size = self.size()
            if file_size > self.pos:
                data = self.read()
                if "\x00" in data:
                    print "%s: %r" % (self.path, ["%02X" % ord(x) for x in data])
                    break
            time.sleep(self.stat_delay)
def main():
    tailer = Tailer(sys.argv[1], bufsize=32768)
    tailer.run()

if __name__ == "__main__":
    main()
