Re: nfs-backed mmap file results in 1000s of WRITEs per second

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> > Neil Brown posted a patch couple days ago for this!
> > 
> > http://thread.gmane.org/gmane.linux.nfs/58473
> 
> I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> exhibited the same 1000s of WRITEs/sec problem.
> 
> Any other ideas?

Yes. Please try the attached patch.

> > Regards, Malahal.
> > 
> > Quentin Barnes [qbarnes@xxxxxxxxx] wrote:
> > > If two (or more) processes are doing nothing more than writing to
> > > the memory addresses of an mmapped shared file on an NFS mounted
> > > file system, it results in the kernel scribbling WRITEs to the
> > > server as fast as it can (1000s per second) even while no syscalls
> > > are going on.
> > > 
> > > The problems happens on NFS clients mounting NFSv3 or NFSv4.  I've
> > > reproduced this on the 3.11 kernel, and it happens as far back as
> > > RHEL6 (2.6.32 based), however, it is not a problem on RHEL5 (2.6.18
> > > based).  (All x86_64 systems.)  I didn't try anything in between.
> > > 
> > > I've created a self-contained program below that will demonstrate
> > > the problem (call it "t1").  Assuming /mnt has an NFS file system:
> > > 
> > >   $ t1 /mnt/mynfsfile 1    # Fork 1 writer, kernel behaves normally
> > >   $ t1 /mnt/mynfsfile 2    # Fork 2 writers, kernel goes crazy WRITEing
> > > 
> > > Just run "watch -d nfsstat" in another window while running the two
> > > writer test and watch the WRITE count explode.
> > > 
> > > I don't see anything particularly wrong with what the example code
> > > is doing with its use of mmap.  Is there anything undefined about
> > > the code that would explain this behavior, or is this a NFS bug
> > > that's really lived this long?
> > > 
> > > Quentin
> > > 
> > > 
> > > 
> > > #include <sys/stat.h>
> > > #include <sys/mman.h>
> > > #include <sys/stat.h>
> > > #include <sys/wait.h>
> > > #include <errno.h>
> > > #include <fcntl.h>
> > > #include <stdio.h>
> > > #include <stdlib.h>
> > > #include <signal.h>
> > > #include <string.h>
> > > #include <unistd.h>
> > > 
> > > int
> > > kill_children()
> > > {
> > > 	int		cnt = 0;
> > > 	siginfo_t	infop;
> > > 
> > > 	signal(SIGINT, SIG_IGN);
> > > 	kill(0, SIGINT);
> > > 	while (waitid(P_ALL, 0, &infop, WEXITED) != -1) ++cnt;
> > > 
> > > 	return cnt;
> > > }
> > > 
> > > void
> > > sighandler(int sig)
> > > {
> > > 	printf("Cleaning up all children.\n");
> > > 	int cnt = kill_children();
> > > 	printf("Cleaned up %d child%s.\n", cnt, cnt == 1 ? "" : "ren");
> > > 
> > > 	exit(0);
> > > }
> > > 
> > > int
> > > do_child(volatile int *iaddr)
> > > {
> > > 	while (1) *iaddr = 1;
> > > }
> > > 
> > > int
> > > main(int argc, char **argv)
> > > {
> > > 	const char	*path;
> > > 	int		fd;
> > > 	ssize_t		wlen;
> > > 	int		*ip;
> > > 	int		fork_count = 1;
> > > 
> > > 	if (argc == 1) {
> > > 		fprintf(stderr, "Usage: %s {filename} [fork_count].\n",
> > > 			argv[0]);
> > > 		return 1;
> > > 	}
> > > 
> > > 	path = argv[1];
> > > 
> > > 	if (argc > 2) {
> > > 		int fc = atoi(argv[2]);
> > > 		if (fc >= 0)
> > > 			fork_count = fc;
> > > 	}
> > > 
> > > 	fd = open(path, O_CREAT|O_TRUNC|O_RDWR|O_APPEND, S_IRUSR|S_IWUSR);
> > > 	if (fd < 0) {
> > > 		fprintf(stderr, "Open of '%s' failed: %s (%d)\n",
> > > 			path, strerror(errno), errno);
> > > 		return 1;
> > > 	}
> > > 
> > > 	wlen = write(fd, &(int){0}, sizeof(int));
> > > 	if (wlen != sizeof(int)) {
> > > 		if (wlen < 0)
> > > 			fprintf(stderr, "Write of '%s' failed: %s (%d)\n",
> > > 				path, strerror(errno), errno);
> > > 		else
> > > 			fprintf(stderr, "Short write to '%s'\n", path);
> > > 		return 1;
> > > 	}
> > > 
> > > 	ip = (int *)mmap(NULL, sizeof(int), PROT_READ|PROT_WRITE,
> > > 			   MAP_SHARED, fd, 0);
> > > 	if (ip == MAP_FAILED) {
> > > 		fprintf(stderr, "Mmap of '%s' failed: %s (%d)\n",
> > > 			path, strerror(errno), errno);
> > > 		return 1;
> > > 	}
> > > 
> > > 	signal(SIGINT, sighandler);
> > > 
> > > 	while (fork_count-- > 0) {
> > > 		switch(fork()) {
> > > 		case -1:
> > > 			fprintf(stderr, "Fork failed: %s (%d)\n",
> > > 				strerror(errno), errno);
> > > 			kill_children();
> > > 			return 1;
> > > 		case 0:   /* child  */
> > > 			signal(SIGINT, SIG_DFL);
> > > 			do_child(ip);
> > > 			break;
> > > 		default:  /* parent */
> > > 			break;
> > > 		}
> > > 	}
> > > 
> > > 	printf("Press ^C to terminate test.\n");
> > > 	pause();
> > > 
> > > 	return 0;
> > > }
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > 
> 
> Quentin
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@xxxxxxxxxx
www.netapp.com
From 903ebaeefae78e6e03f3719aafa8fd5dd22d3288 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
Date: Thu, 5 Sep 2013 15:52:51 -0400
Subject: [PATCH] NFS: Don't check lock owner compatibility in writes unless
 file is locked

If we're doing buffered writes, and there is no file locking involved,
then we don't have to worry about whether or not the lock owner information
is identical.
By relaxing this check, we ensure that fork()ed child processes can write
to a page without having to first sync dirty data that was written
by the parent to disk.

Signed-off-by: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
---
 fs/nfs/write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 40979e8..ac1dc33 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -863,7 +863,7 @@ int nfs_flush_incompatible(struct file *file, struct page *page)
 			return 0;
 		l_ctx = req->wb_lock_context;
 		do_flush = req->wb_page != page || req->wb_context != ctx;
-		if (l_ctx) {
+		if (l_ctx && ctx->dentry->d_inode->i_flock != NULL) {
 			do_flush |= l_ctx->lockowner.l_owner != current->files
 				|| l_ctx->lockowner.l_pid != current->tgid;
 		}
-- 
1.8.3.1


[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux