On Mon, Oct 24, 2016 at 7:51 AM, mark <m.roth@xxxxxxxxx> wrote:
> On 10/24/16 03:52, Larry Martell wrote:
>>
>> On Fri, Oct 21, 2016 at 11:42 AM, <m.roth@xxxxxxxxx> wrote:
>>>
>>> Larry Martell wrote:
>>>>
>>>> On Fri, Oct 21, 2016 at 11:21 AM, <m.roth@xxxxxxxxx> wrote:
>>>>>
>>>>> Larry Martell wrote:
>>>>>>
>>>>>> We have 1 system running Centos7 that is the NFS server. There are 50
>>>>>> external machines that FTP files to this server fairly continuously.
>>>>>>
>>>>>> We have another system running Centos6 that mounts the partition the
>>>>>> files are FTP-ed to using NFS.
>
> <snip>
>>>>>
>>>>> What filesystem?
>
> <snip>
>>>
>>> cat /etc/fstab on the systems, and see what they are. If either is xfs,
>>> and assuming that the systems are on UPSes, then the fstab which controls
>>> drive mounting on a system should have, instead of "defaults",
>>> nobarrier,inode64.
>>
>> The server is xfs (the client is nfs). The server does have inode64
>> specified, but not nobarrier.
>>
>>> Note that the inode64 is relevant if the filesystem is > 2TB.
>>
>> The file system is 51TB.
>>
>>> The reason I say this is that when we started rolling out CentOS 7, we
>>> tried to put one of our user's home directory on one, and it was a
>>> disaster. 100% repeatedly, untarring a 100M tarfile onto an nfs-mounted
>>> drive took seven minutes, where before, it had taken 30 seconds. Timed.
>>> It took us months to discover that NFS 4 tries to make transactions
>>> atomic, which is fine if you're worrying about losing power or
>>> connectivity. If you're on a UPS, and hardwired, adding the nobarrier
>>> immediately brought it down to 40 seconds or so.
>>
>> We are not seeing a performance issue - do you think nobarrier would
>> help with our lock up issue? I wanted to try it but my client did not
>> want me to make any changes until we got the bad disk replaced.
>> Unfortunately that will not happen until Wednesday.
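
For anyone following along, here is a rough Python sketch of the fstab check suggested above: it lists the xfs entries that lack nobarrier or inode64. The function name, the sample devices and the mount points are all made up for illustration, and it only looks at the options column of the file, not at what is actually mounted.

```python
def xfs_entries_missing_options(fstab_text, wanted=("nobarrier", "inode64")):
    """Return {mount_point: [missing options]} for xfs entries in fstab text."""
    missing = {}
    for line in fstab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # fstab fields: device, mount point, fs type, options, dump, pass.
        fields = line.split()
        if len(fields) < 4 or fields[2] != "xfs":
            continue
        options = fields[3].split(",")
        absent = [opt for opt in wanted if opt not in options]
        if absent:
            missing[fields[1]] = absent
    return missing


# Hypothetical fstab contents; device names and mount points are invented.
SAMPLE_FSTAB = """\
# /etc/fstab
/dev/mapper/vg0-root  /      xfs   defaults           0 0
/dev/sdb1             /data  xfs   inode64            0 0
/dev/sdc1             /fast  xfs   nobarrier,inode64  0 0
/dev/sda1             /boot  ext4  defaults           0 0
"""

if __name__ == "__main__":
    for mount_point, absent in xfs_entries_missing_options(SAMPLE_FSTAB).items():
        print(mount_point, "is missing:", ",".join(absent))
```

On a real machine the text would come from reading /etc/fstab instead of the sample string. Whether nobarrier is appropriate still depends on the UPS assumption stated above.
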
>
> Absolutely add nobarrier, and see what happens.

Finally got to add nobarrier (I'll skip why it took so long), and it
looks like this just caused the problem to morph a bit.

On the C7 NFS server, besides having 50 external machines ftp-ing files
to it, we run 2 jobs: one that moves files around (called image_mover)
and one that changes perms on some files (called chmod_job).

On the C6 NFS client, besides the job that was hanging (called the
importer), we also run another job (called ftp_job) that ftps files to
the C6 machine.

The ftp_job had never hung before. Now the importer that used to hang
has not (yet) hung, and the ftp_job that had not hung before is hanging.
The system messages are different, too.

On the C7 server there is a series of messages of the form 'task blocked
for >120 seconds', each with a stack trace. There is one for each of the
following:

  nfsd
  chmod_job
  kworker
  pure_ftpd
  image_mover

In each of the stack traces, the task is blocked on either nfs_write or
nfs_flush.

On the C6 client there is a similar blocked message for the ftp_job,
blocked on nfs_flush, followed by the bad sequence number message I had
seen before, and at that point the ftp_job hung.
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
https://lists.centos.org/mailman/listinfo/centos