Doll, Margaret Ann wrote:
> On Thu, Jun 27, 2013 at 8:54 AM, Miner, Jonathan W (US SSA) <
> jonathan.w.miner@xxxxxxxxxxxxxx> wrote:
>>
>> > I installed the iozone program and ran ./iozone -a.
>>
>> iozone allows you to benchmark disk performance and gives you objective
>> measurements.
>>
>> > How does this information help me find the offending program?
>>
>> Not sure you're looking for a "program"... I think you know what program
>> is doing the IO on your client machines,
>
> Users are running Gaussian or their own original programs on the compute
> nodes. How does one determine from which node the massive IO requests
> are coming?
>
>> and we know that "nfsd" is doing the IO on the server, and we know from
>> your previous output that you have high IO wait times. So... you should
>> be looking at which disks are involved, and why the wait times are so
>> high.
>>
>> Are you using single drives, software raid, hardware raid? What type of
>> bus?
>
> The head node has a single disk for most users' use. A second disk is
> owned by a single research group which was not involved in the problem.
>
> Most of the compute nodes have a single disk. There are two compute nodes
> that have a second 700 GB drive for use with Gaussian calculations. The
> user that caused the IO problem was using one of these compute nodes and
> obviously not using the scratch space on the compute node.

I have two truly unpleasant thoughts (and yes, we have at least one person
here running Gaussian):

1. What *kind* of h/d are they writing to? They're not, say, WD Caviar
Green? We find we can't use them in some servers (mostly Penguins, with
Supermicro m/b's), because they're "desktop", not "server" drives. The
difference is that around '09, all the manufacturers, following WD's lead,
took out user control of TLER (Time-Limited Error Recovery, I think, is
the acronym - it's how long a head tries before giving up, deciding the
sector is bad, and writing elsewhere): "desktop" drives will go on for up
to 2 min or more(!), while server drives give up in 6 or 7 *seconds*.

2. I can document a serious slowdown with NFS as implemented in RHEL 6 -
and I may try again to file a bugzilla report, this time using our
institutional account rather than just as me. The issue is that if you're
reading and writing to an NFS-mounted drive exported (rw,sync), it's
approximately SEVEN TIMES SLOWER than the same setup in RHEL 5.

You can prove (2) to yourself: on a RHEL 5 server, export a directory,
mount it via NFS on a client, cd to the mounted directory, and tar -xzvf a
large file (we've got one with many directories and files, about a 28 MB
tar.gz, and that takes about a minute or a minute and a half); doing the
same on RHEL 6 (or CentOS 6) takes 6.5 to 7 MINUTES. The same file,
untarred locally, takes a second or two.

Note that exporting (rw,async) improves it a lot... but when you're
running serious scientific computing, and the job may run days, or a week
or more, you've got to be concerned.

        mark

--
redhat-list mailing list
unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
https://www.redhat.com/mailman/listinfo/redhat-list
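
On Margaret's question about finding which compute node the NFS traffic is
coming from: one rough approach - a sketch only, assuming the head node's
NFS-facing interface is eth0 (substitute your real interface and packet
count) - is to sample traffic on the NFS port (2049) on the server and
count packets per client address:

    # On the NFS server: sample 20,000 packets on port 2049 and count how
    # many came from each source address; the busiest client is the likely
    # culprit. "eth0" and the packet count are assumptions - adjust them.
    tcpdump -tnn -c 20000 -i eth0 port 2049 2>/dev/null \
        | awk '{print $2}' \
        | cut -d. -f1-4 \
        | sort | uniq -c | sort -rn | head

Once you know the node, iotop on that node will show which local process
is generating the IO.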
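
And for anyone who wants to reproduce the (rw,sync) comparison in point 2,
a minimal sketch along the lines mark describes - the export path /export,
the mount point /mnt/test, and the tarball /tmp/big.tar.gz are all
hypothetical names - would be:

    # On the NFS server, export with sync first, then async, re-running
    # the client test after each change. /etc/exports entries:
    #   /export   *(rw,sync,no_root_squash)
    #   /export   *(rw,async,no_root_squash)
    # After editing /etc/exports, re-export:
    #   exportfs -ra

    # On a client:
    mkdir -p /mnt/test
    mount -t nfs server:/export /mnt/test
    cd /mnt/test
    time tar -xzvf /tmp/big.tar.gz   # a tarball with many small files/dirs
    cd /
    umount /mnt/test

The wall-clock times reported by "time" for the sync versus async runs,
and for a RHEL 5 versus RHEL 6 server, are the numbers mark is comparing
above.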