On Thu, Oct 27, 2016 at 1:03 AM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
> On Wed, Oct 26, 2016 at 9:35 AM, Matt Garman <matthew.garman@xxxxxxxxx> wrote:
>> On Tue, Oct 25, 2016 at 7:22 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
>>> Again, no machine on the internal network that my 2 CentOS hosts are
>>> on is connected to the internet. I have no way to download anything.
>>> There is an onerous and protracted process to get files into the
>>> internal network and I will see if I can get netperf in.
>>
>> Right, but do you have physical access to those machines? Do you have
>> physical access to the machine on which you use PuTTY to connect
>> to those machines? If yes to either question, then you can use
>> another system (that does have Internet access) to download the files
>> you want, put them on a USB drive (or burn to a CD, etc), and bring
>> the USB/CD to the C6/C7/PuTTY machines.
>
> This site is locked down like no other I have ever seen. You cannot
> bring anything into the site - no computers, no media, no phone. You
> have to empty your pockets and go through an airport type naked body
> scan.
>
>> There's almost always a technical way to get files on to (or out of) a
>> system. :) Now, your company might have *policies* that forbid
>> skirting around the technical measures that are in place.
>
> This is my client's client, and even if I could circumvent their
> policy I would not do that. They have a zero tolerance policy and if
> you are caught violating it you are banned for life from the company.
> And that would not make my client happy.
>
>> Here's another way you might be able to test network connectivity
>> between C6 and C7 without installing new tools: see if both machines
>> have "nc" (netcat) installed. I've seen this tool referred to as "the
>> swiss army knife of network testing tools", and that is indeed an apt
>> description. So if you have that installed, you can hit up the web
>> for various examples of its use.
>> It's designed to be easily scripted,
>> so you can write your own tests, and in theory implement something
>> similar to netperf.
>>
>> OK, I just thought of another "poor man's" way to at least do some
>> sanity testing between C6 and C7: scp. First generate a huge file.
>> General rule of thumb is at least 2x the amount of RAM in the C7 host.
>> You could create a tarball of /usr, for example (e.g. "tar czvf
>> /tmp/bigfile.tar.gz /usr" assuming your /tmp partition is big enough
>> to hold this). Then, first do this: "time scp /tmp/bigfile.tar.gz
>> localhost:/tmp/bigfile_copy.tar.gz". This will literally make a copy
>> of that big file, but will route through most of the network stack.
>> Make a note of how long it took. And also be sure your /tmp partition
>> is big enough for two copies of that big file.
>>
>> Now, repeat that, but instead of copying to localhost, copy to the C6
>> box. Something like: "time scp /tmp/bigfile.tar.gz <IP address of C6
>> host>:/tmp/". Does the time reported differ greatly from when you
>> copied to localhost? I would expect them to be reasonably close.
>> (And this is another reason why you want a fairly large file, so the
>> transfer time is dominated by actual file transfer, rather than the
>> overhead.)
>>
>> Lastly, do the reverse test: log in to the C6 box, and copy the file
>> back to C7, e.g. "time scp /tmp/bigfile.tar.gz <IP of C7
>> host>:/tmp/bigfile_copy2.tar.gz". Again, the time should be
>> approximately the same for all three transfers. If either or both of
>> the latter two copies take dramatically longer than the first, then
>> there's a good chance something is askew with the network config
>> between C6 and C7.
>>
>> Oh... all this time I've been jumping to fancy tests. Have you tried
>> the simplest form of testing, that is, doing by hand what your scripts
>> do automatically? In other words, simply try copying files between C6
>> and C7 using the existing NFS config?
>> Can you manually trigger the
>> errors/timeouts you initially posted? Is it when copying lots of
>> small files? Or when you copy a single huge file? Any kind of file
>> copying "profile" you can determine that consistently triggers the
>> error? That could be another clue.
>
> These are all good debugging techniques, and I have tried some of
> them, but I think the issue is load related. There are 50 external
> machines ftp-ing to the C7 server, 24/7, thousands of files a day. And
> on the C6 client the script that processes them is running
> continuously. It will sometimes run for 7 hours then hang, but it has
> run for as long as 3 days before hanging. I have never been able to
> reproduce the errors/hanging situation manually.
>
> And again, this is only at this site. We have the same software
> deployed at 10 different sites all doing the same thing, and it all
> works fine at all of those.

Well I spoke too soon. The importer (the one that was initially hanging
that I came here to fix) hung up after running 20 hours. There were no
NFS errors or messages on either the client or the server. When I
restarted it, it hung after 1 minute. I restarted it again and it hung
after 20 seconds. After that, when I restarted it, it hung immediately.
Still no NFS errors or messages. I tried running the process on the
server and it worked fine. So I have to believe this is related to
nobarrier. Tomorrow I will try removing that setting, but I am no
closer to solving this and I have to leave Japan Saturday :-(

The bad disk still has not been replaced - that is supposed to happen
tomorrow, but I won't have enough time after that to draw any
conclusions.
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
https://lists.centos.org/mailman/listinfo/centos
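
Since nothing can be installed on these hosts, the netcat-style test suggested earlier in the thread could perhaps be approximated with only the Python standard library. What follows is a rough sketch, not a replacement for netperf: the function names (receive_all, send_bytes, loopback_check), the chunk size and the 8 MB default are all made up for illustration, and it is only an assumption that the stock Python on C6 and C7 will run it (it avoids newer syntax for that reason). It pushes zeros over a TCP socket and times the transfer, so it exercises the network path only and never touches disk or NFS.

```python
import socket
import threading
import time

CHUNK = 64 * 1024  # bytes per send/recv call (arbitrary choice)


def receive_all(listener, result):
    """Accept one connection and count bytes until the sender closes it."""
    conn, _addr = listener.accept()
    total = 0
    try:
        while True:
            data = conn.recv(CHUNK)
            if not data:  # empty read means the peer closed the connection
                break
            total += len(data)
    finally:
        conn.close()
    result["bytes"] = total


def send_bytes(host, port, total_bytes):
    """Send total_bytes of zeros to host:port and return elapsed seconds.

    The clock stops when the last chunk has been handed to the kernel,
    so very small transfers will look faster than they really are.
    """
    payload = b"\0" * CHUNK
    sock = socket.create_connection((host, port))
    start = time.time()
    try:
        remaining = total_bytes
        while remaining > 0:
            n = min(remaining, CHUNK)
            sock.sendall(payload[:n])
            remaining -= n
    finally:
        sock.close()
    return time.time() - start


def loopback_check(total_bytes=8 * 1024 * 1024):
    """Baseline run: sender and receiver in one process over 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    result = {}
    receiver = threading.Thread(target=receive_all, args=(listener, result))
    receiver.start()
    elapsed = send_bytes("127.0.0.1", port, total_bytes)
    receiver.join()
    listener.close()
    return result["bytes"], elapsed


if __name__ == "__main__":
    received, elapsed = loopback_check()
    megabytes = received / (1024.0 * 1024.0)
    print("received %.1f MB in %.3f s" % (megabytes, elapsed))
```

The loopback run plays the same role as the scp-to-localhost baseline. For the C7-to-C6 and C6-to-C7 directions, the idea would be to run receive_all on one host (with the listener bound to that host's address and a fixed port) and send_bytes from the other, then compare the timings. Whether an arbitrary TCP port is reachable between the two hosts at this site is something I am assuming, not something I know.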