On Wed, 31 Dec 2008, Mike McGrath wrote:
> Let's pool some knowledge together because at this point, I'm missing
> something.
>
> I've been doing all measurements with sar, since bonnie etc. cause
> builds to time out.
>
> Problem: We're seeing slower than normal disk IO. At least I think we
> are. This is a PERC5/E and MD1000 array.
>
> When I try to do a normal copy "cp -adv /mnt/koji/packages /tmp/" I get
> around 4-6 MBytes/s.
>
> When I do a cp of a large file "cp /mnt/koji/out /tmp/" I get
> 30-40 MBytes/s.
>
> When I run "dd if=/dev/sde of=/dev/null" I get around 60-70 MBytes/s read.
>
> If I "cat /dev/sde > /dev/null" I get between 225-300 MBytes/s read.
>
> The above tests are pretty consistent. /dev/sde is a RAID5 array,
> hardware RAID.
>
> So my question here is, wtf? I've been working to do a backup, which I
> would think would either cause network utilization to max out or disk IO
> to max out. I'm not seeing either. Sar says the disks are 100% utilized,
> but I can cause major increases in actual disk reads and writes just by
> running additional commands. Also, if the disks were 100% utilized I'd
> expect to see lots more iowait. We're not seeing that, though; iowait on
> the box is only 0.06% today.
>
> So, long story short, we're seeing much better performance when just
> reading or writing lots of data (though dd is many times slower than
> cat), but with our real-world traffic we're just seeing crappy, crappy
> IO.
>
> Thoughts, theories or opinions? Some of the sysadmin noc guys have
> access to run diagnostic commands, so if you want more info about a
> setting, let me know.
>
> I should also mention there's a lot going on with this box: it's
> hardware RAID with LVM, and I've got Xen running on it (though the tests
> above were not run in a Xen guest).
>

Since we all talked about this quite a bit, I felt the need to let everyone
know the latest status. One of our goals was to lower utilization on the
netapp. While high utilization itself isn't a problem (it's just a
measurement, after all), we decided other problems could be solved if we
could get utilization to go down.

After a bunch of tweaking on the share and in the scripts we run, average
utilization has dropped significantly. Take a look here:

http://mmcgrath.fedorapeople.org/util.html

That's the latest 30-day view (from a couple of days ago). You'll notice it
was around 90-100% pretty much all the time, and it stayed like that for
MONTHS. Even Christmas day was pretty busy, even though we generally saw
low traffic everywhere else in Fedora over that whole period. Now we're
sitting pretty with a 20% utilization average. You'll also notice our
service time and await are generally lower. I'm trying to get a longer view
of those numbers over time, so we'll see whether that's an actual trend or
not.

The big changes?

1) Better use of the share in our scripts.
2) A larger readahead value (blockdev).

Some smaller changes included switching the IO scheduler from cfq to
deadline (and now noop); a rough sketch of the readahead and scheduler
tweaks is at the bottom of this mail.

There are two things I'd still like to do long term:

1) Move our snapshots to different devices to lower our seeks.
2) A full re-index of the filesystem (requiring around 24-36 hours of
   downtime), but I'm going to schedule this sometime after the Alpha
   ships.

-Mike
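For anyone who wants to try the same sort of tweaks, here's a rough sketch.
The device name (/dev/sde) matches the one from the original mail, but the
readahead value we actually settled on isn't in this thread, so treat the
numbers below as illustrative examples rather than our production settings.

  # Check and raise the per-device readahead, in 512-byte sectors.
  # 4096 sectors = 2 MiB; this value is only an example.
  blockdev --getra /dev/sde
  blockdev --setra 4096 /dev/sde

  # Check the current IO scheduler and switch it (cfq -> deadline -> noop).
  cat /sys/block/sde/queue/scheduler
  echo noop > /sys/block/sde/queue/scheduler

Both of these are runtime settings and go away on reboot, so they would
need to be reapplied at boot time, e.g. from rc.local.

On the dd vs cat gap in the original numbers: dd defaults to 512-byte
reads, while cat uses a much larger buffer, which is likely most of the
difference. Giving dd a bigger block size usually brings it back in line,
for example:

  dd if=/dev/sde of=/dev/null bs=1M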