On Fri, Jun 03, 2011 at 11:59:02AM -0400, Paul Anderson wrote:
> On Thu, Jun 2, 2011 at 9:39 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Thu, Jun 02, 2011 at 08:42:47PM -0400, Christoph Hellwig wrote:
> >> On Thu, Jun 02, 2011 at 10:42:46AM -0400, Paul Anderson wrote:
> >> > This morning, I had a symptom of an I/O throughput problem in which
> >> > dirty pages appeared to be taking a long time to write to disk.
> >> >
> >> > The system is a large x64 192GiB Dell 810 server running 2.6.38.5 from
> >> > kernel.org - the basic workload was data intensive - concurrent large
> >> > NFS (with high metadata/low file size), rsync/lftp (with low
> >> > metadata/high file size), all working in a 200TiB XFS volume on a
> >> > software MD RAID0 on top of 7 software MD RAID6 arrays, each w/18
> >> > drives. I had mounted the filesystem with
> >> > inode64,largeio,logbufs=8,noatime.
> >>
> >> A few comments on the setup before trying to analyze what's going on
> >> in detail. I'd absolutely recommend an external log device for this
> >> setup, that is, buy another two fast but small disks, or take two
> >> existing ones and use a RAID 1 for the external log device. This will
> >> speed up anything log intensive, which both the NFS and rsync
> >> workloads generate a lot of.
> >>
> >> Second, split the workloads into multiple volumes if you can, given
> >> you have two such different workloads, so that they don't interfere
> >> with each other.
> >>
> >> Third, a RAID0 on top of RAID6 volumes sounds like pretty much a worst
> >> case for almost any type of I/O. You end up doing even relatively
> >> small I/O to all of the disks in the worst case. I think you'd be much
> >> better off with a simple linear concatenation of the RAID6 devices,
> >> even if you can split them into multiple filesystems.
> >>
> >> > The specific symptom was that 'sync' hung, a dpkg command hung
> >> > (presumably trying to issue fsync), and experimenting with "killall
> >> > -STOP" or "kill -STOP" of the workload jobs didn't let the system
> >> > drain I/O enough to finish the sync. I probably did not wait long
> >> > enough, however.
> >>
> >> It really sounds like you're simply killing the MD setup with a
> >> lot of log I/O that goes to all the devices.
> >
> > And this is one of the reasons why I originally suggested that
> > storage at this scale really should be using hardware RAID with
> > large amounts of BBWC to isolate the backend from such problematic
> > IO patterns.
> >
> > Dave Chinner
> > david@xxxxxxxxxxxxx
>
> Good HW RAID cards are on order - seems to be backordered at least a
> few weeks now at CDW. Got the batteries immediately.
>
> That will give more options for test and deployment.
>
> Not sure what I can do about the log - the man page says xfs_growfs
> doesn't implement log moving. I can rebuild the filesystems, but for
> the one mentioned in this thread, that will take a long time.

Once you have BBWC, the log IO gets aggregated into stripe width writes
to the back end (because it is always sequential IO), so it's generally
not a significant problem for HW RAID subsystems.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
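
For reference, a rough sketch of the rebuild Christoph describes: mirror
two small fast disks for an external log, use a linear concat of the
existing RAID6 arrays instead of RAID0, and make/mount the filesystem
with the external log. All device names below (/dev/sdy, /dev/sdz,
/dev/md1../dev/md7, /dev/md10, /dev/md20, /export) are placeholders, not
the actual devices from this thread, and the sizes are illustrative only:

    # RAID1 of two small, fast disks for the external XFS log
    mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sdy /dev/sdz

    # Linear concatenation of the seven RAID6 arrays instead of RAID0
    mdadm --create /dev/md20 --level=linear --raid-devices=7 \
        /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6 /dev/md7

    # Rebuild the filesystem with the external log (destroys existing data)
    mkfs.xfs -l logdev=/dev/md10,size=128m /dev/md20

    # The logdev mount option is needed on every mount of this filesystem
    mount -o logdev=/dev/md10,inode64,largeio,logbufs=8,noatime /dev/md20 /export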