[ please word wrap your emails at 68-72 columns ]

On Sat, Oct 18, 2014 at 01:16:58PM -0500, Stan Hoeppner wrote:
> On 10/18/2014 01:03 AM, Stan Hoeppner wrote:
> > On 10/09/2014 04:13 PM, Dave Chinner wrote:
> > ...
> >>> I'm told we have 800 threads writing to nearly as many
> >>> files concurrently on a single XFS on a 12+2 spindle RAID6
> >>> LUN. Achieved data rate is currently ~300 MiB/s. Some of
> >>> these files are supposedly being written at a rate of only
> >>> 32KiB every 2-3 seconds, while some (two) are ~50 MiB/s. I
> >>> need to determine how many bytes we're writing to each of
> >>> the low rate files, and how many files, to figure out RMW
> >>> mitigation strategies. Out of the apparent 800 streams 700
> >>> are these low data rate suckers, one stream writing per
> >>> file.
> >>>
> >>> Nary a stock RAID controller is going to be able to
> >>> assemble full stripes out of these small slow writes. With
> >>> a 768 KiB stripe that's what, 24 IOs at 2 seconds per 32
> >>> KiB IO - ~48 seconds to fill a stripe?
> >>
> >> Raid controllers don't typically have the resources to track
> >> hundreds of separate write streams at a time. Most don't
> >> have the memory available to track that many active write
> >> streams, and those that do probably can't prioritise
> >> writeback sanely given how slowly most cachelines would be
> >> touched. The fast writers would simply turn over the slower
> >> writers' caches way too quickly.
> >>
> >> Perhaps you need to change the application to make the slow
> >> writers buffer stripe sized writes in memory and flush them
> >> 768k at a time...
> >
> > All buffers are now 768K multiples--6144, 768, 768, and I'm
> > told the app should be writing out full buffers. However I'm
> > not seeing the throughput increase I should given the amount
> > that the RMWs should have decreased, which, if my math is
> > correct,

Maybe that's not your problem. What's the storage array tell you
about RMW cycles? What's it tell you about lun utilisation - is
it even, or do you have hot luns?

> > should be about half (80) the raw actuator seek rate of these
> > drives (7.2k SAS).

Not all drives seek at the same rate. Typically for a RAID 6
array, every disk you add to the width of the lun slows the seek
rate for full stripe writes by 2-3%. So a 12+2 lun is going to
have an average seek rate 25-30% lower than a 2+1 lun on full
stripe writes....

> > Something isn't right. I'm guessing it's the controller
> > firmware, maybe the test app, or both. The test app backs off
> > then ramps up when response times at the controller go up and
> > back down. And it's not super accurate or timely about it.
> > The lowest interval setting possible is 10 seconds, which is
> > way too high when a controller goes into congestion.

The controller should not have any problems with this. If the
controller IO response times are varying significantly, then
you're doing something wrong - most probably caching in BBWC
rather than writing through to disk immediately...

> > Does XFS give alignment hints with O_DIRECT writes into
> > preallocated files?

What do you mean? If the file is preallocated and aligned, then
the IO alignment is wholly up to the application. i.e. if the
application is not doing aligned IO, then there's nothing the
filesystem can do to align it...

> > The filesystems were aligned at make time w/768K stripe
> > width, so each prealloc file should be aligned on a stripe
> > boundary.

"should be aligned"? You haven't verified they are aligned with
'xfs_bmap -vp'?
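Something like this, for each file (the path here is just an
example):

  $ xfs_bmap -vp /path/to/prealloc/file

The -v output has a FLAGS column, and xfs_bmap(8) documents flag
bits for extents that don't begin or end on a stripe unit or
stripe width boundary; -p also shows the unwritten preallocated
extents. If any of those alignment bits are set, the files are
not stripe aligned, no matter what mkfs was told.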
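And as a minimal sketch of the "buffer stripe sized writes"
change suggested earlier: accumulate each slow stream's 32KiB
trickle in memory and only issue one full-stripe O_DIRECT write
per 768k. The struct, function names and constants below are
illustrative assumptions, not your test app's actual interface:

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define STRIPE_SIZE	(768 * 1024)	/* assumed stripe width */

struct stream {
	int	fd;		/* opened O_WRONLY | O_DIRECT */
	off_t	offset;		/* next stripe-aligned file offset */
	size_t	used;		/* bytes buffered so far */
	char	*buf;		/* memory-aligned staging buffer */
};

int stream_open(struct stream *s, const char *path)
{
	memset(s, 0, sizeof(*s));
	s->fd = open(path, O_WRONLY | O_DIRECT);
	if (s->fd < 0)
		return -1;
	/* O_DIRECT needs sector-aligned memory; page alignment
	 * is sufficient on typical setups. */
	if (posix_memalign((void **)&s->buf, 4096, STRIPE_SIZE))
		return -1;
	return 0;
}

/* Swallow the small writes; only do IO on a full stripe. */
int stream_write(struct stream *s, const char *data, size_t len)
{
	while (len) {
		size_t n = STRIPE_SIZE - s->used;

		if (n > len)
			n = len;
		memcpy(s->buf + s->used, data, n);
		s->used += n;
		data += n;
		len -= n;

		if (s->used == STRIPE_SIZE) {
			/* Offset and length are both multiples of
			 * the stripe width, so the controller sees
			 * a full stripe write - no RMW needed. */
			if (pwrite(s->fd, s->buf, STRIPE_SIZE,
				   s->offset) != STRIPE_SIZE)
				return -1;
			s->offset += STRIPE_SIZE;
			s->used = 0;
		}
	}
	return 0;
}

Note this ignores the tail of each stream - the final partial
stripe still has to be padded or written through a separate
buffered fd when the stream closes.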
> > I've played with the various queue settings, even tried
> > deadline instead of noop hoping more LBAs could be sorted
> > before hitting the controller. Can't seem to get a repeatable
> > increase. I've got nr_requests at 524288, rq_affinity 2,
> > read_ahead_kb 0 since reads are <20% of the IO, add_random 0,
> > etc. Nothing seems to help really.

nr_requests = 524288? Why do you want to queue half a million
IOs once the CTQ depth has overflowed? That's a major latency
problem right there.

You've got latency problems, so you should be removing any
source of potential or variable latency in the IO stack. e.g.
turning off all IO scheduler queuing, reducing CTQ depth and
using write through caching so you can observe the behaviour of
the raw luns. Strip it right back, then observe...

> Some additional background:
>
> Num. Streams       = 350
> WRITING:
> Num. Write Threads = 100
> Avg. Write Rate    = 72 KiB/s
> Avg. Write Intvl   = 10666.666 ms
> Num. Write Buffers = 426
> Write Buffer Size  = 768 KiB
> Write Buffer Mem.  = 327168 KiB
> Group Write Rate   = 25200 KiB/s
> Avg. Buffer Rate   = 32.812 bufs/s
> Avg. Buffer Intvl. = 30.476 ms
> Avg. Thread Intvl. = 3047.600 ms
>
> The 350 streams are written to 350 preallocated files in
> parallel.

And the layout of those files is? If you don't know the physical
layout of the files and what disks in the storage array they map
to, then you can't determine what the seek times should be. If
you can't work out what the seek times should be, then you don't
know what the stream capacity of the storage should be.

Keep in mind that single extent files are optimised for read
performance, not write performance. i.e. by default XFS trades
off some write performance to improve file read performance.
Optimising for highest write speeds means linearising all writes
(i.e. reducing seeks), while XFS's default behaviour is to
separate them into different regions of the disk (increasing
seeks).

IOWs, write rates are likely to go up if you allow files to be
fragmented and interleaved to make writes more sequential. The
down side is that reads will then seek, but if reads aren't the
primary workload, nor a performance sensitive operation, then
perhaps you're optimising for the wrong operation....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs