On 10/21/2014 06:56 PM, Stan Hoeppner wrote:
>
> On 10/19/2014 05:24 PM, Dave Chinner wrote:
...
>>>> The filesystems were aligned at make time w/768K stripe width, so
>>>> each prealloc file should be aligned on a stripe boundary.
>>
>> "should be aligned"? You haven't verified they are aligned by using
>> 'xfs_bmap -vp'?
>
> If I divide the start of the block range by 192 (768k/4k), the files
> checked so far return a fractional value.  So I assume this means
> these files are not stripe aligned.  What might cause that, given I
> formatted with alignment?

It seems file alignment isn't a big problem after all, as the
controllers are doing relatively few small destages, about 1/250th the
number of full stripe destages.  And some of those are journal and
metadata writes.

...
>>> The 350 streams are written to 350 preallocated files in parallel.
>>
>> And the layout of those files is?  If you don't know the physical
>> layout of the files and what disks in the storage array they map to,
>> then you can't determine what the seek times should be.  If you
>> can't work out what the seek times should be, then you don't know
>> what the stream capacity of the storage should be.

Took some time, but I worked out a rough map of the files.  SG0, SG1,
and SG2 are the large, medium, and small file counts respectively.

AG  SG0  SG1  SG2    AG  SG0  SG1  SG2    AG  SG0  SG1  SG2    AG  SG0  SG1  SG2
 0    0  129  520    11  162  132  514    22  160  131  519    33  164  133  522
 1  164  129  518    12  160  132  520    23  161  132  518    34  161  130  517
 2  164  133  521    13  164  129  522    24  163  131  521    35  162  132  518
 3  159  129  518    14  164  130  522    25  162  129  519    36  161  131  518
 4   92  257  518    15  163  130  522    26  163  128  520    37  158  131  515
 5   91  256  516    16  163  131  521    27  162  130  523    38    0  132  518
 6   91  263  519    17  161  130  518    28  161  130  524    39    0  128  523
 7   92  261  518    18  165  127  520    29  163  129  517    40    0  131  521
 8   91  253  515    19  161  130  517    30  166  129  520    41    0  130  522
 9   94  257  451    20  167  128  525    31  162  129  521    42    0  128  517
10  172  129  455    21  164  130  515    32  161  129  515    43    0  131  516

All 3 file sizes are fairly evenly spread across all AGs, and that is a
problem.  The directory structure is set up so that each group
directory has one subdir per stream and multiple files which are
written in succession as they fill, and we start with the first file in
each directory.  SG0 has two streams/subdirs, SG1 has 50, and SG2 has
350.

Write stream rates:

SG0    2 @ 94.0  MB/s
SG1   50 @  2.4  MB/s
SG2  350 @  0.14 MB/s

This is 357 MB/s aggregate, targeted at a 12+1 RAID5 or 12+2 RAID6, the
former in this case.  In either case we can't maintain this rate.

A ~36-45 hour run writes all files once.  During that time we see the
controller go into congestion hundreds of times.  Wait goes up,
bandwidth goes down, and we drop application buffers because they're on
a timer: if we can't write a buffer within X seconds, we drop it.

The directory/file layout indicates highly variable AG access patterns
throughout the run, thus lots of AG-to-AG seeking, thus seeking across
lots of platter surface all the time.  It also indicates large sweeps
of the actuators when concurrent file accesses land in low and high
numbered AGs.  This tends to explain the relatively stable throughput
some of the time, and the periods of high IO wait and low bandwidth at
other times: too much seek delay with these latter access patterns.

I haven't profiled the application to verify which files are written in
parallel at a given point in the run, but I think that would be a waste
of time given the file/AG distribution we see above.  And I don't have
enough time left on my contract to do it anyway.
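If anyone wants to reproduce or sanity-check the numbers above, here is
a rough, untested sketch of the kind of script that could pull the
per-AG counts and the stripe alignment check together in one pass over
a tree.  It assumes xfs_bmap -v prints EXT, FILE-OFFSET, BLOCK-RANGE,
AG, AG-OFFSET, TOTAL columns with the block range in 512-byte units;
the stripe width constant and the single directory argument are just
placeholders, so adjust to taste.

  #!/usr/bin/env python3
  # Rough sketch: walk a directory tree, run "xfs_bmap -v" on each file,
  # record which AG the first extent lives in and whether its start
  # block sits on a stripe-width boundary.  Assumes the block range is
  # reported in 512-byte units; adjust STRIPE_UNITS if that differs.
  import os
  import re
  import subprocess
  import sys
  from collections import Counter

  STRIPE_BYTES = 768 * 1024            # 768K stripe width from mkfs
  STRIPE_UNITS = STRIPE_BYTES // 512   # stripe width in 512-byte blocks

  # matches e.g. "   0: [0..1535]:  96..1631  0 (96..1631)  1536"
  EXTENT_RE = re.compile(r"\s*0:\s*\[\d+\.\.\d+\]:\s*(\d+)\.\.\d+\s+(\d+)\s")

  def first_extent(path):
      """Return (start_block, ag) of the file's first data extent, or None."""
      try:
          out = subprocess.check_output(["xfs_bmap", "-v", path],
                                        stderr=subprocess.DEVNULL)
      except subprocess.CalledProcessError:
          return None                  # not a regular XFS file, etc.
      for line in out.decode().splitlines():
          m = EXTENT_RE.match(line)
          if m:
              return int(m.group(1)), int(m.group(2))
      return None                      # empty file, or a hole at offset 0

  ag_counts = Counter()
  unaligned = 0
  total = 0

  for root, _dirs, names in os.walk(sys.argv[1]):
      for name in names:
          ext = first_extent(os.path.join(root, name))
          if ext is None:
              continue
          start, ag = ext
          total += 1
          ag_counts[ag] += 1
          if start % STRIPE_UNITS:
              unaligned += 1

  print("%d files, %d not stripe aligned" % (total, unaligned))
  for ag in sorted(ag_counts):
      print("AG %3d: %4d files" % (ag, ag_counts[ag]))

Run once per group tree (the SG0, SG1, and SG2 directories in turn) it
should give counts comparable to the table above, plus a quick answer
to the alignment question further up the thread.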
I can attach tree or 'ls -lavR' output if that would help paint a
clearer picture of how the filesystem is organized.

Thanks,
Stan