On 3/12/2014 10:21 AM, Martin T wrote:
> Stan,
>
> you said that "In flight IO size has no correlation to stripe and
> chunk size. What you

In flight IO is defined as that between DRAM and the HBA ASIC using DMA
scatter/gather, and that between the HBA and individual disk devices.
The DMA IO size varies widely between HBAs; the largest I've seen is
~320KB. One can determine this using blktrace, though that isn't
required for this discussion. The in flight IO size between the HBA and
a disk device varies depending on the technology, whether SAS, ATA,
Fibre Channel, iSCSI, etc. Fibre Channel frames are 2112 bytes.

The point is that the in flight IO size is significantly smaller than a
full stripe width, and smaller than the current default md chunk size
of 512KB, or any conceivable chunk size. These IOs are performed by
hardware and are transparent to the OS and applications. It should be
obvious that you'd never try to align chunks to the in flight IO size.
This hardware doesn't care. It's the RAID layer, which sits well above
the hardware, that cares.

WRT in flight IO, I believe I was responding to someone talking about
optimizing the md chunk size to the in flight IO size or similar. It's
not quoted in the context and it's not worth my time to track it down.

> need to know is how your application(s) write to the filesystem and how
> your filesystem issues write IOs.". Could you please explain this?

App creates a file with open(2) and writes 4KB every 15 seconds.
App creates a file with open(2) and writes 4KB every 1.5 seconds.
App creates a file with open(2) and writes 4KB every 0.5 seconds.
App creates a file with open(2) and writes 4KB every 0.01 seconds.
App creates a file with open(2) and writes 4KB every 0.001 seconds.

Assume a stripe width of 8x512KB=4MB. Depending on the filesystem
driver, whether EXT3/4, JFS, or XFS, the amount of time it will wait to
assemble a full aligned stripe from incoming writes will dictate
whether it writes a full stripe to the block layer. In the first three
cases the filesystem won't align a full stripe because the timer will
expire first, so you'll get RMW in the RAID layer with parity RAID. In
the second to last case, 400KB/s, you'll get full stripe alignment if
the FS timer is 10s or more. At 4MB/s you'll always get full stripe
aligned writeout.

All of this assumes the app is performing only buffered IO. If it
issues fsync() or fdatasync(), or uses O_DIRECT, then depending on when
and how it does so, you may get partial stripe writes where you got
full stripe writes with buffered IO.
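If you want to see this for yourself, here's a rough, untested sketch
of the kind of writer I'm describing. The file name, BUF_SZ, and
INTERVAL_US are arbitrary; tune the interval to hit the rates above and
watch the writeout with iotop or blktrace:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUF_SZ      4096      /* one 4KB write per interval */
#define INTERVAL_US 10000     /* 0.01s between writes, roughly 400KB/s */

int main(void)
{
    char buf[BUF_SZ];
    memset(buf, 'x', sizeof buf);

    /* plain buffered append, just like the cases above */
    int fd = open("testfile", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    for (;;) {                /* run it under iotop, stop with ctrl-C */
        if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) {
            perror("write");
            break;
        }
        /* fdatasync(fd); */  /* uncomment to flush each 4KB write;
                                 depending on timing this can turn the
                                 full stripe writes you'd get with
                                 buffered IO into partial ones */
        usleep(INTERVAL_US);
    }

    close(fd);
    return 0;
}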
> I would think that it's possible to measure how applications read/write
> to file system, isn't it?

Sure. If it's an allocation workload you simply look at iotop, which
will tell you the data rate. If it's an append workload, in the case of
XFS anyway, this is irrelevant as XFS doesn't do write alignment for
non-allocation writes. Here full stripe assembly of append data is up
to the RAID layer and its timer. If the application is doing random
writes you already know that all of this is irrelevant.

If you need further information or instruction on application IO
profiling you'll need to read one of the books written on the topic, or
enroll in one of the many courses offered at various colleges and
universities. It is simply way beyond the scope of an email discussion.

Cheers,

Stan

>
>
> regards,
> Martin
>
>
> On 3/9/14, Bill Davidsen <davidsen@xxxxxxx> wrote:
>> Stan Hoeppner wrote:
>>> On 3/7/2014 9:15 PM, Martin T wrote:
>>>> Stan,
>>>>
>>>> ok, I see. However, are there utilities out there which help one to
>>>> analyze how applications on a server use the file-system over the time
>>>> and help to make an educated decision regarding the chunk size?
>>>
>>> My apologies. You're a complete novice and I'm leading you down the
>>> textbook storage architectural design path. Let's short circuit that,
>>> as I don't have the time.
>>>
>>> As you're starting from zero, let me give you what works best with 99%
>>> of workloads: use a chunk size of 32KB or 64KB. Such a chunk will work
>>> extremely well with any singular or mixed workload, on parity and
>>> non-parity RAID. The only workload that should have a significantly
>>> larger chunk than this is a purely streaming allocation workload of
>>> large files.
>>>
>>> If you want a more technical explanation, you can read all of my
>>> relevant posts in the linux-raid or XFS archives, as I've explained
>>> this hundreds of times in great detail. Or you can wait a few months
>>> to read the kernel documentation I'm working on, which will teach the
>>> reader the formal storage stack design process, soup to nuts. I wish
>>> it was already finished, as I could simply paste the link for you,
>>> which, coincidentally, is the exact reason I'm writing it. :)
>>>
>> Thank you Stan, hopefully you cover typical mixed use cases. I split my
>> physical drives with partitions and built large chunk arrays on one set
>> and small on the other, to cover my use cases of editing large video
>> files and compiling kernels and large apps.
>>
>> The ext4 extended options stride= and stripe-width= can produce
>> improvements in performance, particularly when writing a large file on
>> an array with a small chunk size. My limited tests showed this helped
>> more with raid6 than raid5. Since you're writing a document you can
>> include that or not as it pleases you.
>>
>> --
>> Bill Davidsen <davidsen@xxxxxxx>
>> "We have more to fear from the bungling of the incompetent than from
>> the machinations of the wicked." - from Slashdot
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
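(Worked example for Bill's stride=/stripe-width= note; the numbers are
illustrative only, not from the thread. ext4's stride is the md chunk
size in filesystem blocks, and stripe-width is stride times the number
of data disks. With a 64KB chunk, 4KB blocks, and a 10-drive RAID6,
which has 8 data disks: stride = 64/4 = 16 and stripe-width = 16 * 8 =
128, i.e. mkfs.ext4 -E stride=16,stripe-width=128 /dev/mdX.)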