On 6/5/2012 11:16 PM, Roman Mamedov wrote:
> On Tue, 05 Jun 2012 15:36:29 -0500
> Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
>
>>> Except this would not make any sense even as a thought experiment. You don't
>>> want a configuration where two or more areas of the same physical disk need to
>>> be accessed in parallel for any read or write to the volume. And it's pretty
>>> easy to avoid that.
>>
>> You make a good point but your backing argument is incorrect: XFS by
>> design, by default, writes to 4 equal sized regions of a disk in parallel.
>
> I said: "...need to be accessed in parallel for any read or write".
>
> With XFS you mean allocation groups, however I don't think that if you write
> any large file sequentially to XFS, it will always cause drive's head to jump
> around between four areas because the file is written "in parallel", striped
> to four different locations, which is the main problem that we're trying to
> avoid.

It depends on which allocator you use.  Inode32, the default allocator,
can cause a sufficiently large file's blocks to be rotored across all
AGs in parallel.  Inode64 writes one file to one AG.

> XFS allocation groups are each a bit like an independent filesystem,

This analogy may be somewhat relevant to the Inode64 allocator, which
stores directory metadata for a file in the same AG where the file is
stored.  But it definitely does not describe the Inode32 allocator
behavior, which stores all metadata in the first 1TB of the FS, and all
file extents above 1TB.  This depends on the total FS size, obviously;
I described the maximal design case here, where the FS is hard limited
to 16TB.

> to allow
> for some CPU- and RAM-access-level parallelization.

The focus of the concurrency mechanisms in XFS has always been on
maximizing disk array performance and flexibility with very large disk
counts and large numbers of concurrent accesses.  Much of the parallel
CPU/memory locality efficiency is a side effect of this, not the main
target of the effort, though there has been some work aimed directly at
that as well.

> However spinning devices
> and even SSDs can't really read or write quickly enough "in parallel", so
> parallel access to different areas of the same device is used in XFS not for
> *any read or write*, but only in those cases where that can be beneficial for
> performance

I just reread that 4 times.  If I'm correctly reading what you stated,
then you are absolutely not correct.  Please read about XFS allocation
group design:

http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/Allocation_Groups.html

and the behavior of the allocators:

http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/xfs-allocators.html

> -- and even then, likely managed carefully either by XFS or by

XFS is completely unaware of actuator placement or any such parameters
internal to a block device.  It operates above the block layer.  It is,
after all, a filesystem.

> lower level of I/O schedulers to minimize head movements.

The Linux elevators aren't going to be able to minimize actuator
movement to any great degree in this scenario, if/when there is
concurrent full stripe write access on all md arrays on the drives.
This problem will likely be further exacerbated if XFS is the
filesystem used on each array.  By default mkfs.xfs creates 16 AGs if
the underlying device is a striped md array.  Thus...

If you have 4 drives and 4 md RAID 10 arrays across 4 partitions on the
drives, then format each with mkfs.xfs defaults, you end up with 64 AGs
in 4 XFS filesystems.
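To make the arithmetic concrete, a rough sketch (the device names,
mount point, and exact agcount figures are illustrative; the actual
default depends on your md geometry and xfsprogs version):

  # 4 striped md arrays built from partitions on the same 4 drives
  mkfs.xfs /dev/md0                      # striped md default: agcount=16
  mkfs.xfs /dev/md1
  mkfs.xfs /dev/md2
  mkfs.xfs /dev/md3
  xfs_info /mnt/array0 | grep agcount    # verify after mounting
  # 4 filesystems * 16 AGs each = 64 AGs contending for 4 actuators

You can force the AG count down at mkfs time, e.g.

  mkfs.xfs -d agcount=4 /dev/md0

which is the kind of manual tweak I mention below.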
With the default Inode32 allocator, you could end up with 4 concurrent
file writes causing 64 actuator seeks per disk.  With average 7.2k SATA
drives this takes about 0.43 seconds (roughly 6-7 ms per seek) to write
64 sectors, 32KB, to each drive: almost half a second for each 128KB
written to all arrays concurrently, and a full second to write 256KB
across the 4 disks.  If you used a single md RAID 10 array, you'd cut
your seek load by a factor of 4.

Now, there are ways to manually tweak such a setup to reduce the number
of AGs and thus seeks (e.g. overriding agcount at mkfs time, as
sketched above), but this is only one of multiple reasons not to use
multiple striped md arrays on the same set of disks, which was/is my
original argument.

--
Stan