Hello,

This is the third draft, mostly updated with the information gathered
from the last discussion thread. The biggest changes are

1. Windows XP is generally fine with any alignment.

2. Upstream tools have already been updated to do proper aligning.

So, the situation seems much better than I originally feared and as
long as new distro releases ship with properly updated tools,
everything should work.

4KiB logical sector size support and whether any tool would have
problems with >32bit LBAs (>2TiB w/ 512 byte logical sector size) are
still unclear to me. If you know, please let me know.

I'll update the wiki page accordingly soonish.

  http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues

Thank you.

Background
==========

Up until recently, all ATA hard drives have been organized in 512 byte
sectors. For example, my 500GB or 466GiB hard drive is organized in
976773168 512 byte sectors numbered from 0 to 976773167. This is how
a drive communicates with the driver. When the operating system wants
to read 32KiB of data at the 1MiB position, the driver asks the drive
to read 64 sectors from LBA (Logical block address, sector number)
2048.

Because each sector should be addressable, readable and writable
individually, the physical medium also is organized in the same sized
sectors. In addition to the area to store the actual data, each
sector requires extra space for book keeping - inter-sector space to
enable locating and addressing each sector and ECC data to detect and
correct inevitable raw data errors.

As the densities and capacities of hard drives keep growing, stronger
ECC becomes necessary to guarantee an acceptable level of data
integrity, which increases the space overhead. In addition, in most
applications, hard drives are now accessed in units of at least 8
sectors or 4096 bytes and maintaining 512 byte granularity has become
somewhat meaningless.
This reached a point where enlarging the sector size to 4096 bytes
would yield measurably more usable space given the same raw data
storage size and hard drive manufacturers are transitioning to 4KiB
sectors.

Anandtech has a good article which illustrates the background and
issues with pretty diagrams[1].

Physical vs. Logical
====================

Because the 512 byte sector size has been around for a very long time
and up to ATA/ATAPI-7 the sector size was fixed at 512 bytes, the
sector size assumption is scattered across all the layers -
controllers or bridge chips snooping commands, BIOSs, boot codes,
drivers, partitioners and system utilities, which makes it very
difficult to change the sector size from 512 bytes without breaking
backward compatibility massively.

As a workaround, the concept of logical sector size was introduced.
The physical medium is organized in 4KiB sectors but the firmware on
the drive will present it as if the drive is composed of 512 byte
sectors thus making the drive behave as before, so if the driver asks
the hard drive to read 64 sectors from LBA 2048, the firmware will
translate it and read 8 4KiB sectors from hardware sector 256.

As a result, the hard drive now has two sector sizes - the physical
one which the physical media is actually organized in, and the logical
one which the firmware presents to the outside world. A
straightforward example mapping between physical sector and LBA would
be

  LBA = 8 * phys_sect

Alignment problem on 4KiB physical / 512 logical drives
=======================================================

This workaround keeps older hardware and software working while
allowing the drive to use a larger sector size internally. However,
the discrepancy between physical and logical sector sizes creates an
alignment issue. For example, if the driver wants to read 7 sectors
from LBA 2047, the firmware has to read hardware sectors 255 and 256
and trim the leading 7*512 bytes and the trailing 2*512 bytes.
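
To make the translation above concrete, here is a minimal Python
sketch of the straightforward mapping (LBA = 8 * phys_sect). The
function name and return layout are mine and purely illustrative; real
firmware is of course free to do this differently.

```python
LSS = 512                # logical sector size in bytes
PSS = 4096               # physical sector size in bytes
PER_PHYS = PSS // LSS    # 8 logical sectors per physical sector


def physical_span(lba, count):
    """Map a logical access to the physical sectors it touches.

    Returns (first_phys, last_phys, lead_bytes, trail_bytes) where
    lead_bytes/trail_bytes are the parts of the first and last
    physical sectors which are not part of the request.
    """
    first = lba // PER_PHYS
    last = (lba + count - 1) // PER_PHYS
    lead = (lba - first * PER_PHYS) * LSS
    trail = ((last + 1) * PER_PHYS - (lba + count)) * LSS
    return first, last, lead, trail


# Aligned access: 64 sectors from LBA 2048, nothing to trim.
print(physical_span(2048, 64))   # (256, 263, 0, 0)

# Misaligned access: 7 sectors from LBA 2047 straddles two sectors.
print(physical_span(2047, 7))    # (255, 256, 3584, 1024)
```

Whenever lead_bytes or trail_bytes is non-zero, the access does not
cover whole physical sectors. For the 7 sector access at LBA 2047 the
leading trim is 7*512 bytes and the trailing trim comes out to two
logical sectors (1024 bytes).
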
For reads, this isn't an issue as drives read in larger chunks anyway
but for writes, the drive has to do read-modify-write to achieve the
requested action. It has to first read hardware sectors 255 and 256,
update the requested parts and then write back those sectors, which
can cause significant performance degradation[2].

The problem is aggravated by the way DOS partitions[3] have been laid
out traditionally. For reasons dating back more than two decades,
they are laid out considering something called disk geometry, which
nowadays consists of arbitrary values with a number of restrictions
for backward compatibility accumulated over the years. The end result
is that until recently (most Linux variants and up to Windows XP) the
first partition ends up on sector 63 and later ones on cylinder
boundaries where each cylinder usually is composed of 255 * 63
sectors.

Most modern filesystems generate 4KiB aligned accesses relative to the
start of the partition they are in. If a drive maps 4KiB physical
sectors to 512 byte logical sectors from LBA 0, the filesystem in the
first partition will always be misaligned and filesystems in later
partitions are likely to be misaligned too.

Solving the alignment problem on 4KiB physical / 512 logical drives
===================================================================

There are multiple ways which attempt to solve the problem.

S-1. Yet another workaround from the firmware - offset-by-one.

Yet another workaround which can be done by the firmware is to offset
the physical to logical mapping by one logical sector such that LBA 63
ends up on a physical sector boundary, which aligns the first
partition to physical sectors without requiring any software update.
The example mapping between phys_sect and LBA becomes

  LBA = 8 * phys_sect - 1

The leading 512 bytes of phys_sect 0 are not used and LBA 0 starts
right after that point. phys_sect 1 maps to LBA 7 and phys_sect 8 to
63, making LBA 63 aligned on a hardware sector.
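
The two example mappings can be compared with a few lines of Python.
This is only a sketch of the arithmetic given above; the helper names
are made up for illustration.

```python
LOGICAL_PER_PHYSICAL = 8   # 4096 byte physical / 512 byte logical


def phys_start_lba(phys_sect, offset_by_one=False):
    """Example mapping from the text: LBA = 8 * phys_sect [- 1].

    With offset-by-one, phys_sect 0 yields -1, which stands for the
    unused leading 512 bytes of the medium.
    """
    return LOGICAL_PER_PHYSICAL * phys_sect - (1 if offset_by_one else 0)


def is_phys_aligned(lba, offset_by_one=False):
    """True if the LBA sits on a physical sector boundary."""
    shift = 1 if offset_by_one else 0
    return (lba + shift) % LOGICAL_PER_PHYSICAL == 0


print(phys_start_lba(1, offset_by_one=True))   # 7
print(phys_start_lba(8, offset_by_one=True))   # 63

for lba in (63, 2048):
    print(lba, is_phys_aligned(lba), is_phys_aligned(lba, offset_by_one=True))
# 63 False True
# 2048 True False
```

Note that the same arithmetic which makes LBA 63 aligned also makes a
start such as LBA 2048 misaligned once offset-by-one is turned on, so
the two configurations need different starting LBAs.
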

Although this aligns only the first partition, for many use cases,
especially the ones involving older software, this workaround was
deemed useful and some recent drives with 4KiB physical sectors are
equipped with a dip switch to turn on or off offset-by-one mapping.

S-2. The proper solution.

Correct alignments for all partitions can't be achieved by the
firmware alone. The system utilities should be informed about the
alignment requirements and align partitions accordingly.

The above firmware workaround complicates the situation because the
two different configurations require different offsets to achieve the
correct alignments. ATA/ATAPI-8 specifies a way for a drive to export
the physical and logical sector sizes and the LBA offset which is
aligned to the physical sectors.

In Linux, these parameters are exported via the following sysfs nodes.

  physical sector size : /sys/block/sdX/queue/physical_block_size
  logical sector size  : /sys/block/sdX/queue/logical_block_size
  alignment offset     : /sys/block/sdX/alignment_offset

Let the physical sector size be PSS, logical sector size LSS and
alignment offset AOFF. The system software should place partitions
such that the starting LBAs of all partitions are aligned on

  (n * PSS + AOFF) / LSS

For 4KiB physical sector offset-by-one drives, PSS is 4096, LSS 512
and AOFF 3584 and with n of 7 the above becomes,

  (7 * 4096 + 3584) / 512 == 63

making sector 63 an aligned LBA where the first partition can be put,
but without the offset-by-one mapping, AOFF is zero and LBA 63 is not
aligned.

With the above new alignment requirement in place, it becomes
difficult to honor the legacy one - first partition on sector 63 and
all other partitions on cylinder boundary (255 * 63 sectors) - as the
two alignment requirements contradict each other.
This might be worked around by adjusting how LBA and CHS addresses are
mapped but the disk geometry parameters are hard coded in some places
and there is no reliable way to communicate custom geometry
parameters.

Complications
=============

Unfortunately, there are complications.

C-1. The standard is not and won't be followed as-is.

Some of the existing BIOSs and/or drivers can't cope with drives which
report 4KiB physical sector size. To work around this, some drive
models lie that their physical sector size is 512 bytes when the
actual configuration is 4KiB without offsetting.

This nullifies the provisions for alignment in the ATA standard but
results in the correct alignment for Windows Vista and 7. OS
behaviors will be described further later.

For these drives, which are likely to continue to be shipped for the
foreseeable future, traditional LBA 63 and cylinder based aligning
results in misalignment.

C-2. The 2TiB barrier and the possibility for 4KiB logical sector size.

The DOS partition format uses 32 bits for the starting LBA and the
number of sectors and, reportedly, 32 bit Windows XP shares the
limitation. With 32 bit addressing and 512 byte logical sector size,
the maximum addressable sector + 1 is at

  2^32 * 2^9 == 2^41 == 2TiB

The DOS partition format allows a partition to reach beyond 2TiB as
long as the starting LBA is under 2TiB; however, both Windows XP and
the Linux kernel (at least up to v2.6.33) refuse such partition
configurations.

With the right combination of host controller, BIOS and driver, this
barrier can be overcome by enlarging the logical sector size to 4KiB,
which will push the barrier out to 16TiB. On the right configuration,
Windows XP is reportedly able to address beyond the 2TiB barrier with
a DOS partition and 4KiB logical sector size. The Linux kernel up to
v2.6.33 doesn't work under such configurations but a patch to make it
work is pending[4].
This might also be somewhat beneficial for operating systems which
don't suffer from this limitation. A different partition format -
GPT[5] - should be used beyond 2^32 sectors, which could harm
compatibility with other operating systems which don't recognize the
new format.

As mentioned previously, the 512 byte sector assumption has existed
for a very long time and changing it might cause various
compatibility problems at different layers. It has been suggested
that 4KiB logical sector size might be primarily useful for external
(USB or otherwise) drives.

Windows
=======

As hard drive vendors aim for performance and compatibility in modern
Windows environments, it is worthwhile to investigate how Windows
behaves and partitions with different alignment requirements.

Although there seem to be some issues with certain BIOS settings[6],
any release after and including Windows XP does not depend on
traditional partition alignment and can boot from partitions with any
alignment. The reported problem seems to be caused by the BIOS trying
to guess the geometry by reading from the partition table instead of
using the de-facto geometry of 255 * 63 and can be worked around by
either changing the BIOS configuration or applying a hotfix.

It is reported that Windows 2000 depends on the traditional partition
layout and will not work properly on partitions aligned differently.
When partitioning for Windows 2000, it will be necessary to follow the
traditional partition layout; however, given the largely diminished
Windows 2000 user-base, this won't be a big problem. Having a way to
manually choose traditional alignment should be enough.

When asked to partition hard drives, up until Windows XP, Windows
followed the traditional layout - the first partition on LBA 63 and
the others on cylinder boundaries where a cylinder is defined as 255
tracks with 63 sectors each.

Windows Vista and 7 align partitions differently. As the two behave
similarly, only 7's behavior is shown here.
These partition tables were created by the Windows 7 RC installer on
blank disks.

W-1. 512 byte physical and logical sector drive.

 ST FIRST  T  LAST   LBA      NBLKS
 80 202100 07 df130c 00080000 00200300
 00 df140c 07 feffff 00280300 00689e12
 00 000000 00 000000 00000000 00000000
 00 000000 00 000000 00000000 00000000

 Part0: FIRST   C    0 H  32 S 33 : 2048      (63 sec/trk)
        LAST    C   12 H 223 S 19 : 206847    (255 heads/cyl)
        LBA 2048 + 204800 = 206848

 Part1: FIRST   C   12 H 223 S 20 : 206848
        LAST    C 1023 H 254 S 63 : E
        LBA 206848 + 312371200 = 312578048

 Both aligned at (2048 * n). Part 1 not aligned to cylinder.

W-2. 4KiB physical and 512 byte logical sector drive without
     offset-by-one.

 ST FIRST  T  LAST   LBA      NBLKS
 80 202100 07 df130c 00080000 00200300
 00 df140c 07 feffff 00280300 00b83f25
 00 000000 00 000000 00000000 00000000
 00 000000 00 000000 00000000 00000000

 Part0: FIRST   C    0 H  32 S 33 : 2048      (63 sec/trk)
        LAST    C   12 H 223 S 19 : 206847    (255 heads/cyl)
        LBA 2048 + 204800 = 206848

 Part1: FIRST   C   12 H 223 S 20 : 206848
        LAST    C 1023 H 254 S 63 : E
        LBA 206848 + 624932864 = 625139712

 Both aligned at (2048 * n). Part 1 not aligned to cylinder.

W-3. 4KiB physical and 512 byte logical sector drive with
     offset-by-one.

 ST FIRST  T  LAST   LBA      NBLKS
 80 202800 07 df130c 07080000 f91f0300
 00 df1b0c 07 feffff 07280300 f9376d74
 00 000000 00 000000 00000000 00000000
 00 000000 00 000000 00000000 00000000

 Part0: FIRST   C    0 H  32 S 40 : 2055      (63 sec/trk)
        LAST    C   12 H 223 S 19 : 206847    (255 heads/cyl)
        LBA 2055 + 204793 = 206848

 Part1: FIRST   C   12 H 223 S 27 : 206855
        LAST    C 1023 H 254 S 63 : E
        LBA 206855 + 1953314809 = 1953521664

 Both aligned at (2048 * n + 7). Part 1 not aligned to cylinder.

The partitioner seems to be using 1MiB as the basic alignment unit and
offsetting from there if explicitly requested by the drive and there
is no difference between the handling of 512 byte and 4KiB drives,
which explains why C-1 works for hard drive vendors.
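
For anyone who wants to double check the decoded values, the raw
entries above can be decoded with a short Python sketch. It assumes
the usual 16 byte DOS partition entry layout (status, CHS of first
sector, type, CHS of last sector, little-endian starting LBA and
sector count) and the 255 * 63 geometry noted in the dumps. The helper
names are mine.

```python
import struct

HEADS, SECTORS = 255, 63   # de-facto geometry used in the dumps above


def decode_chs(raw):
    """Decode a packed 3 byte CHS field into (cylinder, head, sector)."""
    head = raw[0]
    sector = raw[1] & 0x3F
    cylinder = ((raw[1] & 0xC0) << 2) | raw[2]
    return cylinder, head, sector


def chs_to_lba(cylinder, head, sector):
    """Convert CHS to LBA using the 255 * 63 geometry."""
    return (cylinder * HEADS + head) * SECTORS + (sector - 1)


def decode_entry(dump_line):
    """Decode one table row as printed above (ST FIRST T LAST LBA NBLKS)."""
    raw = bytes.fromhex("".join(dump_line.split()))
    status, first, ptype, last, lba, nblks = struct.unpack("<B3sB3sII", raw)
    return {
        "bootable": status == 0x80,
        "type": ptype,
        "first_chs": decode_chs(first),
        "last_chs": decode_chs(last),
        "lba": lba,
        "nblks": nblks,
    }


w1_part0 = decode_entry("80 202100 07 df130c 00080000 00200300")
w3_part1 = decode_entry("00 df1b0c 07 feffff 07280300 f9376d74")

# Starting LBA, sector count and offset from the 1MiB (2048 sector) grid.
print(w1_part0["lba"], w1_part0["nblks"], w1_part0["lba"] % 2048)
# 2048 204800 0
print(w3_part1["lba"], w3_part1["nblks"], w3_part1["lba"] % 2048)
# 206855 1953314809 7

# The CHS start agrees with the LBA start under the 255 * 63 geometry.
print(chs_to_lba(*w1_part0["first_chs"]))
# 2048
```

The remainder of 7 for the W-3 entry is the (2048 * n + 7) placement
noted above.
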

In all cases, the partitioner ignores both the first partition on LBA
63 and the others on cylinder boundary requirements while still using
the same 255 * 63 cylinder size. Also, note that in W-3, both part 0
and 1 end up with an odd number of sectors. It seems that they simply
decided to completely break away from the traditional layout, which is
understandable given that there really isn't one good solution which
can cover all the cases and that the default larger alignment benefits
earlier SSDs.

Windows Vista basically shows the same behavior. Vista was tested by
creating two partitions using the management tool. Test data is
available at [7].

 *-alignment_offset : alignment_offset reported by Linux kernel
 *-fdisk            : fdisk -l output
 *-fdisk-u          : fdisk -lu output
 *-hdparm           : hdparm -I output
 *-mbr              : dump of mbr
 *-part             : decoded partition table from mbr

Please note that hdparm is misreporting the alignment offset. It
should be reporting 512 instead of 256 for offset-by-one drives. This
problem is fixed by version 9.28.

Where Linux stands
==================

Considering all the factors, the best workable solution seems to be
doing what Windows is doing. Hard drive and SSD vendors are focusing
on compatibility and performance on recent Windows releases and are
happy to do things which break the standard defined mechanism as shown
by C-1, so departing from what Windows does would be unnecessarily
painful.

Other than giving an option to use the traditional layout for Windows
releases <= 2000, always using larger alignment will achieve properly
aligned partitions and acceptable compatibility.

Most of the information in this section comes from the discussion
thread reviewing an early draft of this document[8] and the following
two documents.

 I/O Limits: block sizes, alignment and I/O hints - Mike Snitzer [9]
 Linux & Advanced Storage Interfaces - Martin K. Petersen [10]

L-1. Kernel support

Various storage parameters including physical and logical sector sizes
and alignment requirements are exported via IO limits and storage
topology support. The kernel gathers all the relevant parameters,
combines them according to storage organization and exports them to
userspace.

As of v2.6.33, the support covers most of the Linux I/O stacks
including but not limited to ATA and any mass storage device driven by
the SCSI disk driver and complex devices composed using MD, DM and
LVM. IO topology support is being extended to cover virtualized
storage devices.

As of v2.6.33, Linux ATA drivers do not support drives with 4KiB
logical sector size although there is a development branch containing
experimental support[11]. For ATA drives connected via bridges to
different buses - USB and IEEE 1394, as long as the bridges support
4KiB logical sector size correctly, the SCSI disk driver can handle
them. There currently is a limitation in DOS partition handling which
prevents DOS partitions from growing beyond 2TiB even with 4KiB sector
size but this is being worked on[4].

L-2. Userspace tools status (thanks to Karel Zak[12])

 * libblkid provides unified API to topology information, it supports:
   * ioctls (kernel >= 2.6.32)
   * sysfs (kernel >= 2.6.31)
   * stripe chunk size and stripe width for DM, MD,
     LVM and evms on old kernels

 * libparted and fdisk are linked against libblkid

 * fdisk supports 4KiB logical sector size (util-linux-ng >= 2.15)

 * fdisk supports 4KiB physical sector size (util-linux-ng >= 2.17)

 * fdisk uses 1MiB alignment (or more if optimal I/O size is bigger)
   and alignment_offset for all partitions in non-DOS mode
   (util-linux-ng >= 2.17.1)

 * parted supports 4KiB physical sector size

 * parted uses 1MiB alignment for disks with unknown topology, disks
   with topology information are aligned to optimal (or minimum) I/O
   size (parted >= 2.1)

 * The latest news on parted status can be found here[13]

 * EFI GPT code in the kernel has been updated to work properly with
   4KiB sectors (kernel >= 2.6.33)

 * mkfs.{ext,xfs,gfs2,ocfs2} have been updated to work properly with
   topology information, mkfs.{ext,xfs} are linked against libblkid
   for compatibility with old kernels (for stripe chunk size / width)

 * Fedora-13/RHEL6 installer uses libparted with 4KiB support

 * alignment_offset & 4KiB support is planned for LUKS (cryptsetup)

Overall, distributions being released after Spring of 2010 with the
updated tools shouldn't have much problem aligning and dealing with
4KiB physical sector drives. If you are working on or testing a
distro, please make sure all storage related tools are up-to-date and
aligning disks properly.

L-3. Booting and boot loaders

On traditional PC configurations, Linux booting is done in several
stages. The BIOS should be able to probe and access the drive. It
reads the MBR off the drive and passes control to it. The MBR
contains the initial chunk of the bootloader and reads more data
(often off the same drive) necessary for booting - usually further
stages of the boot loader. This process repeats as necessary until
the kernel and module images are loaded and control is passed to the
kernel.

There can be different issues at various layers. At the BIOS level,
the following problems have been reported or are suspected.

 * Some reportedly have issues accessing drives which report a
   hardware sector size larger than 512 bytes even if the logical
   sector size remains 512 bytes (see C-1).

 * INT13h EDD uses 64bit LBA but some BIOSs might have problems with
   accessing drives which have higher capacity than 2TiB (32 bit
   limit).

 * Depending on the BIOS configuration, some read the partition table
   and solve CHS/LBA equations to figure out the geometry used during
   partitioning, which seems to cause compatibility problems with
   partitions which don't consider geometry alignment at all[6].

 * It's reasonable to suspect that some (or rather, many) BIOSs
   wouldn't be able to access or boot off ATA drives with 4KiB logical
   sector size.

Despite the various problems, in general, all a BIOS needs to do to
boot from a hard drive is read the MBR off it and as long as the
logical block size remains at 512 bytes, most BIOSs should be able to
boot off large and/or differently aligned drives.

On top of working BIOS access to the drives, boot loaders may have
additional dependencies. For example, GRUB needs to understand the
partition table format and the filesystem itself to retrieve the
kernel image and modules, while LILO hard codes LBAs of needed blocks
and thus doesn't care about how the blocks are logically organized.

 * As long as the BIOS can access the hard drive, LILO should be able
   to boot regardless of partition table format or alignment.
   However, it is as yet unknown whether there would be hidden issues
   with >2TiB hard drives or 4KiB logical sector size (if you know or
   have tested, please let me know).

 * GRUB is not affected by partition alignment. According to the
   GRUB2 wiki Current Status page, it supports GPT and presumably
   >2TiB disks. It is unclear how 4KiB logical sector size would work
   (please let me know). Support status for GRUB legacy (0.9.x) is
   rather unclear but it seems to require a patch to make GPT work.
   >2TiB support status is unclear (again...).

 * H. Peter Anvin reports that syslinux should work fine with any
   alignment and GPT with gptmbr.bin installed[14]. 4KiB logical
   sector support has bit-rotted but he intends to update it[15].
   >2TiB support status is unclear (please let me know).

Random thoughts and comments (mostly for distros)
=================================================

 * All upstream partitioning tools have been updated properly
   regarding alignment. They either already default to larger
   alignment or are scheduled to switch to it. For new releases,
   please make sure all the tools are up-to-date and larger alignment
   rules are in effect. Windows >= XP wouldn't have any problem
   sharing or booting from partitions prepared with larger alignment,
   so compatibility implications will not be major. Providing a
   mechanism to force legacy cylinder alignment or describing a way to
   manually create partitions with the legacy layout should be enough.

 * In newer releases of fdisk (util-linux-ng >= 2.17.1), traditional
   cylinder based alignment can be requested by turning on the DOS
   Compatibility flag (the 'c' command).

 * In case INT13h EDD has problems accessing sectors beyond 2TiB, it
   would be better to put data necessary for booting inside a boot
   partition which is contained inside the 2TiB limit.

 * GPT is unavoidable for 512 byte logical sector drives which are
   larger than 2TiB and there are clear advantages of GPT such as
   better protection against corruption and lack of artificial
   distinctions between primary and extended/logical partitions. When
   compatibility with older software is not an issue, it could be
   better to default to GPT.

 * Support status for drives >2TiB and 4KiB logical sector size seems
   unclear. It will be great if we can get proper prototype hardware
   into upstream developers' hands and make sure the software side is
   ready before the actual products hit the market.

Document history
================

 * Mar 04 2010 Tejun Heo <tj@xxxxxxxxxx>
   Initial draft.
 * Mar 08 2010 Tejun Heo <tj@xxxxxxxxxx>
   Updated according to comments from Daniel Taylor
   <Daniel.Taylor@xxxxxxx>. Other minor updates.

 * Mar 15 2010 Tejun Heo <tj@xxxxxxxxxx>
   Updated according to various comments from discussions[8] on LKML
   and linux-ide.

References
==========

[1] http://www.anandtech.com/storage/showdoc.aspx?i=3691
[2] http://www.osnews.com/story/22872/Linux_Not_Fully_Prepared_for_4096-Byte_Sector_Hard_Drives
[3] http://en.wikipedia.org/wiki/Master_boot_record
[4] http://thread.gmane.org/gmane.linux.kernel/953981
[5] http://en.wikipedia.org/wiki/GUID_Partition_Table
[6] http://support.microsoft.com/kb/931760
[7] http://userweb.kernel.org/~tj/partalign/
[8] http://thread.gmane.org/gmane.linux.ide/45211
[9] http://people.redhat.com/msnitzer/docs/io-limits.txt
[10] http://oss.oracle.com/~mkp/docs/linux-advanced-storage.pdf
[11] git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git sectsize
[12] http://article.gmane.org/gmane.linux.ide/45228
[13] http://git.debian.org/?p=parted/parted.git;a=blob;f=NEWS
[14] http://article.gmane.org/gmane.linux.ide/45293
[15] http://article.gmane.org/gmane.linux.ide/45214