cc'ing Martin Petersen since I believe he is one of the most
knowledgeable kernel hackers on this topic and has been working on the
issue for the last year.

On Sun, Mar 7, 2010 at 10:48 PM, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Hello, guys.
>
> It looks like the transition to ATA 4k drives will be quite painful
> and we aren't really ready although these drives are already selling
> widely. I've written up a summary document on the issue to clarify
> stuff as it's getting more and more confusing and to develop some
> consensus. It's also on the linux ata wiki.
>
>  http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
>
> I've cc'd people whom I can think of off the top of my head but I
> surely have missed some people who would have been interested. Please
> feel free to add cc's or forward the message to other MLs.
> Especially, I don't know much about partitioners so the details there
> are pretty shallow and could be plain wrong. It would be great if
> someone who knows more about this stuff can chime in.
>
> Thanks.
>
> === Document follows ===
>
> ATA 4 KiB sector issues
>
> Background
> ==========
>
> Up until recently, all ATA hard drives have been organized in 512 byte
> sectors. For example, my 500 GB (about 466 GiB) hard drive is composed
> of 976773168 512 byte sectors numbered from 0 to 976773167. This is
> how a drive communicates with the driver. When the operating system
> wants to read 32 KiB of data at 1 MiB position, the driver asks the
> drive to read 64 sectors from LBA (Logical block address, sector
> number) 2048.
>
> Because each sector should be addressable, readable and writable
> individually, the physical medium also is organized in the same sized
> sectors. In addition to the area to store the actual data, each
> sector requires extra space for bookkeeping - inter-sector space to
> enable locating and addressing each sector and ECC data to detect and
> correct inevitable raw data errors.
>
> As the densities and capacities of hard drives keep growing, stronger
> ECC becomes necessary to guarantee an acceptable level of data
> integrity, increasing the space overhead. In addition, in most
> applications, hard drives are now accessed in units of at least 8
> sectors or 4096 bytes and maintaining 512 byte granularity has become
> somewhat meaningless.
>
> This reached a point where enlarging the sector size to 4096 bytes
> would yield measurably more usable space given the same raw data
> storage size and hard drive manufacturers are transitioning to 4 KiB
> sectors.
>
> Anandtech has a good article which illustrates the background and
> issues with pretty diagrams[1].
>
>
> Physical vs. Logical
> ====================
>
> Because the 512 byte sector size has been around for a very long time
> and up to ATA/ATAPI-7 the sector size was fixed at 512 bytes, the
> sector size assumption is scattered across all the layers -
> controllers or bridge chips snooping commands, BIOSes, boot code,
> drivers, partitioners and system utilities, which makes it very
> difficult to change the sector size from 512 bytes without breaking
> backward compatibility massively.
>
> As a workaround, the concept of logical sector size was introduced.
> The physical medium is organized in 4 KiB sectors but the firmware on
> the drive will present it as if the drive is composed of 512 byte
> sectors thus making the drive behave as before, so if the driver asks
> the hard drive to read 64 sectors from LBA 2048, the firmware will
> translate it and read 8 4 KiB sectors from hardware sector 256. As a
> result, the hard drive now has two sector sizes - the physical one
> which the physical media is actually organized in, and the logical one
> which the firmware presents to the outside world.
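
To make the translation above concrete, here is a minimal Python sketch of the arithmetic. It is an illustration only, not how any firmware is actually written; the function names are made up for this example and it assumes that LBA 0 starts exactly at physical sector 0.

```python
LOGICAL_SECTOR_SIZE = 512     # bytes per logical sector presented by the firmware
PHYSICAL_SECTOR_SIZE = 4096   # bytes per physical sector on the medium
LOGICAL_PER_PHYSICAL = PHYSICAL_SECTOR_SIZE // LOGICAL_SECTOR_SIZE  # 8


def bytes_to_lba(byte_offset, byte_count):
    """Convert a byte range into (starting LBA, number of logical sectors)."""
    if byte_offset % LOGICAL_SECTOR_SIZE or byte_count % LOGICAL_SECTOR_SIZE:
        raise ValueError("request is not a whole number of logical sectors")
    return byte_offset // LOGICAL_SECTOR_SIZE, byte_count // LOGICAL_SECTOR_SIZE


def logical_to_physical(lba, count):
    """Return (first physical sector, number of physical sectors) that
    cover `count` logical sectors starting at `lba`.

    Assumption: LBA 0 is mapped to the start of physical sector 0.
    """
    first = lba // LOGICAL_PER_PHYSICAL
    last = (lba + count - 1) // LOGICAL_PER_PHYSICAL
    return first, last - first + 1


# 32 KiB at the 1 MiB position, as in the background example.
lba, count = bytes_to_lba(1024 * 1024, 32 * 1024)
print(lba, count)                       # 2048 64
print(logical_to_physical(lba, count))  # (256, 8)
```
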
>
> A straightforward example mapping between physical sector and LBA
> would be
>
>   LBA = 8 * phys_sect
>
>
> Alignment problem on 4 KiB physical / 512 logical drives
> ========================================================
>
> This workaround keeps older hardware and software working while
> allowing the drive to use a larger sector size internally. However,
> the discrepancy between physical and logical sector sizes creates an
> alignment issue. For example, if the driver wants to read 7 sectors
> from LBA 2047, the firmware has to read hardware sectors 255 and 256
> and trim the leading 7*512 bytes and the trailing 2*512 bytes.
>
> For reads, this isn't an issue as drives read in larger chunks anyway
> but for writes, the drive has to do read-modify-write to achieve the
> requested action. It has to first read hardware sectors 255 and 256,
> update the requested parts and then write back those sectors, which
> can cause significant performance degradation[2].
>
> The problem is aggravated by the way DOS partitions[3] have been laid
> out traditionally. For reasons dating back more than two decades,
> they are laid out considering something called disk geometry, which
> nowadays consists of arbitrary values with a number of restrictions
> for backward compatibility accumulated over the years. The end result
> is that until recently (most Linux variants and up to Windows XP) the
> first partition ends up on sector 63 and later ones on cylinder
> boundaries where each cylinder usually is composed of 255 * 63
> sectors.
>
> Most modern filesystems generate accesses which are 4 KiB aligned
> relative to the start of the partition they are in. If a drive maps
> 4 KiB physical sectors to 512 byte logical sectors from LBA 0, the
> filesystem in the first partition will always be misaligned and
> filesystems in later partitions are likely to be misaligned too.
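
The alignment problem described above can be checked with a few lines of Python. This is only a sketch under the assumption that physical sectors map to logical sectors starting from LBA 0 (no offset); the helper names are hypothetical and not taken from any existing tool.

```python
SECTORS_PER_PHYSICAL = 8          # 4096 / 512, assuming a 4 KiB physical sector
SECTORS_PER_CYLINDER = 255 * 63   # traditional geometry, 16065 sectors


def is_aligned(lba):
    """True if `lba` falls on a physical sector boundary (no offset assumed)."""
    return lba % SECTORS_PER_PHYSICAL == 0


def needs_read_modify_write(lba, count):
    """True if writing `count` logical sectors at `lba` only partially
    covers at least one physical sector."""
    return not is_aligned(lba) or not is_aligned(lba + count)


print(is_aligned(63))                      # False: traditional first partition
print(is_aligned(2048))                    # True: 1 MiB boundary
print(needs_read_modify_write(2047, 7))    # True
print(needs_read_modify_write(2048, 8))    # False

# 16065 is odd, so only every 8th cylinder boundary is 4 KiB aligned.
print([n for n in range(1, 17) if is_aligned(n * SECTORS_PER_CYLINDER)])  # [8, 16]
```
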
>
>
> Solving the alignment problem on 4 KiB physical / 512 logical drives
> ====================================================================
>
> There are multiple ways which attempt to solve the problem.
>
> S-1. Yet another workaround from the firmware - offset-by-one.
>
> Yet another workaround which can be done by the firmware is to
> offset the physical to logical mapping by one logical sector such that
> LBA 63 ends up on a physical sector boundary, which aligns the first
> partition to physical sectors without requiring any software update.
> The example mapping between phys_sect and LBA becomes
>
>   LBA = 8 * phys_sect - 1
>
> The leading 512 bytes of phys_sect 0 are not used and LBA 0 starts
> right after that point. phys_sect 1 maps to LBA 7 and phys_sect 8 to
> 63, making LBA 63 aligned on a hardware sector.
>
> Although this aligns only the first partition, for many use cases,
> especially the ones involving older software, this workaround was
> deemed useful and some recent drives with 4 KiB physical sectors are
> equipped with a dip switch to turn offset-by-one mapping on or off.
>
> S-2. The proper solution.
>
> Correct alignment for all partitions can't be achieved by the
> firmware alone. The system utilities should be informed about the
> alignment requirements and align partitions accordingly.
>
> The above firmware workaround complicates the situation because the
> two different configurations require different offsets to achieve
> the correct alignment. ATA/ATAPI-8 specifies a way for a drive to
> export the physical and logical sector sizes and the LBA offset
> which is aligned to the physical sectors.
>
> In Linux, these parameters are exported via the following sysfs
> nodes.
>
>   physical sector size : /sys/block/sdX/queue/physical_block_size
>   logical sector size  : /sys/block/sdX/queue/logical_block_size
>   alignment offset     : /sys/block/sdX/alignment_offset
>
> Let the physical sector size be PSS, logical sector size LSS and
> alignment offset AOFF. The system software should place partitions
> such that the starting LBAs of all partitions are aligned on
>
>   (n * PSS + AOFF) / LSS
>
> For 4 KiB physical sector offset-by-one drives, PSS is 4096, LSS 512
> and AOFF 3584 and with n of 7 the above becomes,
>
>   (7 * 4096 + 3584) / 512 == 63
>
> making sector 63 an aligned LBA where the first partition can be
> put, but without the offset-by-one mapping, AOFF is zero and LBA 63
> is not aligned.
>
> With the above new alignment requirement in place, it becomes
> difficult to honor the legacy one - first partition on sector 63 and
> all other partitions on cylinder boundaries (255 * 63 sectors) - as
> the two alignment requirements contradict each other. This might be
> worked around by adjusting how LBA and CHS addresses are mapped but
> the disk geometry parameters are hard coded everywhere and there is
> no reliable way to communicate custom geometry parameters.
>
>
> Complications
> =============
>
> Unfortunately, there are complications.
>
> C-1. The standard is not and won't be followed as-is.
>
> Some of the existing BIOSes and/or drivers can't cope with drives
> which report a 4 KiB physical sector size. To work around this, some
> drive models falsely report a physical sector size of 512 bytes when
> the actual configuration is 4 KiB without offsetting.
>
> This nullifies the provisions for alignment in the ATA standard but
> results in the correct alignment for Windows Vista and 7. OS
> behaviors are described further below.
>
> For these drives, which are likely to continue to be shipped for the
> foreseeable future, traditional LBA 63 and cylinder based aligning
> results in misalignment.
>
> C-2.
> Windows XP depends on the traditional partition layout.
>
> Windows XP makes use of the CHS start/end addresses in the partition
> table and gets confused if partitions are not laid out
> traditionally. This means that XP can't be installed into a
> partition prepared by later versions of Windows[4]. This isn't a
> big problem for Windows because in most cases the later version is
> replacing the older one, not the other way around.
>
> Unfortunately, the situation is more complex for Linux because Linux
> is often co-installed with various versions of Windows and XP is
> still quite popular. This means that when a Linux partitioner is
> used to prepare a partition which may be used by Windows, the
> partitioner might have to consider which version of Windows is going
> to be used and whether to lay out the partitions for correct
> alignment or for compatibility with older versions of Windows.
>
> C-3. The 2 TiB barrier and the possibility for 4 KiB logical sector size.
>
> The DOS partition format uses 32 bits each for the starting LBA and
> the number of sectors and, reportedly, 32 bit Windows XP shares the
> limitation. With 32 bit addressing and 512 byte logical sector
> size, the maximum addressable sector + 1 is at
>
>   2^32 * 2^9 == 2^41 == 2 TiB
>
> The DOS partition format allows a partition to reach beyond 2 TiB as
> long as the starting LBA is under 2 TiB; however, both Windows XP
> and the Linux kernel (at least up to v2.6.33) refuse such
> partition configurations.
>
> With the right combination of host controller, BIOS and driver, this
> barrier can be overcome by enlarging the logical sector size to 4
> KiB, which will push the barrier out to 16 TiB. On the right
> configuration, Windows XP is reportedly able to address beyond the 2
> TiB barrier with a DOS partition and 4 KiB logical sector size.
> Linux kernels up to v2.6.33 don't work under such configurations but
> a patch to make it work is pending[5].
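
The barrier arithmetic above is easy to reproduce. The Python snippet below is just a sketch of that calculation; `max_addressable_bytes` is a name invented for this example, and the 32 bit width is the DOS partition table field size described in C-3.

```python
TIB = 1 << 40  # bytes in one TiB


def max_addressable_bytes(logical_sector_size, lba_bits=32):
    """Capacity reachable with `lba_bits` wide sector numbers.

    lba_bits=32 models the DOS partition table fields for the starting
    LBA and the number of sectors.
    """
    return (1 << lba_bits) * logical_sector_size


print(max_addressable_bytes(512) // TIB)    # 2
print(max_addressable_bytes(4096) // TIB)   # 16
```
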
>
> This might also be beneficial for operating systems which don't
> suffer from this limitation, because otherwise a different partition
> format - GPT[6] - has to be used beyond 2^32 sectors, which could harm
> compatibility with older BIOSes or other operating systems which
> don't recognize the new format.
>
> As mentioned previously, the 512 byte sector assumption has been there
> for a very long time and changing it is likely to cause various
> compatibility problems at many different layers from hardware up to
> the system utilities.
>
>
> Windows
> =======
>
> As hard drive vendors aim for performance and compatibility in modern
> Windows environments, it is worthwhile to investigate how Windows
> partitions drives with different alignment requirements. Up until
> Windows XP, it followed the traditional layout - the first partition
> on LBA 63 and the others on cylinder boundaries where a cylinder is
> defined as 255 tracks with 63 sectors each.
>
> Windows Vista and 7 align partitions differently. As the two behave
> similarly, only 7's behavior is shown here. These partition tables
> were created by the Windows 7 RC installer on blank disks.
>
> W-1. 512 byte physical and logical sector drive.
>
>   ST FIRST  T  LAST   LBA      NBLKS
>   80 202100 07 df130c 00080000 00200300
>   00 df140c 07 feffff 00280300 00689e12
>   00 000000 00 000000 00000000 00000000
>   00 000000 00 000000 00000000 00000000
>
>   Part0: FIRST C    0 H  32 S 33 : 2048   (63 sec/trk)
>          LAST  C   12 H 223 S 19 : 206847 (255 heads/cyl)
>          LBA 2048 + 204800 = 206848
>
>   Part1: FIRST C   12 H 223 S 20 : 206848
>          LAST  C 1023 H 254 S 63 : E
>          LBA 206848 + 312371200 = 312578048
>
> Both aligned at (2048 * n). Part 1 not aligned to cylinder.
>
> W-2. 4 KiB physical and 512 byte logical sector drive without offset-by-one.
>
>   ST FIRST  T  LAST   LBA      NBLKS
>   80 202100 07 df130c 00080000 00200300
>   00 df140c 07 feffff 00280300 00b83f25
>   00 000000 00 000000 00000000 00000000
>   00 000000 00 000000 00000000 00000000
>
>   Part0: FIRST C    0 H  32 S 33 : 2048   (63 sec/trk)
>          LAST  C   12 H 223 S 19 : 206847 (255 heads/cyl)
>          LBA 2048 + 204800 = 206848
>
>   Part1: FIRST C   12 H 223 S 20 : 206848
>          LAST  C 1023 H 254 S 63 : E
>          LBA 206848 + 624932864 = 625139712
>
> Both aligned at (2048 * n). Part 1 not aligned to cylinder.
>
> W-3. 4 KiB physical and 512 byte logical sector drive with offset-by-one.
>
>   ST FIRST  T  LAST   LBA      NBLKS
>   80 202800 07 df130c 07080000 f91f0300
>   00 df1b0c 07 feffff 07280300 f9376d74
>   00 000000 00 000000 00000000 00000000
>   00 000000 00 000000 00000000 00000000
>
>   Part0: FIRST C    0 H  32 S 40 : 2055   (63 sec/trk)
>          LAST  C   12 H 223 S 19 : 206847 (255 heads/cyl)
>          LBA 2055 + 204793 = 206848
>
>   Part1: FIRST C   12 H 223 S 27 : 206855
>          LAST  C 1023 H 254 S 63 : E
>          LBA 206855 + 1953314809 = 1953521664
>
> Both aligned at (2048 * n + 7). Part 1 not aligned to cylinder.
>
> The partitioner seems to be using 1 MiB as the basic alignment unit
> and offsetting from there if explicitly requested by the drive, and
> there is no difference between the handling of 512 byte and 4 KiB
> drives, which explains why C-1 works for hard drive vendors.
>
> In all cases, the partitioner ignores both the first partition on LBA
> 63 and the others on cylinder boundary requirements while still using
> the same 255*63 cylinder size. Also, note that in W-3, both part 0
> and 1 end up with an odd number of sectors. It seems that they simply
> decided to completely break away from the traditional layout, which is
> understandable given that there really isn't one good solution which
> can cover all the cases and that the default larger alignment benefits
> earlier SSDs.
>
> Windows Vista basically shows the same behavior. Vista was tested by
> creating two partitions using the management tool.
> Test data is available at [7].
>
>   *-alignment_offset : alignment_offset reported by Linux kernel
>   *-fdisk            : fdisk -l output
>   *-fdisk-u          : fdisk -lu output
>   *-hdparm           : hdparm -I output
>   *-mbr              : dump of mbr
>   *-part             : decoded partition table from mbr
>
> Please note that hdparm is misreporting the alignment offset. It
> should be reporting 512 instead of 256 for offset-by-one drives.
>
>
> So, what now for Linux?
> =======================
>
> The situation is not easy. Considering all the factors, the only
> workable solution looks like doing what Windows is doing. Hard drive
> and SSD vendors are focusing on compatibility and performance on
> recent Windows releases and are happy to do things which break the
> standard defined mechanism as shown by C-1, so departing from what
> Windows does would be unnecessarily painful.
>
> Unfortunately, while Windows can assume that newer releases won't
> share the hard drive with older releases including Windows XP, Linux
> distros can't do that. There will be many installations where a
> modern Linux distro shares a hard drive with older releases of
> Windows. At this point, I can't see a silver bullet solution.
>
> Partitioners should perhaps only align partitions which will be used
> by Linux and default to the traditional layout for others while
> allowing explicit override. I think Windows XP wouldn't have problems
> with differently aligned partitions as long as it doesn't actually use
> them but I haven't tested it.
>
> Reportedly, commonly used partitioners aren't ready to handle drives
> larger than 2 TiB in any configuration and alignment isn't done
> properly for drives with 4 KiB physical sectors. 4 KiB logical sector
> support is broken in both the kernel and partitioners. (need more
> details and probably a whole section on partitioner behaviors)
>
> Unfortunately, the transition to 4 KiB sector size, physical only or
> logical too, is looking fairly ugly.
> Hopefully, a reasonable solution can be reached in the not too
> distant future but even with all the software side updated, it looks
> like it's gonna cause a significant amount of confusion and
> frustration.
>
>
> [1] http://www.anandtech.com/storage/showdoc.aspx?i=3691
> [2] http://www.osnews.com/story/22872/Linux_Not_Fully_Prepared_for_4096-Byte_Sector_Hard_Drives
> [3] http://en.wikipedia.org/wiki/Master_boot_record
> [4] http://support.microsoft.com/kb/931760
> [5] http://thread.gmane.org/gmane.linux.kernel/953981
> [6] http://en.wikipedia.org/wiki/GUID_Partition_Table
> [7] http://userweb.kernel.org/~tj/partalign/
>
> * Mar 04 2009
>   Initial draft, Tejun Heo <tj@xxxxxxxxxx>
> * Mar 08 2009
>   Updated according to comments from Daniel Taylor
>   <Daniel.Taylor@xxxxxxx>. Other minor updates.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ide" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
Preservation and Forensic processing of Exchange Repositories White Paper -
<http://www.norcrossgroup.com/forms/whitepapers/tng_whitepaper_fpe.html>

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com