On 11/13/13, 3:26 PM, Dave Chinner wrote:
> On Wed, Nov 13, 2013 at 01:08:30PM -0600, Eric Sandeen wrote:
>> On 11/13/13, 12:56 PM, Christoph Hellwig wrote:
>>> On Wed, Nov 13, 2013 at 12:25:33PM -0600, Eric Sandeen wrote:
>>>> Pure RFC; this might be crazy.  Here's the problem I'm trying to solve:
>>>>
>>>> Today, mkfs.xfs will select a 4k sector size for a 4k physical / 512 logical
>>>> drive.  (that change was done by me).  The thought was that it'd be an
>>>> efficiency gain to not make the drive do the (possible) RMW cycles on
>>>> 512-byte log IO, primarily.
>>>>
>>>> However, now this restricts all DIO to 4k alignment, not the
>>>> otherwise-possible 512.
>>>>
>>>> This came up when qemu-kvm, in cache=none mode, tries to boot off an
>>>> image hosted on such a filesystem, and its bios wants to do a 512 byte
>>>> direct IO read off the disk - it fails.
>>>>
>>>> But I'm wondering - the buftarg's bt_sshift and bt_smask are only used
>>>> in a few places.
>>>
>>> No need to mess with kernel code IFF we want to change that, just keep
>>> the sector size at 512 bytes and set a log stripe unit at mkfs time.
>>>
>>> I have to admit that I'm not really sure if that's what we really want,
>>> though.  A drive that has a larger physical block size will need
>>> read-modify-write cycles internally, which we try to avoid.
>>
>> Yeah, the problem comes up when it is 100% impossible to boot a
>> qemu-kvm guest hosted on such a filesystem/drive.  :(
>
> No it's not. Just use cache=writethrough and the page cache will
> take care of the mismatch when it occurs.

Sorry, I meant impossible w/ cache=none.  TBH, I don't know what
best practice is.

>> (of course I guess that means it fails on a hard 4k drive too)
>
> And on any other filesystem that thinks it has sectors larger than
> 512 bytes underlying it (e.g. cdrom has a 2k sector size).
>
>> I don't know what the guest sees for logical/physical on its
>> file-backed block device in these cases.
>
> Seems like that's the avenue for improvement here to me. i.e. expose
> the correct values to the guest so its mkfs does the right thing.
> Or, alternatively, make qemu buffer non-aligned/sized IOs itself
> internally.

The guest never _boots_ - it's not a guest mkfs issue.  The guest bios
wants to read 512 via DIO off the image on this 4k sector FS, and fails.

> After all, it has been told to use direct IO, and when that happens
> it is the application's responsibility to ensure IO alignment
> requirements are met...

Agreed, but in talking to a qemu guy...

"In my understanding, that's a limitation that directly comes from the
BIOS interface."

"int 13h just assumes 512 bytes"

But this is above my pay grade.  I don't speak BIOS.

>> Anyway, if we took your suggestion, normal internal fs operations
>> (log IO) wouldn't RMW.  But we'd still presumably advertise and allow
>> smaller DIO sizes, which are inefficient.  We could advertise 4k, but
>> still allow 512 for less-smart apps, maybe?
>
> I'd say such a problem is a matter of user education and making qemu
> aware of logical/physical differences - hacking weird corner cases
> into what a sector size means is only going to lead to confusion and
> bite us in unexpected ways...

Probably so; hence the "crazy" disclaimer.  ;)

But it does seem a little odd to semi-artificially reject DIOs which
the drive could actually handle.

Indeed, do_blockdev_direct_IO looks right at the logical block size,
and allows it:

        if (offset & blocksize_mask) {
                if (bdev)
                        blkbits = blksize_bits(bdev_logical_block_size(bdev));
                blocksize_mask = (1 << blkbits) - 1;
                if (offset & blocksize_mask)
                        goto out;
        }

it's our checks in XFS that fail.

-Eric

> Cheers,
>
> Dave.
>

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
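
For anyone following along, here is a rough userspace sketch of the
mask arithmetic being discussed.  It is not kernel or XFS code:
dio_aligned() is a made-up helper, and checking the length along with
the offset is an assumption for illustration.  It only shows why the
same 512-byte request passes against a 512-byte sector size and is
rejected against a 4k one.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (name and signature invented for illustration).
 * Returns true when both the offset and the length of a request are
 * multiples of sector_size, which must be a power of two.
 */
bool dio_aligned(uint64_t offset, uint64_t len, unsigned int sector_size)
{
	/* Same idea as blocksize_mask = (1 << blkbits) - 1 above. */
	uint64_t mask = (uint64_t)sector_size - 1;

	/* Any low bit set in either value means the request is unaligned. */
	return ((offset | len) & mask) == 0;
}
```

With that, dio_aligned(0, 512, 512) is true (the drive's logical
sector size), while dio_aligned(0, 512, 4096) is false (the 4k sector
size mkfs.xfs picked), which is the mismatch the guest bios read hits.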