Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 14 Nov 2013 08:26:58 +1100

On Wed, Nov 13, 2013 at 01:08:30PM -0600, Eric Sandeen wrote:
> On 11/13/13, 12:56 PM, Christoph Hellwig wrote:
> > On Wed, Nov 13, 2013 at 12:25:33PM -0600, Eric Sandeen wrote:
> >> Pure RFC; this might be crazy.  Here's the problem I'm trying to solve:
> >>
> >> Today, mkfs.xfs will select a 4k sector size for a 4k physical / 512 logical
> >> drive.  (that change was done by me).  The thought was that it'd be an
> >> efficiency gain to not make the drive do the (possible) RMW cycles on
> >> 512-byte log IO, primarily.
> >>
> >> However, now this restricts all DIO to 4k alignment, not the otherwise-
> >> possible 512.
> >>
> >> This came up when qemu-kvm, in cache=none mode, tries to boot off an
> >> image hosted on such a filesystem, and its bios wants to do a 512 byte
> >> direct IO read off the disk - it fails.
> >>
> >> But I'm wondering - the buftarg's bt_sshift and bt_smask are only used
> >> in a few places.  
> > 
> > No need to mess with kernel code IFF we want to change that, just keep
> > the sector size at 512 bytes and set a log stripe unit at mkfs time.
> > 
> > I have to admit that I'm not really sure if that's what we really want,
> > though.  A drive that has a larger physical block size will need
> > read-modify-write cycles internally, which we try to avoid.
> 
> Yeah, the problem comes up when it is 100% impossible to boot a
> qemu-kvm guest hosted on such a filesystem/drive.  :(

No, it's not. Just use cache=writethrough and the page cache will
take care of the mismatch when it occurs.

> (of course I guess that means it fails on a hard 4k drive too)

And on any other filesystem that thinks it has sectors larger than
512 bytes underlying it (e.g. cdrom has a 2k sector size).

> I don't know what the guest sees for logical/physical on its
> file-backed block device in these cases.

Seems like that's the avenue for improvement here to me. i.e. expose
the correct values to the guest so its mkfs does the right thing.
Or, alternatively, make qemu buffer non-aligned/sized IOs itself
internally.
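
As a rough illustration of the internal buffering option, the sketch
below computes the smallest sector-aligned range that covers an
unaligned request, so the caller could read that range into a bounce
buffer and copy out the bytes that were actually asked for. The names
dio_span and dio_cover are hypothetical and are not taken from qemu or
XFS; the sector size is assumed to be a power of two supplied by the
caller.

```c
#include <stdint.h>

/* Hypothetical helper type: an aligned range covering a request. */
struct dio_span {
    uint64_t start;   /* aligned start offset of the covering range */
    uint64_t length;  /* aligned length of the covering range */
    uint64_t skip;    /* bytes from the aligned start to the requested offset */
};

/*
 * Compute the smallest sector-aligned range that contains
 * [off, off + len). Assumes sector is a power of two
 * (512, 2048, 4096, ...).
 */
static struct dio_span dio_cover(uint64_t off, uint64_t len, uint64_t sector)
{
    uint64_t mask = sector - 1;
    uint64_t end = (off + len + mask) & ~mask;  /* round the end up */
    struct dio_span s;

    s.start = off & ~mask;                      /* round the start down */
    s.length = end - s.start;
    s.skip = off - s.start;
    return s;
}
```

For the 512-byte read discussed in this thread, a request at offset 512
against a 4k sector size would be covered by one 4096-byte read at
offset 0, with the wanted data starting 512 bytes into the buffer.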

After all, it has been told to use direct IO, and when that happens
it is the application's responsibility to ensure IO alignment
requirements are met...
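
A minimal sketch of what that responsibility looks like from the
application side: check the buffer address, length and file offset
against the sector size before issuing the direct IO. The function name
dio_request_ok is hypothetical, and requiring all three to be aligned
to the same sector size is a conservative assumption rather than a
statement of the exact checks the kernel performs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if a direct IO request is aligned to the given sector
 * size. Assumes sector is a power of two. This is a conservative
 * application-side check (an assumption for illustration), treating
 * memory address, length and file offset alike.
 */
static bool dio_request_ok(const void *buf, size_t len, uint64_t off,
                           size_t sector)
{
    size_t mask = sector - 1;

    return ((uintptr_t)buf & mask) == 0 &&   /* buffer address */
           (len & mask) == 0 &&              /* transfer length */
           (off & (uint64_t)mask) == 0;      /* file offset */
}
```

With a 4k sector size, a 512-byte request fails this check, while the
same request passes when the sector size is 512.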

> Anyway, if we took your suggestion, normal internal fs operations
> (log IO) wouldn't RMW.  But we'd still presumably advertise and allow
> smaller DIO sizes, which are inefficient.  We could advertise 4k, but
> still allow 512 for less-smart apps, maybe?

I'd say such a problem is a matter of user education and making qemu
aware of logical/physical differences - hacking weird corner cases
into what a sector size means is only going to lead to confusion and
bite us in unexpected ways...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
