On 10/5/18 6:27 AM, Ilya Dryomov wrote:
> On Fri, Oct 5, 2018 at 12:29 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>
>> On Thu, Oct 04, 2018 at 01:33:12PM -0500, Eric Sandeen wrote:
>>> On 10/4/18 12:58 PM, Ilya Dryomov wrote:
>>>> rbd devices report the following geometry:
>>>>
>>>> $ blockdev --getss --getpbsz --getiomin --getioopt /dev/rbd0
>>>> 512
>>>> 512
>>>> 4194304
>>>> 4194304
>>
>> dm-thinp does this as well.  This is from the thinp device created
>> by tests/generic/459:
>>
>> 512
>> 4096
>> 65536
>> 65536
>
> (adding Mike)
>
> ... and that 300M filesystem ends up with 8 AGs, when normally you get
> 4 AGs for anything less than 4T.  Is that really intended?

Well, yes.  Multi-disk mode gives you more AGs; how many more scales
with filesystem size.

/*
 * For the multidisk configs we choose an AG count based on the number
 * of data blocks available, trying to keep the number of AGs higher
 * than the single disk configurations. This makes the assumption that
 * larger filesystems have more parallelism available to them.
 */

For really tiny filesystems we cut down the number of AGs, but in
general, if the storage "told" us it has parallelism, mkfs uses it by
default.
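The heuristic is roughly shaped like this (an illustrative sketch, not
the actual calc_default_ag_geometry() code; the helper name and break
points here are made up, chosen to match the sizes discussed above):

#include <stdbool.h>
#include <stdint.h>

/* bytes-to-filesystem-blocks helpers, assuming blocklog <= 20 */
#define MEGABYTES(n, blocklog)	((uint64_t)(n) << (20 - (blocklog)))
#define GIGABYTES(n, blocklog)	((uint64_t)(n) << (30 - (blocklog)))

/*
 * Illustrative only: pick log2 of the AG count.  A "multidisk"
 * device starts from a higher baseline than a single disk, scaled
 * down as the filesystem shrinks.  (The very-large-fs cases are
 * omitted for brevity.)
 */
static int
ag_count_log2(uint64_t dblocks, int blocklog, bool multidisk)
{
	if (!multidisk)
		return 2;			/* the usual 4 AGs */
	if (dblocks >= GIGABYTES(512, blocklog))
		return 5;			/* 32 AGs */
	if (dblocks >= GIGABYTES(8, blocklog))
		return 4;			/* 16 AGs */
	if (dblocks >= MEGABYTES(128, blocklog))
		return 3;			/* 8 AGs */
	return 2;				/* tiny fs: back to 4 AGs */
}

With these illustrative break points, the 300M filesystem from
generic/459 lands in the 8-AG bucket once multidisk mode triggers,
while a single-disk device of the same size stays at 4.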
> AFAIK dm-thinp reports these values for the same exact reason as rbd:
> we are passing up the information about the efficient I/O size.  In the
> case of dm-thinp, this is the thinp block size.  If you put dm-thinp on
> top of a RAID array, I suspect it would pass up the array's preferred
> sizes, as long as they are a proper factor of the thinp block size.
>
> The high agcount on dm-thinp has come up before and you suggested that
> dm-thinp should report iomin == ioopt (i.e. sunit == swidth).  If that
> was the right fix back in 2014, mkfs.xfs must have regressed:
>
> https://marc.info/?l=linux-xfs&m=137783388617206&w=2

Well, that question started out as being about higher AG counts.  And
it did come from the stripe geometry advertised, but I don't think
Dave's suggestion was intended to keep the AG count lower; it was just
explaining that the values seemed wrong.

It was a bit of an odd situation: the existence of stripe geometry
bumped up the AG count, but then mkfs decided that a 512 byte "stripe
unit" wasn't legit and set it back to zero, while keeping the higher
AG count.  So there are a few related issues there.

> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fdfb4c8c1a9fc8dd8cf8eeb4e3ed83573b375285

So that sets them to the same size, but I don't think that change was
ever expected to change the AG count heuristic behavior, which has
been there for probably a decade or more.

>> And I've also seen some hardware raid controllers do this, too,
>> because they only expose the stripe width in their enquiry page
>> rather than stripe unit and stripe width.

(which should be considered semi-broken hardware, no?)

>> IOWs, this behaviour isn't really specific to Ceph's rbd device, and
>> it does occur on multi-disk devices that have something layered over
>> the top (dm-thinp, hardware raid, etc). As such, I don't think
>> there's a "one size fits all" solution and so someone is going to
>> have to tweak mkfs settings to have it do the right thing for their
>> storage subsystem....

> FWIW I was surprised to see that calc_default_ag_geometry() doesn't
> look at swidth and just assumes that there will be "more parallelism
> available".  I expected it to be based on swidth to sunit ratio (i.e.
> sw).  sw is supposed to be the multiplier equal to the number of
> data-bearing disks, so it's the first thing that comes to mind for
> a parallelism estimate.
>
> I'd argue that hardware RAID administrators are much more likely to pay
> attention to the output of mkfs.xfs and be able to tweak the settings
> to work around broken controllers that only expose stripe width.

Yeah, this starts to get a little philosophical.  We don't want to
second-guess the geometry, or try to figure out what the raid array
"really meant" if it's sending weird numbers. [1]

But at the end of the day, it seems reasonable to always apply the
"swidth/sunit = number of data disks" rule (which we apply in reverse
when we tell people how to manually figure out stripe widths) and stop
treating sunit == swidth as any indication of parallelism.

Dave, do you have any problem with changing the behavior to only go
into multidisk mode if swidth > sunit?  The more I think about it, the
more it makes sense to me.

-Eric

[1] for example, 99a3f52 mkfs: avoid divide-by-zero when hardware
reports optimal i/o size as 0
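P.S.  To make the proposal concrete, the gate I have in mind looks
something like this untested sketch (hypothetical helper name, not a
patch against current xfsprogs):

#include <stdbool.h>

/*
 * Sketch only: treat the device as "multidisk" (and scale up the AG
 * count) only when the advertised geometry actually implies multiple
 * data-bearing disks, i.e. swidth is a multiple > 1 of sunit.
 * sunit/swidth are in 512-byte sectors, as mkfs uses them.
 */
static bool
geometry_implies_parallelism(int sunit, int swidth)
{
	if (sunit <= 0 || swidth <= 0)
		return false;		/* no geometry advertised */
	return swidth > sunit;		/* sw > 1 => multiple disks */
}

With that, rbd's iomin == ioopt == 4M (sunit == swidth == 8192
sectors) would no longer trigger multidisk mode, while e.g. a 4+1
RAID5 advertising swidth = 4 * sunit still would.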