On Mon, 21 Oct 2013, Mike Snitzer wrote:
> On Mon, Oct 21 2013 at 12:02pm -0400,
> Sage Weil <sage@xxxxxxxxxxx> wrote:
>
> > On Mon, 21 Oct 2013, Mike Snitzer wrote:
> > > On Mon, Oct 21 2013 at 10:11am -0400,
> > > Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> > >
> > > > On Sun, Oct 20, 2013 at 08:58:58PM -0700, Sage Weil wrote:
> > > > > It looks like without LVM we're getting 128KB requests (which IIRC is
> > > > > typical), but with LVM it's only 4KB. Unfortunately my memory is a bit
> > > > > fuzzy here, but I seem to recall a property on the request_queue or device
> > > > > that affected this. RBD is currently doing
> > > >
> > > > Unfortunately most device mapper modules still split all I/O into 4k
> > > > chunks before handling them. They rely on the elevator to merge them
> > > > back together down the line, which isn't overly efficient but should at
> > > > least provide larger segments for the common cases.
> > >
> > > It isn't DM that splits the IO into 4K chunks; it is the VM subsystem,
> > > no? Unless care is taken to assemble larger bios (higher up the IO
> > > stack, e.g. in XFS), all buffered IO will come to bio-based DM targets
> > > in $PAGE_SIZE granularity.
> > >
> > > I would expect direct IO to perform better here because it will make use
> > > of bio_add_page to build up larger IOs.
> >
> > I do know that we regularly see 128 KB requests when we put XFS (or
> > whatever else) directly on top of /dev/rbd*.
>
> Should be pretty straightforward to identify any limits that are
> different by walking sysfs/queue, e.g.:
>
> grep -r . /sys/block/rbdXXX/queue
> vs
> grep -r . /sys/block/dm-X/queue
>
> Could be there is an unexpected difference. For instance, there was
> this fix recently: http://patchwork.usersys.redhat.com/patch/69661/
>
> > > Taking a step back, the rbd driver is exposing both the minimum_io_size
> > > and optimal_io_size as 4M. This symmetry will cause XFS to _not_ detect
> > > the exposed limits as striping.
> > > Therefore, AFAIK, XFS won't take steps
> > > to respect the limits when it assembles its bios (via bio_add_page).
> > >
> > > Sage, any reason why you don't use traditional raid geometry based IO
> > > limits?, e.g.:
> > >
> > > minimum_io_size = raid chunk size
> > > optimal_io_size = raid chunk size * N stripes (aka full stripe)
> >
> > We are... by default we stripe 4M chunks across 4M objects. You're
> > suggesting it would actually help to advertise a smaller minimum_io_size
> > (say, 1MB)? This could easily be made tunable.
>
> You're striping 4MB chunks across 4 million stripes?
>
> So the full stripe size in bytes is 17592186044416 (or 16TB)? Yeah
> cannot see how XFS could make use of that ;)

Sorry, I mean the stripe count is effectively 1. Each 4MB gets mapped to
a new 4MB object (for a total of image_size / 4MB objects). So I think
minimum_io_size and optimal_io_size are technically correct in this case.

sage
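
To make the arithmetic in this exchange concrete, here is a minimal Python sketch of the geometry being discussed. It is not rbd or XFS code; the function names and the OBJECT_SIZE constant are made up for illustration, and it assumes "4M" and "4MB" mean 4 MiB and that "4 million stripes" means 4 * 2^20. It computes the full stripe size from the raid-style formula quoted above (chunk size * N stripes) and maps an image offset to an object when the stripe count is effectively 1.

```python
MIB = 1024 * 1024
OBJECT_SIZE = 4 * MIB  # assumption: the default 4M object size from the thread


def full_stripe_bytes(chunk_bytes, stripe_count):
    """Raid-style full stripe: chunk size * N stripes."""
    return chunk_bytes * stripe_count


def object_count(image_bytes, object_bytes=OBJECT_SIZE):
    """Number of objects backing an image (image_size / object size, rounded up)."""
    return -(-image_bytes // object_bytes)


def locate(offset, object_bytes=OBJECT_SIZE):
    """Map an image byte offset to (object index, offset inside that object).

    With a stripe count of 1, consecutive 4M ranges simply land in
    consecutive objects, so this is a plain divide and remainder.
    """
    return offset // object_bytes, offset % object_bytes
```

With these definitions, full_stripe_bytes(4 * MIB, 4 * MIB) gives the 17592186044416 figure from the misreading of "4M chunks across 4M objects", while full_stripe_bytes(OBJECT_SIZE, 1) is just 4 MiB, which is why minimum_io_size and optimal_io_size come out equal.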