On Tue, Nov 15, 2016 at 03:41:58PM -0700, Jens Axboe wrote:
> On 11/15/2016 03:27 PM, Johannes Weiner wrote:
> > Hi Jens,
> >
> > On Thu, Nov 10, 2016 at 10:00:37AM -0700, Jens Axboe wrote:
> > > Hi,
> > >
> > > We ran into a funky issue, where someone doing 256K buffered reads saw
> > > 128K requests at the device level. Turns out it is read-ahead capping
> > > the request size, since we use 128K as the default setting. This doesn't
> > > make a lot of sense - if someone is issuing 256K reads, they should see
> > > 256K reads, regardless of the read-ahead setting.
> > >
> > > To make matters more confusing, there's an odd interaction with the
> > > fadvise hint setting. If we tell the kernel we're doing sequential IO on
> > > this file descriptor, we can get twice the read-ahead size. But if we
> > > tell the kernel that we are doing random IO, hence disabling read-ahead,
> > > we do get nice 256K requests at the lower level. An application
> > > developer will be, rightfully, scratching his head at this point,
> > > wondering wtf is going on. A good one will dive into the kernel source,
> > > and silently weep.
> > >
> > > This patch introduces a bdi hint, io_pages. This is the soft max IO size
> > > for the lower level, I've hooked it up to the bdev settings here.
> > > Read-ahead is modified to issue the maximum of the user request size,
> > > and the read-ahead max size, but capped to the max request size on the
> > > device side. The latter is done to avoid reading ahead too much, if the
> > > application asks for a huge read. With this patch, the kernel behaves
> > > like the application expects.
> > >
> > > diff --git a/block/blk-settings.c b/block/blk-settings.c
> > > index f679ae122843..65f16cf4f850 100644
> > > --- a/block/blk-settings.c
> > > +++ b/block/blk-settings.c
> > > @@ -249,6 +249,7 @@ void blk_queue_max_hw_sectors(struct request_queue *q, unsigned int max_hw_secto
> > >  	max_sectors = min_not_zero(max_hw_sectors, limits->max_dev_sectors);
> > >  	max_sectors = min_t(unsigned int, max_sectors, BLK_DEF_MAX_SECTORS);
> > >  	limits->max_sectors = max_sectors;
> > > +	q->backing_dev_info.io_pages = max_sectors >> (PAGE_SHIFT - 9);
> >
> > Could we simply set q->backing_dev_info.ra_pages here? This would
> > start the disk out with a less magical readahead setting than the
> > current 128k default, while retaining the ability for the user to
> > override it in sysfs later on. Plus, one less attribute to juggle.
>
> We could, but then we'd have two places that tweak the same knob. I
> think it's perfectly valid to have the read-ahead size be bigger than
> the max request size, if you want some pipelining, for instance.

I'm not sure I follow. Which would be the two places and which knob?

What I meant was that it could work like this: when the queue gets
allocated, we set ra_pages to the hard-coded 128K, like we do right
now. When the driver initializes and calls blk_queue_max_hw_sectors(),
it would set ra_pages to the more informed, device-optimized
max_sectors >> (PAGE_SHIFT - 9). And once it's all initialized, the
user can still make adjustments to the default we picked in the
kernel heuristic.

> The 128k default is silly, though, that should be smarter. It should
> probably default to the max request size.

Could you clarify the difference between max request size and what
blk_queue_max_hw_sectors() sets? The way I understood your patch is
that we want to use a readahead cap that's better suited to the
underlying IO device than the magic 128K. What am I missing?
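To make that ordering concrete, here is a toy userspace model of it.
This is purely illustrative, not kernel code; the toy_* names, the 4K
page size, and the 512-sector example value are made up for the sketch:

#include <stdio.h>

/*
 * Toy model of the initialization order described above. Purely
 * illustrative, not kernel code; all names and values are made up.
 */
#define TOY_SECTOR_SHIFT	9
#define TOY_PAGE_SHIFT		12		/* assume 4K pages */
#define TOY_DEFAULT_RA_KB	128		/* today's hard-coded default */

struct toy_bdi {
	unsigned long ra_pages;
};

/* step 1: queue allocation seeds the magic 128K default */
static void toy_alloc_queue(struct toy_bdi *bdi)
{
	bdi->ra_pages = (TOY_DEFAULT_RA_KB * 1024) >> TOY_PAGE_SHIFT;
}

/* step 2: driver init replaces it with a device-informed value */
static void toy_queue_max_hw_sectors(struct toy_bdi *bdi, unsigned int max_sectors)
{
	bdi->ra_pages = max_sectors >> (TOY_PAGE_SHIFT - TOY_SECTOR_SHIFT);
}

/* step 3: the user can still override the default later (sysfs in real life) */
static void toy_set_ra_kb(struct toy_bdi *bdi, unsigned long kb)
{
	bdi->ra_pages = (kb * 1024) >> TOY_PAGE_SHIFT;
}

int main(void)
{
	struct toy_bdi bdi;

	toy_alloc_queue(&bdi);
	printf("after alloc:       ra_pages=%lu (%lu KB)\n",
	       bdi.ra_pages, bdi.ra_pages << (TOY_PAGE_SHIFT - 10));

	toy_queue_max_hw_sectors(&bdi, 512);	/* e.g. a 256K max request size */
	printf("after driver init: ra_pages=%lu (%lu KB)\n",
	       bdi.ra_pages, bdi.ra_pages << (TOY_PAGE_SHIFT - 10));

	toy_set_ra_kb(&bdi, 1024);		/* user bumps it to 1M */
	printf("after user tweak:  ra_pages=%lu (%lu KB)\n",
	       bdi.ra_pages, bdi.ra_pages << (TOY_PAGE_SHIFT - 10));
	return 0;
}

The point is just that whatever gets set last wins: the hard-coded
default, then the device-derived value at driver init, then anything
the user writes afterwards.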
> > > @@ -369,10 +369,18 @@ ondemand_readahead(struct address_space *mapping,
> > >  			   bool hit_readahead_marker, pgoff_t offset,
> > >  			   unsigned long req_size)
> > >  {
> > > -	unsigned long max = ra->ra_pages;
> > > +	unsigned long max_pages;
> > >  	pgoff_t prev_offset;
> > >
> > >  	/*
> > > +	 * Use the max of the read-ahead pages setting and the requested IO
> > > +	 * size, and then the min of that and the soft IO size for the
> > > +	 * underlying device.
> > > +	 */
> > > +	max_pages = max_t(unsigned long, ra->ra_pages, req_size);
> > > +	max_pages = min_not_zero(inode_to_bdi(mapping->host)->io_pages, max_pages);
> >
> > This code would then go away, and it would apply the benefit of this
> > patch automatically to explicit readahead(2) and FADV_WILLNEED calls
> > going through force_page_cache_readahead() as well.
>
> The path from the force actually works, which is why you get the weird
> behavior with a file marked as RANDOM getting the full request size, and
> not being limited by ra_pages.

How so? do_generic_file_read() calls page_cache_sync_readahead(), and
if the file is marked random it goes to force_page_cache_readahead():

void page_cache_sync_readahead(struct address_space *mapping,
			       struct file_ra_state *ra, struct file *filp,
			       pgoff_t offset, unsigned long req_size)
{
	/* no read-ahead */
	if (!ra->ra_pages)
		return;

	/* be dumb */
	if (filp && (filp->f_mode & FMODE_RANDOM)) {
		force_page_cache_readahead(mapping, filp, offset, req_size);
		return;
	}

	/* do read-ahead */
	ondemand_readahead(mapping, ra, filp, false, offset, req_size);
}

That function in turn still caps the reads to the default 128K ra_pages:

int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
			       pgoff_t offset, unsigned long nr_to_read)
{
	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
		return -EINVAL;

	nr_to_read = min(nr_to_read, inode_to_bdi(mapping->host)->ra_pages);
	while (nr_to_read) {
		int err;
		unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_SIZE;

		if (this_chunk > nr_to_read)
			this_chunk = nr_to_read;
		err = __do_page_cache_readahead(mapping, filp,
						offset, this_chunk, 0);
		if (err < 0)
			return err;

		offset += this_chunk;
		nr_to_read -= this_chunk;
	}
	return 0;
}

How could you get IO requests bigger than the 128K ra_pages there?
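For what it's worth, that capping behavior is easy to model outside
the kernel. A minimal userspace sketch, mirroring just the min()
against ra_pages and the 2MB chunking from the loop above (the toy_*
names and the example numbers are made up for illustration):

#include <stdio.h>

#define TOY_PAGE_SIZE	4096UL	/* assume 4K pages */

/*
 * Toy stand-in for force_page_cache_readahead(): returns how many pages
 * would actually be submitted for a request of nr_to_read pages, given
 * a readahead limit of ra_pages. Illustrative only, not kernel code.
 */
static unsigned long toy_force_readahead(unsigned long nr_to_read,
					 unsigned long ra_pages)
{
	unsigned long submitted = 0;

	/* this is the cap in question */
	if (nr_to_read > ra_pages)
		nr_to_read = ra_pages;

	while (nr_to_read) {
		unsigned long this_chunk = (2 * 1024 * 1024) / TOY_PAGE_SIZE;

		if (this_chunk > nr_to_read)
			this_chunk = nr_to_read;
		submitted += this_chunk;
		nr_to_read -= this_chunk;
	}
	return submitted;
}

int main(void)
{
	unsigned long req = 64;	/* a 256K read, in 4K pages */

	/* stock 128K default: the request gets halved */
	printf("ra_pages=32: %lu pages submitted\n",
	       toy_force_readahead(req, 32));

	/* ra_pages seeded from a 256K device limit: it passes through */
	printf("ra_pages=64: %lu pages submitted\n",
	       toy_force_readahead(req, 64));
	return 0;
}

With the stock 32-page (128K) limit the 64-page read is cut in half;
with ra_pages seeded from a 256K device limit it passes through, which
is the benefit I'd like readahead(2) and FADV_WILLNEED callers to get
as well.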