Re: Kernel RBD Readahead

On Mon, Aug 24, 2015 at 5:43 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> On Sun, Aug 23, 2015 at 10:23 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>> -----Original Message-----
>>> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
>>> Sent: 23 August 2015 18:33
>>> To: Nick Fisk <nick@xxxxxxxxxx>
>>> Cc: Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>
>>> Subject: Re: Kernel RBD Readahead
>>>
>>> On Sat, Aug 22, 2015 at 11:45 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>> > Hi Ilya,
>>> >
>>> > I was wondering if I could just get your thoughts on a matter I have
>>> > run into?
>>> >
>>> > It's about read performance of the RBD kernel client and blk-mq,
>>> > mainly when doing large single-threaded reads. During testing,
>>> > performance seems to be limited to around 40MB/s, which is roughly
>>> > what you would expect to get from a single OSD. This is to be
>>> > expected, as an RBD is just a long chain of objects, each on a
>>> > different OSD, which is being read through in order one at a time.
>>> >
>>> > In theory readahead should make up for this by making the RBD client
>>> > read from several OSDs ahead of the currently required block. However,
>>> > from what I can see, setting a readahead value higher than
>>> > max_sectors_kb doesn't appear to have any effect, meaning that
>>> > readahead is limited to the object that is currently being read.
>>> > Would you be able to confirm whether this is correct and by design?
>>>
>>> [CCing ceph-devel]
>>>
>>> Certainly not by design.  rbd is just a block device driver, so if the kernel
>>> submits a readahead read, it will obey and carry it out in full.
>>> The readahead is driven by the VM in pages; it doesn't care about rbd object
>>> boundaries and such.
>>>
>>> That said, one problem is in the VM subsystem, where readaheads get
>>> capped at 512 pages (= 2M).  If you do a simple single threaded read test,
>>> you'll see 4096 sector (= 2M) I/Os instead of object size I/Os:
>>>
>>>     $ rbd info foo | grep order
>>>             order 24 (16384 kB objects)
>>>     $ blockdev --getra /dev/rbd0
>>>     32768
>>>     $ dd if=/dev/rbd0 of=/dev/null bs=32M
>>>     # avgrq-sz is 4096.00
>>>
>>> This was introduced in commit 6d2be915e589 ("mm/readahead.c: fix
>>> readahead failure for memoryless NUMA nodes and limit readahead pages")
>>> [1], which went into 3.15.  The hard limit was Linus' suggestion, apparently.
>>>
>>> #define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)
>>> /*
>>>  * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
>>>  * sensible upper limit.
>>>  */
>>> unsigned long max_sane_readahead(unsigned long nr) {
>>>         return min(nr, MAX_READAHEAD);
>>> }
>>>
>>> This limit used to be dynamic and depended on the number of free pages in
>>> the system.  There has been an attempt to bring that behaviour back [2], but
>>> it didn't get very far towards mainline.  It looks like Red Hat and
>>> Oracle are shipping [2] in some of their kernels though.  If you apply it, you'll
>>> see 32768 sector (= 16M) I/Os in the above test, which is how it should be.
>>>
>>> [1]
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6d
>>> 2be915e589b58cb11418cbe1f22ff90732b6ac
>>> [2] http://thread.gmane.org/gmane.linux.kernel/1893680
>>>
>>> One thing we should be doing is setting read_ahead_kb to the object size;
>>> the default 128k doesn't really cut it for rbd.  I'll send a patch for that.
>>>
>>> Thanks,
>>>
>>>                 Ilya
>>
>>
>> Thanks for your response.
>>
>> I do see the I/Os being limited to 4096 sectors on the 4.1 kernel, so that is likely to be part of the cause of the poor performance I am seeing. However, I tried a 3.14 kernel and saw the same level of performance, but this time the I/Os were limited to 1024 sectors. The queue depth was around 8, so I guess this means it's submitting 8 x 512KB I/Os, up to the max_sectors_kb limit of 4096KB. From the OSD's point of view, this is still accessing one OSD at a time.
>>
>> Maybe I'm expecting the wrong results, but I was expecting one of these two scenarios to happen.
>>
>> 1. The kernel submits a single IO large enough to satisfy the readahead value; max_sectors_kb would need to be higher than the object size (currently not possible), and the RADOS layer would be responsible for doing the parallel reads to the OSDs to satisfy it.
>>
>> 2. The kernel recognises that the readahead is bigger than the max_sectors_kb value and submits several I/Os in parallel to the RBD device to satisfy the readahead request, e.g. a 32MB readahead would submit 8 x 4MB I/Os in parallel.
>>
>> Please let me know if I have got the wrong idea here, but in my head either solution should improve sequential reads by a large amount, with the second possibly slightly better as you are only waiting on the first OSD to respond to complete the request.
>>
>> Thanks for including the ceph-devel list. Unfortunately, despite several attempts, I have not been able to post to it after subscribing, so please forward any correspondence you think would be useful to share.
>
> Did you remember to set max_sectors_kb to max_hw_sectors_kb?  The block
> layer in 3.14 leaves max_sectors_kb at 512, even when max_hw_sectors_kb
> is set to a much bigger value by the driver.  If you adjust it, you
> should be able to see object-size requests, at least sometimes.  Note
> that you definitely won't see them all the time due to the max_segments
> limitation, which was lifted only recently.
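
For reference, those limits live under sysfs on the mapped device's queue
directory.  A minimal sketch of checking and raising them, assuming the
image is mapped as /dev/rbd0 and you have root access:

    $ cat /sys/block/rbd0/queue/max_hw_sectors_kb    # what the driver allows
    $ cat /sys/block/rbd0/queue/max_sectors_kb       # 512 on a stock 3.14 kernel
    # raise the soft limit to whatever the hardware limit reports
    $ cat /sys/block/rbd0/queue/max_hw_sectors_kb > /sys/block/rbd0/queue/max_sectors_kb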

I just realized that what I wrote is true only for O_DIRECT reads.  For
page-cache-driven reads, which is what we are discussing, the max_segments
limitation is a killer: 128 pages = 512k.  The fix was a one-liner, but
I don't think it was submitted for stable.
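
The segment cap is visible from sysfs as well; a quick check, again assuming
the image is mapped as /dev/rbd0:

    $ cat /sys/block/rbd0/queue/max_segments
    128    # on kernels without the fix: 128 segments x 4k pages = 512k per request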

The other thing is that 3.14.21+ kernels are just as screwed, readahead-wise,
as 3.15+, as the offending commit was backported.  So even if I submit the
max_segments one-liner to stable and it makes it into, say, 3.14.52, we will
still get only 4096-sector page cache I/Os, just like you got on 4.1.
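
A rough way to see what the kernel actually issues for buffered sequential
reads is to bump read_ahead_kb to the object size and watch avgrq-sz (in
512-byte sectors) during a single-threaded read.  A sketch, assuming a
16M-object image mapped as /dev/rbd0 and root access:

    $ echo 16384 > /sys/block/rbd0/queue/read_ahead_kb   # readahead = object size
    $ echo 3 > /proc/sys/vm/drop_caches                  # start from a cold page cache
    $ dd if=/dev/rbd0 of=/dev/null bs=32M &
    $ iostat -x rbd0 1
    # avgrq-sz of 4096 means 2M requests (the hard cap); 32768 would mean
    # full 16M object-size requests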

Thanks,

                Ilya