On Mon, Aug 24, 2015 at 7:00 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> -----Original Message-----
>> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
>> Sent: 24 August 2015 16:07
>> To: Nick Fisk <nick@xxxxxxxxxx>
>> Cc: Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>
>> Subject: Re: Kernel RBD Readahead
>>
>> On Mon, Aug 24, 2015 at 5:43 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>> > On Sun, Aug 23, 2015 at 10:23 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >>> -----Original Message-----
>> >>> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
>> >>> Sent: 23 August 2015 18:33
>> >>> To: Nick Fisk <nick@xxxxxxxxxx>
>> >>> Cc: Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>
>> >>> Subject: Re: Kernel RBD Readahead
>> >>>
>> >>> On Sat, Aug 22, 2015 at 11:45 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >>> > Hi Ilya,
>> >>> >
>> >>> > I was wondering if I could get your thoughts on a matter I have
>> >>> > run into.
>> >>> >
>> >>> > It's about the read performance of the RBD kernel client and
>> >>> > blk-mq, mainly when doing large single-threaded reads. During
>> >>> > testing, performance seems to be limited to around 40MB/s, which
>> >>> > is probably fairly similar to what you would expect to get from a
>> >>> > single OSD. This is to be expected, as an RBD is just a long
>> >>> > chain of objects, each on a different OSD, being read through in
>> >>> > order one at a time.
>> >>> >
>> >>> > In theory, readahead should make up for this by making the RBD
>> >>> > client read from several OSDs ahead of the currently required
>> >>> > block. However, from what I can see, setting a readahead value
>> >>> > higher than max_sectors_kb doesn't appear to have any effect,
>> >>> > meaning that readahead is limited to the object currently being
>> >>> > read. Would you be able to confirm whether this is correct and
>> >>> > whether it is by design?
>> >>>
>> >>> [CCing ceph-devel]
>> >>>
>> >>> Certainly not by design. rbd is just a block device driver, so if
>> >>> the kernel submits a readahead read, it will obey and carry it out
>> >>> in full. The readahead is driven by the VM in pages; it doesn't
>> >>> care about rbd object boundaries and such.
>> >>>
>> >>> That said, one problem is in the VM subsystem, where readaheads
>> >>> get capped at 512 pages (= 2M). If you do a simple single-threaded
>> >>> read test, you'll see 4096 sector (= 2M) I/Os instead of object
>> >>> size I/Os:
>> >>>
>> >>> $ rbd info foo | grep order
>> >>>         order 24 (16384 kB objects)
>> >>> $ blockdev --getra /dev/rbd0
>> >>> 32768
>> >>> $ dd if=/dev/rbd0 of=/dev/null bs=32M
>> >>> # avgrq-sz is 4096.00
>> >>>
>> >>> This was introduced in commit 6d2be915e589 ("mm/readahead.c: fix
>> >>> readahead failure for memoryless NUMA nodes and limit readahead
>> >>> pages") [1], which went into 3.15. The hard limit was Linus'
>> >>> suggestion, apparently.
>> >>>
>> >>> #define MAX_READAHEAD ((512*4096)/PAGE_CACHE_SIZE)
>> >>> /*
>> >>>  * Given a desired number of PAGE_CACHE_SIZE readahead pages,
>> >>>  * return a sensible upper limit.
>> >>>  */
>> >>> unsigned long max_sane_readahead(unsigned long nr)
>> >>> {
>> >>>         return min(nr, MAX_READAHEAD);
>> >>> }
>> >>>
>> >>> This limit used to be dynamic and depended on the number of free
>> >>> pages in the system. There has been an attempt to bring that
>> >>> behaviour back [2], but it didn't go very far toward getting into
>> >>> mainline. It looks like Red Hat and Oracle are shipping [2] in
>> >>> some of their kernels, though. If you apply it, you'll see 32768
>> >>> sector (= 16M) I/Os in the above test, which is how it should be.
>> >>>
>> >>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6d2be915e589b58cb11418cbe1f22ff90732b6ac
>> >>> [2] http://thread.gmane.org/gmane.linux.kernel/1893680
>> >>>
>> >>> One thing we should be doing is setting read_ahead_kb to the
>> >>> object size; the default 128k doesn't really cut it for rbd. I'll
>> >>> send a patch for that.
>> >>>
>> >>> Thanks,
>> >>>
>> >>>                 Ilya
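
For anyone following along: read_ahead_kb, max_sectors_kb and
max_hw_sectors_kb are all exposed per device under sysfs. A minimal
sketch of the tuning Ilya describes, assuming a /dev/rbd0 device with
4M objects (both the device name and the object size are assumptions,
not values from this thread); the echo commands need root:

$ cat /sys/block/rbd0/queue/read_ahead_kb           # kernel default: 128
$ echo 4096 > /sys/block/rbd0/queue/read_ahead_kb   # match the 4M object size
$ cat /sys/block/rbd0/queue/max_hw_sectors_kb       # set by the driver, e.g. 4096
$ echo 4096 > /sys/block/rbd0/queue/max_sectors_kb  # lift the 512 default
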
>> >>
>> >> Thanks for your response.
>> >>
>> >> I do see the I/Os being limited to 4096 sectors in the 4.1 kernel,
>> >> so that is likely to be part of the cause of the poor performance
>> >> I am seeing. However, I tried a 3.14 kernel and saw the same level
>> >> of performance, but this time the I/Os were limited to 1024
>> >> sectors. The queue depth was at around 8, so I guess this means
>> >> it's submitting 8 x 512KB I/Os up to the max_sectors_kb limit of
>> >> 4096KB. From the OSD point of view, this will still be accessing
>> >> one OSD at a time.
>> >>
>> >> Maybe I'm expecting the wrong results, but I was expecting one of
>> >> these two scenarios to happen:
>> >>
>> >> 1. The kernel submits a large enough I/O to satisfy the readahead
>> >> value; max_sectors_kb would need to be higher than the object size
>> >> (currently not possible), and the RADOS layer would be responsible
>> >> for doing the parallel reads to the OSDs to satisfy it.
>> >>
>> >> 2. The kernel recognises that the readahead is bigger than the
>> >> max_sectors_kb value and submits several I/Os in parallel to the
>> >> RBD device to satisfy the readahead request, i.e. a 32MB readahead
>> >> would submit 8 x 4MB I/Os in parallel.
>> >>
>> >> Please let me know if I have got the wrong idea here, but in my
>> >> head either solution should improve sequential reads by a large
>> >> amount, with the 2nd possibly slightly better, as you are only
>> >> waiting on the 1st OSD to respond to complete the request.
>> >>
>> >> Thanks for including the ceph-devel list. Unfortunately, despite
>> >> several attempts, I have not been able to post to this list after
>> >> subscribing, so please forward any correspondence you think would
>> >> be useful to share.
>> >
>> > Did you remember to set max_sectors_kb to max_hw_sectors_kb? The
>> > block layer in 3.14 leaves max_sectors_kb at 512, even when
>> > max_hw_sectors_kb is set to a much bigger value by the driver. If
>> > you adjust it, you should be able to see object size requests, at
>> > least sometimes. Note that you definitely won't see them all the
>> > time due to the max_segments limitation, which was lifted only
>> > recently.
>
> Yes, I made sure I checked this on all the kernels I tested.
>
>> I just realized that what I wrote is true for O_DIRECT reads. For
>> page cache driven reads, which is what we are discussing, the
>> max_segments limitation is a killer - 128 pages = 512k. The fix was
>> a one-liner, but I don't think it was submitted for stable.
>
> I'm also seeing this, which is probably not helping, but see below :-)
>
>> The other thing is that 3.14.21+ kernels are just as screwed
>> readahead-wise as 3.15+, as the offending commit was backported. So
>> even if I submit the max_segments one-liner to stable and it makes
>> it into, say, 3.14.52, we will still get only 4096 sector page cache
>> I/Os, just like you got on 4.1.
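
For reference, the request sizes and queue depths quoted in this thread
(1024, 4096 or 32768 sectors; queue depths around 8 or 50) are the kind
of numbers iostat reports while a sequential read is running. A sketch,
again assuming /dev/rbd0:

$ dd if=/dev/rbd0 of=/dev/null bs=32M &
$ iostat -x rbd0 1
# avgrq-sz is in 512-byte sectors: 4096.00 means 2M requests (the
# capped readahead), 32768.00 means 16M requests (full 16M objects
# with [2] applied); avgqu-sz is the average queue depth.
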
>
> Yes, this was the problem; apologies, I should have checked whether
> the patch had been backported. I tried an older 3.14 release and I am
> now seeing up to 200MB/s, a massive improvement. Queue depths are
> hovering around the 50 mark, which is what I would expect. I suspect
> that with the max_segments fix it would go faster still, as the
> requests to the OSDs would be larger.
>
> In the thread you linked, Linus asked for a real-world example where
> that commit was causing problems; here is one:
>
> "Using a Ceph RBD as a staging area before trying to stream to tape,
> which needs to average around 150-200MB/s to be suitable"
>
> Where do we go from here to try and get the behaviour modified
> upstream?

Can you describe your use case in more detail, in a cut & pasteable way?

Thanks,

                Ilya
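
A cut & pasteable version of the use case might look something like the
following sketch. The image name, sizes and device path are assumptions;
only the throughput figures come from the thread:

$ rbd create --size 102400 staging          # 100G staging image (assumed name)
$ rbd map staging                           # shows up as e.g. /dev/rbd0
$ dd if=/dev/zero of=/dev/rbd0 bs=32M count=320 oflag=direct  # populate 10G
$ echo 3 > /proc/sys/vm/drop_caches         # as root: start with a cold cache
$ dd if=/dev/rbd0 of=/dev/null bs=32M count=320
# With the capped readahead this sequential read runs at roughly
# 40MB/s; staging to tape needs a sustained 150-200MB/s.
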