On Mon, Aug 24, 2015 at 7:00 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> -----Original Message-----
>> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
>> Sent: 24 August 2015 16:07
>> To: Nick Fisk <nick@xxxxxxxxxx>
>> Cc: Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>
>> Subject: Re: Kernel RBD Readahead
>>
>> On Mon, Aug 24, 2015 at 5:43 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>> > On Sun, Aug 23, 2015 at 10:23 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >>> -----Original Message-----
>> >>> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
>> >>> Sent: 23 August 2015 18:33
>> >>> To: Nick Fisk <nick@xxxxxxxxxx>
>> >>> Cc: Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>
>> >>> Subject: Re: Kernel RBD Readahead
>> >>>
>> >>> On Sat, Aug 22, 2015 at 11:45 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >>> > Hi Ilya,
>> >>> >
>> >>> > I was wondering if I could get your thoughts on a matter I have
>> >>> > run into.
>> >>> >
>> >>> > It's about the read performance of the RBD kernel client and
>> >>> > blk-mq, mainly when doing large single-threaded reads. During
>> >>> > testing, performance seems to be limited to around 40MB/s, which
>> >>> > is probably fairly similar to what you would expect to get from a
>> >>> > single OSD. This is to be expected, as an RBD is just a long
>> >>> > chain of objects, each on a different OSD, being read through in
>> >>> > order one at a time.
>> >>> >
>> >>> > In theory, readahead should make up for this by making the RBD
>> >>> > client read from several OSDs ahead of the currently required
>> >>> > block. However, from what I can see, setting a readahead value
>> >>> > higher than max_sectors_kb doesn't appear to have any effect,
>> >>> > meaning that readahead is limited to the object currently being
>> >>> > read. Would you be able to confirm whether this is correct and
>> >>> > whether it is by design?
>> >>>
>> >>> [CCing ceph-devel]
>> >>>
>> >>> Certainly not by design. rbd is just a block device driver, so if
>> >>> the kernel submits a readahead read, it will obey and carry it out
>> >>> in full. The readahead is driven by the VM in pages; it doesn't
>> >>> care about rbd object boundaries and such.
>> >>>
>> >>> That said, one problem is in the VM subsystem, where readaheads
>> >>> get capped at 512 pages (= 2M). If you do a simple single-threaded
>> >>> read test, you'll see 4096 sector (= 2M) I/Os instead of object
>> >>> size I/Os:
>> >>>
>> >>> $ rbd info foo | grep order
>> >>>         order 24 (16384 kB objects)
>> >>> $ blockdev --getra /dev/rbd0
>> >>> 32768
>> >>> $ dd if=/dev/rbd0 of=/dev/null bs=32M
>> >>> # avgrq-sz is 4096.00
>> >>>
>> >>> This was introduced in commit 6d2be915e589 ("mm/readahead.c: fix
>> >>> readahead failure for memoryless NUMA nodes and limit readahead
>> >>> pages") [1], which went into 3.15. The hard limit was Linus'
>> >>> suggestion, apparently.
>> >>>
>> >>> #define MAX_READAHEAD ((512*4096)/PAGE_CACHE_SIZE)
>> >>> /*
>> >>>  * Given a desired number of PAGE_CACHE_SIZE readahead pages,
>> >>>  * return a sensible upper limit.
>> >>>  */
>> >>> unsigned long max_sane_readahead(unsigned long nr)
>> >>> {
>> >>>         return min(nr, MAX_READAHEAD);
>> >>> }
>> >>>
>> >>> This limit used to be dynamic and depended on the number of free
>> >>> pages in the system. There has been an attempt to bring that
>> >>> behaviour back [2], but it didn't go very far toward getting into
>> >>> mainline. It looks like Red Hat and Oracle are shipping [2] in
>> >>> some of their kernels, though. If you apply it, you'll see 32768
>> >>> sector (= 16M) I/Os in the above test, which is how it should be.
>> >>>
>> >>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6d2be915e589b58cb11418cbe1f22ff90732b6ac
>> >>> [2] http://thread.gmane.org/gmane.linux.kernel/1893680
>> >>>
>> >>> One thing we should be doing is setting read_ahead_kb to the
>> >>> object size; the default 128k doesn't really cut it for rbd. I'll
>> >>> send a patch for that.
>> >>>
>> >>> Thanks,
>> >>>
>> >>>                 Ilya
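
For anyone following along: read_ahead_kb, max_sectors_kb and
max_hw_sectors_kb are all exposed per device under sysfs. A minimal
sketch of the tuning Ilya describes, assuming a /dev/rbd0 device with
4M objects (both the device name and the object size are assumptions,
not values from this thread); the echo commands need root:

$ cat /sys/block/rbd0/queue/read_ahead_kb           # kernel default: 128
$ echo 4096 > /sys/block/rbd0/queue/read_ahead_kb   # match the 4M object size
$ cat /sys/block/rbd0/queue/max_hw_sectors_kb       # set by the driver, e.g. 4096
$ echo 4096 > /sys/block/rbd0/queue/max_sectors_kb  # lift the 512 default
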
>> >>
>> >> Thanks for your response.
>> >>
>> >> I do see the I/Os being limited to 4096 sectors in the 4.1 kernel,
>> >> so that is likely to be part of the cause of the poor performance
>> >> I am seeing. However, I tried a 3.14 kernel and saw the same level
>> >> of performance, but this time the I/Os were limited to 1024
>> >> sectors. The queue depth was at around 8, so I guess this means
>> >> it's submitting 8 x 512KB I/Os up to the max_sectors_kb limit of
>> >> 4096KB. From the OSD point of view, this will still be accessing
>> >> one OSD at a time.
>> >>
>> >> Maybe I'm expecting the wrong results, but I was expecting one of
>> >> these two scenarios to happen:
>> >>
>> >> 1. The kernel submits a large enough I/O to satisfy the readahead
>> >> value; max_sectors_kb would need to be higher than the object size
>> >> (currently not possible), and the RADOS layer would be responsible
>> >> for doing the parallel reads to the OSDs to satisfy it.
>> >>
>> >> 2. The kernel recognises that the readahead is bigger than the
>> >> max_sectors_kb value and submits several I/Os in parallel to the
>> >> RBD device to satisfy the readahead request, i.e. a 32MB readahead
>> >> would submit 8 x 4MB I/Os in parallel.
>> >>
>> >> Please let me know if I have got the wrong idea here, but in my
>> >> head either solution should improve sequential reads by a large
>> >> amount, with the 2nd possibly slightly better, as you are only
>> >> waiting on the 1st OSD to respond to complete the request.
>> >>
>> >> Thanks for including the ceph-devel list. Unfortunately, despite
>> >> several attempts, I have not been able to post to this list after
>> >> subscribing, so please forward any correspondence you think would
>> >> be useful to share.
>> >
>> > Did you remember to set max_sectors_kb to max_hw_sectors_kb? The
>> > block layer in 3.14 leaves max_sectors_kb at 512, even when
>> > max_hw_sectors_kb is set to a much bigger value by the driver. If
>> > you adjust it, you should be able to see object size requests, at
>> > least sometimes. Note that you definitely won't see them all the
>> > time due to the max_segments limitation, which was lifted only
>> > recently.
>
> Yes, I made sure I checked this on all the kernels I tested.
>
>> I just realized that what I wrote is true for O_DIRECT reads. For
>> page cache driven reads, which is what we are discussing, the
>> max_segments limitation is a killer - 128 pages = 512k. The fix was
>> a one-liner, but I don't think it was submitted for stable.
>
> I'm also seeing this, which is probably not helping, but see below :-)
>
>> The other thing is that 3.14.21+ kernels are just as screwed
>> readahead-wise as 3.15+, as the offending commit was backported. So
>> even if I submit the max_segments one-liner to stable and it makes
>> it into, say, 3.14.52, we will still get only 4096 sector page cache
>> I/Os, just like you got on 4.1.
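
For reference, the request sizes and queue depths quoted in this thread
(1024, 4096 or 32768 sectors; queue depths around 8 or 50) are the kind
of numbers iostat reports while a sequential read is running. A sketch,
again assuming /dev/rbd0:

$ dd if=/dev/rbd0 of=/dev/null bs=32M &
$ iostat -x rbd0 1
# avgrq-sz is in 512-byte sectors: 4096.00 means 2M requests (the
# capped readahead), 32768.00 means 16M requests (full 16M objects
# with [2] applied); avgqu-sz is the average queue depth.
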
>
> Yes, this was the problem; apologies, I should have checked whether
> the patch had been backported. I tried an older 3.14 release and I am
> now seeing up to 200MB/s, a massive improvement. Queue depths are
> hovering around the 50 mark, which is what I would expect. I suspect
> that with the max_segments fix it would go faster still, as the
> requests to the OSDs would be larger.
>
> In the thread you linked, Linus asked for a real-world example where
> that commit was causing problems; here is one:
>
> "Using a Ceph RBD as a staging area before trying to stream to tape,
> which needs to average around 150-200MB/s to be suitable"
>
> Where do we go from here to try and get the behaviour modified
> upstream?

Can you describe your use case in more detail, in a cut & pasteable way?

Thanks,

                Ilya
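
A cut & pasteable version of the use case might look something like the
following sketch. The image name, sizes and device path are assumptions;
only the throughput figures come from the thread:

$ rbd create --size 102400 staging          # 100G staging image (assumed name)
$ rbd map staging                           # shows up as e.g. /dev/rbd0
$ dd if=/dev/zero of=/dev/rbd0 bs=32M count=320 oflag=direct  # populate 10G
$ echo 3 > /proc/sys/vm/drop_caches         # as root: start with a cold cache
$ dd if=/dev/rbd0 of=/dev/null bs=32M count=320
# With the capped readahead this sequential read runs at roughly
# 40MB/s; staging to tape needs a sustained 150-200MB/s.
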