On Tue, Aug 25, 2015 at 5:05 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> -----Original Message-----
>> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
>> Sent: 25 August 2015 09:45
>> To: Nick Fisk <nick@xxxxxxxxxx>
>> Cc: Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>
>> Subject: Re: Kernel RBD Readahead
>>
>> On Tue, Aug 25, 2015 at 10:40 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> > I have done two tests, one with 1MB objects and another with 4MB
>> > objects. My cluster is a little busier than when I did the quick test
>> > yesterday, so all speeds are slightly down across the board, but you
>> > can see the scaling effect nicely. Results:
>> >
>> > 1MB Order RBD
>> > Readahead    DD->Null Speed
>> > 128          18MB/s
>> > 1024         32MB/s
>> > 2048         40MB/s
>> > 4096         58MB/s
>> > 8192         75MB/s
>> > 16384        91MB/s
>> > 32768        160MB/s
>> >
>> > 4MB Order RBD
>> > Readahead    DD->Null Speed
>> > 128          42MB/s
>> > 1024         56MB/s
>> > 2048         61MB/s
>> > 4096         98MB/s
>> > 8192         121MB/s
>> > 16384        170MB/s
>> > 32768        195MB/s
>> > 65536        221MB/s
>> > 131072       271MB/s
>> >
>> > I think the results confirm my suspicions. In a RAID array a full
>> > stripe will usually only be a couple of MB (e.g. 256KB chunk * 8
>> > disks), so a relatively small readahead will involve all the disks for
>> > maximum performance. In a Ceph RBD a full stripe will be 4MB * the
>> > number of OSDs in the cluster. So I think that if sequential read
>> > performance is the only goal, readahead probably needs to equal that
>> > figure, which could be massive. But in reality, like me, you will
>> > probably find that you get sufficient performance at a lower value.
>> > Of course, all this theory could change when the kernel client gets
>> > striping support.
>> >
>> > However, in terms of a default, that's a tricky one. Even setting it
>> > to 4096 would probably start to have a negative impact on pure random
>> > IO latency, since each read would make an OSD read a whole 4MB object;
>> > see the small table below for IOPs by read size for the disks in my
>> > cluster.
>> > I would imagine somewhere between 256 and 1024 would be a good
>> > trade-off, below the point where the OSD disks' latency starts to
>> > rise. Users would need to be aware of their workload and tweak
>> > readahead if needed.
>> >
>> > IOPs
>> > (4k Random Read)    83
>> > (64k Random Read)   81
>> > (256k Random Read)  73
>> > (1M Random Read)    52
>> > (4M Random Read)    25
>>
>> Yeah, we want a sensible default, but it's always going to be a trade-off.
>> librbd has readahead knobs, but the only real use case there is shortening
>> qemu boot times, so we can't copy those settings. I'll have to think about
>> it some more - it might make more sense to leave things as is. Users with
>> large sequential read workloads should know to check and adjust readahead
>> settings, and likely won't be satisfied with 1x object size anyway.
>
> Ok. I might try and create a 4.1 kernel with the blk-mq queue depth/IO size
> + readahead + max_segments fixes in, as I think the TCP_NODELAY bug will
> still be present in my old 3.14 kernel.

I can build 4.2-rc8 + readahead patch on gitbuilders for you.

Thanks,

                Ilya
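[Editor's note: the full-stripe arithmetic from the thread can be sketched quickly; the OSD count below is a hypothetical example for illustration, not a figure from the thread.]

```shell
# Estimate the readahead needed for a sequential read to touch every OSD,
# per the reasoning above: object size times the number of OSDs.
object_size_kb=4096   # 4MB-order RBD image
num_osds=24           # hypothetical cluster size; substitute your own
full_stripe_kb=$((object_size_kb * num_osds))
echo "full-stripe readahead: ${full_stripe_kb} KB"
```

Even at a modest 24 OSDs this works out to 96MB of readahead, which illustrates why the figure "could be massive" and why a smaller value is usually the practical choice.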
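[Editor's note: the readahead figures in the tables appear to be read_ahead_kb values (128 is the kernel default). A sketch of how such a run might be set up, assuming a device mapped at /dev/rbd0 (an illustrative name, not taken from the thread); this is a device-specific configuration fragment, so adjust to your mapping before use.]

```shell
# Raise readahead on a mapped RBD device (assumed /dev/rbd0) and rerun a
# DD->Null sequential read test. read_ahead_kb is in kilobytes, while
# blockdev --setra/--getra work in 512-byte sectors, so 4096 KB = 8192 sectors.
echo 4096 > /sys/block/rbd0/queue/read_ahead_kb
blockdev --getra /dev/rbd0      # reports sectors, i.e. 8192 here

# Buffered (non-direct) read, so the kernel readahead path is exercised:
dd if=/dev/rbd0 of=/dev/null bs=4M count=2048
```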