On Tue, Aug 25, 2015 at 5:05 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> -----Original Message-----
>> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
>> Sent: 25 August 2015 09:45
>> To: Nick Fisk <nick@xxxxxxxxxx>
>> Cc: Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>
>> Subject: Re: Kernel RBD Readahead
>>
>> On Tue, Aug 25, 2015 at 10:40 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> > I have done two tests, one with 1MB objects and another with 4MB
>> > objects. My cluster is a little busier than when I did the quick test
>> > yesterday, so all speeds are slightly down across the board, but you
>> > can see the scaling effect nicely. Results:
>> >
>> > 1MB Order RBD
>> > Readahead    DD->Null Speed
>> > 128          18MB/s
>> > 1024         32MB/s
>> > 2048         40MB/s
>> > 4096         58MB/s
>> > 8192         75MB/s
>> > 16384        91MB/s
>> > 32768        160MB/s
>> >
>> > 4MB Order RBD
>> > Readahead    DD->Null Speed
>> > 128          42MB/s
>> > 1024         56MB/s
>> > 2048         61MB/s
>> > 4096         98MB/s
>> > 8192         121MB/s
>> > 16384        170MB/s
>> > 32768        195MB/s
>> > 65536        221MB/s
>> > 131072       271MB/s
>> >
>> > I think the results confirm my suspicions. In a RAID array a full
>> > stripe will usually only be a couple of MB (e.g. 256KB chunk * 8
>> > disks), so a relatively small readahead will involve all the disks for
>> > maximum performance. In a Ceph RBD a full stripe will be 4MB * the
>> > number of OSDs in the cluster. So I think that if sequential read
>> > performance is the only goal, readahead probably needs to equal that
>> > figure, which could be massive. But in reality, like me, you will
>> > probably find that you get sufficient performance at a lower value.
>> > Of course, all this theory could change when the kernel client gets
>> > striping support.
>> >
>> > However, in terms of a default, that's a tricky one. Even setting it
>> > to 4096 would probably start to have a negative impact on pure random
>> > IO latency, since each read would make an OSD read a whole 4MB object;
>> > see the small table below for IOPs by read size for the disks in my
>> > cluster.
>> > I would imagine somewhere between 256 and 1024 would be a good
>> > trade-off, below the point where the OSD disks' latency starts to
>> > rise. Users would need to be aware of their workload and tweak
>> > readahead if needed.
>> >
>> > IOPs
>> > (4k Random Read)    83
>> > (64k Random Read)   81
>> > (256k Random Read)  73
>> > (1M Random Read)    52
>> > (4M Random Read)    25
>>
>> Yeah, we want a sensible default, but it's always going to be a trade-off.
>> librbd has readahead knobs, but the only real use case there is shortening
>> qemu boot times, so we can't copy those settings. I'll have to think about
>> it some more - it might make more sense to leave things as is. Users with
>> large sequential read workloads should know to check and adjust readahead
>> settings, and likely won't be satisfied with 1x object size anyway.
>
> Ok. I might try and create a 4.1 kernel with the blk-mq queue depth/IO size
> + readahead + max_segments fixes in, as I think the TCP_NODELAY bug will
> still be present in my old 3.14 kernel.

I can build 4.2-rc8 + readahead patch on gitbuilders for you.

Thanks,

                Ilya
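[Editor's note: the full-stripe arithmetic from the thread can be sketched quickly; the OSD count below is a hypothetical example for illustration, not a figure from the thread.]

```shell
# Estimate the readahead needed for a sequential read to touch every OSD,
# per the reasoning above: object size times the number of OSDs.
object_size_kb=4096   # 4MB-order RBD image
num_osds=24           # hypothetical cluster size; substitute your own
full_stripe_kb=$((object_size_kb * num_osds))
echo "full-stripe readahead: ${full_stripe_kb} KB"
```

Even at a modest 24 OSDs this works out to 96MB of readahead, which illustrates why the figure "could be massive" and why a smaller value is usually the practical choice.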
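[Editor's note: the readahead figures in the tables appear to be read_ahead_kb values (128 is the kernel default). A sketch of how such a run might be set up, assuming a device mapped at /dev/rbd0 (an illustrative name, not taken from the thread); this is a device-specific configuration fragment, so adjust to your mapping before use.]

```shell
# Raise readahead on a mapped RBD device (assumed /dev/rbd0) and rerun a
# DD->Null sequential read test. read_ahead_kb is in kilobytes, while
# blockdev --setra/--getra work in 512-byte sectors, so 4096 KB = 8192 sectors.
echo 4096 > /sys/block/rbd0/queue/read_ahead_kb
blockdev --getra /dev/rbd0      # reports sectors, i.e. 8192 here

# Buffered (non-direct) read, so the kernel readahead path is exercised:
dd if=/dev/rbd0 of=/dev/null bs=4M count=2048
```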