Reply in text

> On 18 Aug 2015, at 12:59, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Jan Schermer
>> Sent: 18 August 2015 11:50
>> To: Benedikt Fraunhofer <given.to.lists.ceph-users.ceph.com.toasta.001@xxxxxxxxxx>
>> Cc: ceph-users@xxxxxxxxxxxxxx; Nick Fisk <nick@xxxxxxxxxx>
>> Subject: Re: How to improve single thread sequential reads?
>>
>> I'm not sure if I missed that, but are you testing in a VM backed by an
>> RBD device, or using the device directly?
>>
>> I don't see how blk-mq would help if it's not a VM; it just passes the
>> request to the underlying block device, and in the case of RBD there is
>> no real block device from the host's perspective...? Enlighten me if I'm
>> wrong, please. I have some Ubuntu VMs that use blk-mq for virtio-blk
>> devices, and it makes me cringe because I'm unable to tune the scheduler
>> - it just makes no sense at all.
>
> Since 4.0 (I think) the kernel RBD client uses the blk-mq infrastructure,
> but there is a bug which limits the maximum IO size to 128KB, which is
> why the testing kernel is essential for large-block/sequential workloads.
> I think the fix should hopefully make it into 4.2.

blk-mq is supposed to remove the redundancy of having an IO scheduler at
every layer of VM -> VM block device -> host IO scheduler -> block device.
It's a paravirtualized driver that just moves requests from inside the VM
to the host queue (which is why inside the VM you have no IO scheduler
options - it effectively becomes noop). But that just doesn't make sense if
you're using qemu with librbd - there is no host queue. It would make sense
if the qemu drive were a krbd device, which does have a queue. If there's
no VM, there should be no blk-mq? So what was added to the kernel was
probably the host-side infrastructure to handle blk-mq requests from a
guest passed through to the krbd device - but that's probably not your
case, is it?

>> Anyway, I'd try bumping up read_ahead_kb first, and max_hw_sectors_kb
>> (so it doesn't cap the readahead); also try (if you're not using blk-mq)
>> switching to the cfq scheduler and setting rotational=1. I see you've
>> also tried this, but I think blk-mq is the limiting factor here now.
>
> I'm pretty sure you can't adjust max_hw_sectors_kb (which equals the
> object size, from what I can tell), and max_sectors_kb is already set to
> the hardware maximum. It would be nice if max_hw_sectors_kb could be set
> higher, but I'm not sure whether there is a reason for this limit.
>
>> If you are running a single-threaded benchmark like rados bench then
>> what's limiting you is latency - it's not surprising it scales up with
>> more threads.
>
> Agreed, but with sequential workloads, if you can get readahead working
> properly then you can easily remove this limitation, as a single-threaded
> op effectively becomes multithreaded.

Thinking on this more - I don't know if this will help after all. It will
still be a single thread, just trying to get ahead of the client IO, and
it's not likely to get ahead unless you consume the data in userspace more
slowly than Ceph can read it... I think striping across multiple devices
could be the answer after all. But have you tried creating the RBD volume
as striped in Ceph?
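For reference, a minimal sketch of what a striped image would look like at
creation time. The pool/image names and sizes below are made up, striping
needs a format 2 image, and - as noted further down the thread - the kernel
client can't map fancy-striped images, so this only helps librbd or
radosstriper consumers:

  # Illustrative only: 1TB image (--size is in MB here), 4MB objects
  # (order 22), striped 1MB at a time across 16 objects so a sequential
  # read touches many objects (and therefore many OSDs) at once.
  rbd create tape-staging --pool rbd --image-format 2 --size 1048576 \
      --order 22 --stripe-unit 1048576 --stripe-count 16

  # Check that the stripe settings actually took effect.
  rbd info rbd/tape-staging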
Can you dump "find /sys/block/$rbd" in here? >> >> Jan >> >> >>> On 18 Aug 2015, at 12:25, Benedikt Fraunhofer <given.to.lists.ceph- >> users.ceph.com.toasta.001@xxxxxxxxxx> wrote: >>> >>> Hi Nick, >>> >>> did you do anything fancy to get to ~90MB/s in the first place? >>> I'm stuck at ~30MB/s reading cold data. single-threaded-writes are >>> quite speedy, around 600MB/s. >>> >>> radosgw for cold data is around the 90MB/s, which is imho limitted by >>> the speed of a single disk. >>> >>> Data already present on the osd-os-buffers arrive with around >>> 400-700MB/s so I don't think the network is the culprit. >>> >>> (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds >>> each, lacp 2x10g bonds) >>> >>> rados bench single-threaded performs equally bad, but with its default >>> multithreaded settings it generates wonderful numbers, usually only >>> limiited by linerate and/or interrupts/s. >>> >>> I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to >>> get to "your wonderful" numbers, but it's staying below 30 MB/s. >>> >>> I was thinking about using a software raid0 like you did but that's >>> imho really ugly. >>> When I know I needed something speedy, I usually just started dd-ing >>> the file to /dev/null and wait for about three minutes before >>> starting the actual job; some sort of hand-made read-ahead for >>> dummies. >>> >>> Thx in advance >>> Benedikt >>> >>> >>> 2015-08-17 13:29 GMT+02:00 Nick Fisk <nick@xxxxxxxxxx>: >>>> Thanks for the replies guys. >>>> >>>> The client is set to 4MB, I haven't played with the OSD side yet as I >>>> wasn't sure if it would make much difference, but I will give it a >>>> go. If the client is already passing a 4MB request down through to >>>> the OSD, will it be able to readahead any further? The next 4MB >>>> object in theory will be on another OSD and so I'm not sure if >>>> reading ahead any further on the OSD side would help. >>>> >>>> How I see the problem is that the RBD client will only read 1 OSD at >>>> a time as the RBD readahead can't be set any higher than >>>> max_hw_sectors_kb, which is the object size of the RBD. Please correct >> me if I'm wrong on this. >>>> >>>> If you could set the RBD readahead to much higher than the object >>>> size, then this would probably give the desired effect where the >>>> buffer could be populated by reading from several OSD's in advance to >>>> give much higher performance. That or wait for striping to appear in > the >> Kernel client. >>>> >>>> I've also found that BareOS (fork of Bacula) seems to has a direct >>>> RADOS feature that supports radosstriper. I might try this and see >>>> how it performs as well. >>>> >>>> >>>>> -----Original Message----- >>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On >>>>> Behalf Of Somnath Roy >>>>> Sent: 17 August 2015 03:36 >>>>> To: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>; Nick Fisk >>>>> <nick@xxxxxxxxxx> >>>>> Cc: ceph-users@xxxxxxxxxxxxxx >>>>> Subject: Re: How to improve single thread sequential >> reads? >>>>> >>>>> Have you tried setting read_ahead_kb to bigger number for both >>>>> client/OSD side if you are using krbd ? >>>>> In case of librbd, try the different config options for rbd cache.. 
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Alex Gorbachev
>>>>> Sent: Sunday, August 16, 2015 7:07 PM
>>>>> To: Nick Fisk
>>>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>>>> Subject: Re: How to improve single thread sequential reads?
>>>>>
>>>>> Hi Nick,
>>>>>
>>>>> On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Nick Fisk
>>>>>>> Sent: 13 August 2015 18:04
>>>>>>> To: ceph-users@xxxxxxxxxxxxxx
>>>>>>> Subject: How to improve single thread sequential reads?
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm trying to use an RBD as a staging area for some data before
>>>>>>> pushing it down to some LTO6 tapes. As I cannot use striping with
>>>>>>> the kernel client, I tend to max out at around 80MB/s reads when
>>>>>>> testing with dd. Has anyone got any clever suggestions for giving
>>>>>>> this a bit of a boost? I think I need to get it up to around
>>>>>>> 200MB/s to make sure there is always a steady flow of data to the
>>>>>>> tape drive.
>>>>>>
>>>>>> I've just tried the testing kernel with the blk-mq fixes in it for
>>>>>> full-size IOs; this, combined with bumping readahead up to 4MB, is
>>>>>> now getting me 150MB/s to 200MB/s on average, so this might suffice.
>>>>>>
>>>>>> Out of personal interest, I would still like to know if anyone has
>>>>>> ideas on how to push much higher bandwidth through an RBD.
>>>>>
>>>>> Some settings in our ceph.conf that may help:
>>>>>
>>>>> osd_op_threads = 20
>>>>> osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
>>>>> filestore_queue_max_ops = 90000
>>>>> filestore_flusher = false
>>>>> filestore_max_sync_interval = 10
>>>>> filestore_sync_flush = false
>>>>>
>>>>> Regards,
>>>>> Alex
>>>>>
>>>>>>> Rbd-fuse seems to top out at 12MB/s, so there goes that option.
>>>>>>>
>>>>>>> I'm thinking that mapping multiple RBDs and then combining them
>>>>>>> into an mdadm RAID0 stripe might work, but it seems a bit messy.
>>>>>>>
>>>>>>> Any suggestions?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nick
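For completeness, a rough sketch of that messy-but-workable idea - the
image names, device count and chunk size are made up, and the /dev/rbdN
numbering depends on the order in which the images are mapped:

  # Assume four images vol1..vol4 already exist in pool 'rbd'
  for i in 1 2 3 4; do rbd map rbd/vol$i; done

  # Stripe them together; a 4096KB chunk matches the default 4MB RBD object size
  mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=4096 \
      /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3

  mkfs.xfs /dev/md0
  mount /dev/md0 /mnt/staging

A sequential read against the md device then fans out across four RBDs
(and four sets of OSDs) at once, at the cost of having to manage the extra
mapping and array assembly at boot.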
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com