Re: Poor read performance.

Hello,

On Wed, 25 Apr 2018 17:20:55 -0400 Jonathan Proulx wrote:

> On Wed Apr 25 02:24:19 PDT 2018 Christian Balzer wrote:
> 
> > Hello,  
> 
> > On Tue, 24 Apr 2018 12:52:55 -0400 Jonathan Proulx wrote:  
> 
> > > The performance I really care about is over rbd for VMs in my
> > > OpenStack, but 'rbd bench' seems to line up pretty well with 'fio'
> > > tests inside VMs, so here's a more or less typical random write rbd
> > > bench (from a monitor node with a 10G connection on the same net as
> > > the osds):
> > >  
> 
> > "rbd bench" does things differently than fio (lots of happy switches
> > there) so to make absolutely sure you're not doing and apples and oranges
> > thing I'd suggest you stick to fio in a VM.  
> 
> There are some tradeoffs, yes, but I get very close results, and I
> figured I'd use the ceph tools for the ceph list rather than pulling in
> all the rest of my working stack, since the ceph tools do show the
> problem.
> 
> but I do see your point.
> 
> 
> > In comparison this fio:
> > ---
> > fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32
> > ---  
> 
> > Will only result in this, due to the network latencies of having direct
> > I/O and only having one OSD at a time being busy:
> > ---
> >   write: io=110864KB, bw=1667.2KB/s, iops=416, runt= 66499msec
> > ---  
> 
> I may simply have underestimated the impact of write caching in
> libvirt; that fio command does get me just about as crappy write
> performance as read (which would point to me just needing more IOPS
> from more/faster disks, which is definitely true to a greater or lesser
> extent).
> 
No caching with the direct flag, rather intentionally so.
But yes, these numbers are rather sad.
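
If you want to double-check what caching the guest disks actually get,
the libvirt domain XML is the place to look. A quick check (myvm is a
placeholder for your instance name; no output means the qemu default
applies):
---
virsh dumpxml myvm | grep -o "cache='[a-z]*'"
---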

> WRITE: io=1024.0MB, aggrb=5705KB/s, minb=5705KB/s, maxb=5705KB/s,
>        mint=183789msec, maxt=183789msec
> 
> READ: io=1024.0MB, aggrb=4322KB/s, minb=4322KB/s, maxb=4322KB/s,
>       mint=242606msec, maxt=242606msec 
> 
> > That being said, something is rather wrong here indeed, my crappy test
> > cluster shouldn't be able to outperform yours.  
> 
> well, load ... the asymmetry was my main puzzlement, but that may be illusory
> 
Yeah, those 1.7k VMs...

> > > rbd bench  --io-total=4G --io-size 4096 --io-type write \
> > > --io-pattern rand --io-threads 16 mypool/myvol
> > > 
> > > <snip />
> > > 
> > > elapsed:   361  ops:  1048576  ops/sec:  2903.82  bytes/sec: 11894034.98
> > > 
> > > same for random read is an order of magnitude lower:
> > > 
> > > rbd bench  --io-total=4G --io-size 4096 --io-type read \
> > > --io-pattern rand --io-threads 16  mypool/myvol
> > > 
> > > elapsed:  3354  ops:  1048576  ops/sec:   312.60  bytes/sec: 1280403.47
> > > 
> > > (sequential reads and a bigger io-size help, but not a lot)
> > > 
> > > ceph -s from during read bench so get a sense of comparing traffic:
> > > 
> > >   cluster:
> > >     id:     <UUID>
> > >     health: HEALTH_OK
> > >  
> > >   services:
> > >     mon: 3 daemons, quorum ceph-mon0,ceph-mon1,ceph-mon2
> > >     mgr: ceph-mon0(active), standbys: ceph-mon2, ceph-mon1
> > >     osd: 174 osds: 174 up, 174 in
> > >     rgw: 3 daemon active
> > >  
> > >   data:
> > >     pools:   19 pools, 10240 pgs
> > >     objects: 17342k objects, 80731 GB
> > >     usage:   240 TB used, 264 TB / 505 TB avail
> > >     pgs:     10240 active+clean
> > >  
> > >   io:
> > >     client:   4296 kB/s rd, 417 MB/s wr, 1635 op/s rd, 3518 op/s wr
> > > 
> > > 
> > > During deep-scrubs overnight I can see the disks doing >500MBps of
> > > reads and ~150 read IOPS (each, at peak), while during the read
> > > bench (including all traffic from ~1k VMs) individual osd data
> > > partitions peak around 25 read IOPS and 1.5MBps read bandwidth, so
> > > it seems like there should be performance to spare.
> > >   
> > OK, there are a couple of things here.
> > 1k VMs?!?  
> 
> Actually 1.7k VMs just now, which caught me a bit by surprise when I
> looked at it.  Many are idle because we don't charge per use
> internally so people are sloppy, but many aren't and even the idle
> ones are writing logs and such.
> 
Indeed. 
And these writes often hit the same object, and thus the same PG, over and
over; add a little bad luck from the random fairy and some OSDs get hit
much more than others.
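
If you want to see whether some OSDs really are running hotter than
others, these should show it quickly (standard commands, exact columns
vary a bit by version):
---
ceph osd perf       # per-OSD commit/apply latency, slow outliers stand out
ceph osd df tree    # per-OSD utilization and PG count, shows imbalance
---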

> > One assumes that they're not idle, looking at the output above.
> > And writes will compete with reads on the same spindle of course.
> > "performance to spare" you say, but have you verified this with iostat or
> > atop?  
> 
> This assertion is mostly based on collectd stats that show a spike in
> read ops and bandwidth during our scrub window and no large change in
> write ops or bandwidth.  So I presume the disks *could* do that much
> (at least ops-wise) for client traffic as well.
> 
> here's a snap of a 24hr graph from one server (others are similar in
> general shape):
> 
> https://snapshot.raintank.io/dashboard/snapshot/gB3FDPl7uRGWmL17NHNBCuWKGsXdiqlt
> 
Thanks for that, and per what Blair said, you're obviously hitting the end
stops for spinning rust.
Again, if you run atop with a 5s interval and a large window you should
see how busy those disks are, likely hitting 90 to 100% frequently.
Remember, one saturated OSD will bring down the performance of any PG that
references it, which often means the whole cluster.
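
For example, on an OSD node (assuming atop and sysstat are installed):
---
atop 5          # 5 second samples, watch the DSK lines for the busy %
iostat -x 5     # per-device %util and await tell the same story
---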

> (link good for 7days)
> 
> You can clearly see the low read line behind the higher write line jump
> up during the scrub window (20:00->02:00 local time here), and a much
> smaller bump around the 6am cron.daily thundering herd.
> 
Yar, I have several hundred VMs as well, and the main app they're running
used to have an hourly cronjob.
That got randomized for great justice and much de-thundering.
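
Nothing fancy needed for that either, just a random delay at the top of
the job, something like:
---
# spread an hourly cronjob across the hour instead of firing at minute 0
sleep $(( RANDOM % 3600 ))
---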

> The scrubs do impact performance, which does mean I'm over capacity, as
> I should be able to scrub without impacting production, but there's
> still a fair amount of capacity used during scrubbing that doesn't seem
> to be used outside it.
> 
To be fair, most clusters can't do full-throttle deep scrubs w/o impacting
things, but with appropriately tuned-down parameters (same for recovery)
they are fine.
Also, with bluestore the need for deep scrubs is somewhat diminished IMHO;
doing them less frequently and limiting them to low-usage times is
something to consider.
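
As a rough sketch of the sort of tuning I mean (ceph.conf [osd] section,
or the same via injectargs; the values are just a starting point matching
your 20:00->02:00 window, not gospel):
---
[osd]
# keep (deep-)scrubs to the quiet window
osd scrub begin hour = 20
osd scrub end hour = 2
# pause between scrub chunks so client IO gets a look-in
osd scrub sleep = 0.1
# deep scrub every 2 weeks instead of weekly
osd deep scrub interval = 1209600
# keep recovery/backfill gentle as well
osd max backfills = 1
osd recovery max active = 1
---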

> But looking harder, the only answer may be "buy hardware", which is a
> valid answer.
> 
More and/or better HW, yes. 

Regards,

Christian
> Thanks,
> -Jon
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


