Re: Poor read performance.

Hello,

On Tue, 24 Apr 2018 12:52:55 -0400 Jonathan Proulx wrote:

> Hi All,
> 
> I seem to be seeing consistently poor read performance on my cluster
> relative to both write performance and the read performance of a single
> backend disk, by quite a lot.
> 
> The cluster is Luminous with 174 7.2k SAS drives across 12 storage servers
> with 10G ethernet and jumbo frames.  Drives are a mix of 4T and 2T,
> bluestore with DB on SSD.
> 
How much RAM do these hosts have, and have you changed the default cache
settings of BlueStore?
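For reference, this is a quick way to see what an OSD is actually running
with (osd.0 and the grep pattern are just examples, adjust for your hosts):
---
# Dump the running config of one OSD via its admin socket and look at the
# BlueStore cache settings (Luminous defaults: 1 GiB per HDD OSD, 3 GiB per
# SSD OSD).
ceph daemon osd.0 config show | grep bluestore_cache
---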

> The performance I really care about is over rbd for VMs in my
> OpenStack, but 'rbd bench' seems to line up pretty well with 'fio' tests
> inside VMs, so here is a more or less typical random write rbd bench (from
> a monitor node with a 10G connection on the same net as the OSDs):
>

"rbd bench" does things differently than fio (lots of happy switches
there), so to make absolutely sure you're not comparing apples and
oranges I'd suggest you stick to fio in a VM.
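For the read side, a sketch of an fio job to run inside a VM (these are
simply the parameters of the write job I use further down, switched to
randread; size and iodepth are just what I happened to pick):
---
fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 \
    --rw=randread --name=fiojob --blocksize=4K --iodepth=32
---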

For example, your write "rbd bench" on my crap test cluster with 6 nodes,
4 ancient SATA drives each, and 1Gb/s links (but Luminous and bluestore)
will get all the HDDs nearly 100% busy:
---
elapsed:   152  ops:   262144  ops/sec:  1715.01  bytes/sec: 7024699.12
---

In comparison, this fio job:
---
fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32
---

will only result in this, due to the network latencies incurred by direct
I/O and only one OSD at a time being busy:
---
  write: io=110864KB, bw=1667.2KB/s, iops=416, runt= 66499msec
---

The fio result is the important one, as IOPS for a single client won't get
any faster than this; the bench result is good for finding the upper limit
of the cluster.
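In case the two output formats are confusing, IOPS times block size gives
the bandwidth figures above (back-of-the-envelope only):
---
echo "416 * 4" | bc        # ~1664 KB/s, matches fio's 1667.2KB/s
echo "1715 * 4096" | bc    # ~7024640 bytes/sec, matches the rbd bench line
---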

In contrast, the reads look like this for bench:
---
elapsed:    40  ops:   262144  ops/sec:  6430.08  bytes/sec: 26337591.86
---

and this for fio:
---
  read : io=1024.0MB, bw=86717KB/s, iops=21679, runt= 12092msec
---

The latter is clearly being served from the OSD caches, which is very
visible with atop on the OSD hosts.

That being said, something is indeed rather wrong here; my crappy test
cluster shouldn't be able to outperform yours.

> rbd bench  --io-total=4G --io-size 4096 --io-type write \
> --io-pattern rand --io-threads 16 mypool/myvol
> 
> <snip />
> 
> elapsed:   361  ops:  1048576  ops/sec:  2903.82  bytes/sec: 11894034.98
> 
> the same for random read is an order of magnitude lower:
> 
> rbd bench  --io-total=4G --io-size 4096 --io-type read \
> --io-pattern rand --io-threads 16  mypool/myvol
> 
> elapsed:  3354  ops:  1048576  ops/sec:   312.60  bytes/sec: 1280403.47
> 
> (sequential reads and a bigger io-size help, but not a lot)
> 
> ceph -s from during the read bench, to get a sense of the relative traffic:
> 
>   cluster:
>     id:     <UUID>
>     health: HEALTH_OK
>  
>   services:
>     mon: 3 daemons, quorum ceph-mon0,ceph-mon1,ceph-mon2
>     mgr: ceph-mon0(active), standbys: ceph-mon2, ceph-mon1
>     osd: 174 osds: 174 up, 174 in
>     rgw: 3 daemon active
>  
>   data:
>     pools:   19 pools, 10240 pgs
>     objects: 17342k objects, 80731 GB
>     usage:   240 TB used, 264 TB / 505 TB avail
>     pgs:     10240 active+clean
>  
>   io:
>     client:   4296 kB/s rd, 417 MB/s wr, 1635 op/s rd, 3518 op/s wr
> 
> 
> During deep-scrubs overnight I can see the disks doing >500MBps reads
> and ~150 read IOPS (each at peak), while during the read bench (including
> all traffic from ~1k VMs) individual OSD data partitions peak around 25
> read IOPS and 1.5MBps read bandwidth, so it seems like there should be
> performance to spare.
> 
OK, there are a couple of things here.
1k VMs?!?
One assumes that they're not idle, looking at the output above.
And writes will compete with reads on the same spindles, of course.
"performance to spare" you say, but have you verified this with iostat or
atop?
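To be concrete about what I mean (the interval is just an example), run
this on an OSD host while the read bench and your VM traffic are going and
look at r/s and %util on the data disks:
---
iostat -x 5
# or, interactively, with per-disk and per-process detail:
atop 2
---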


> Obviously, given my disk choices, this isn't designed as a particularly
> high-performance setup, but I do expect a bit more performance out of
> it.
> 
> Are my expectations wrong? If not, any clues as to what I've done (or
> failed to do) that is wrong?
> 
> Pretty sure read/write was much more symmetric in earlier versions (a
> subset of the same hardware with the filestore backend), but I used a
> different perf tool so I don't want to make direct comparisons.
>

It could be as simple as filestore having had lots of pagecache to work
with, which helps dramatically with (repeated) reads.
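If the hosts have RAM to spare, one thing you could try (a sketch, not a
recommendation; the 4 GiB figure is just an example you would size to your
actual free memory) is raising the BlueStore cache for the HDD OSDs:
---
# in ceph.conf, [osd] section, then restart the OSDs:
#   bluestore_cache_size_hdd = 4294967296    # 4 GiB per OSD (example value)
# afterwards, confirm what an OSD is actually running with:
ceph daemon osd.0 config get bluestore_cache_size_hdd
---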

But without a quiescent cluster, pinning this down might be difficult.

Christian

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


