> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of J David
> Sent: 24 April 2015 18:41
> To: Nick Fisk
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Having trouble getting good performance
>
> On Fri, Apr 24, 2015 at 10:58 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > 7.2k drives tend to do about 80 iops at 4kb IO sizes; as the IO size
> > increases, the number of iops will start to fall. You will probably get
> > around 70 iops for 128kb. But please benchmark your raw disks to get
> > some accurate numbers if needed.
> >
> > Next, when you use on-disk journals you write first to the journal and
> > then write the actual data. There is also a small levelDB write which
> > stores Ceph metadata, so depending on IO size you will get slightly
> > less than half the native disk performance.
> >
> > You then have 2 copies, and as Ceph won't ACK until both copies have
> > been written, the average latency will tend to stray upwards.
>
> What is the purpose of the journal if Ceph waits for the actual write to
> complete anyway?

The Ceph documentation does a far better job of explaining it than I ever could:

http://ceph.com/docs/master/rados/configuration/journal-ref/

> I.e. with a hardware raid card with a BBU, the raid card tells the host that the
> data is guaranteed safe as soon as it has been written to the BBU.

Yep, it's exactly the same with Ceph journals, but with a BBU you have cache memory whose latency is measured in microseconds, whereas with journals on disks you have latency measured in tens of milliseconds. That's why SSD journals are recommended; they effectively act as BBUs for each disk.

As a side note, the other option is to get a RAID controller that still uses its BBU in JBOD mode; this gives similar performance to SSD journals.

> Does this also mean that all the writing internal to Ceph happens
> synchronously? I.e.
> all these operations are serialized:
>
> copy1-journal-write -> copy1-data-write -> copy2-journal-write ->
> copy2-data-write -> OK, client, you're done.
>
> Since copy1 and copy2 are on completely different physical hardware,
> shouldn't those operations be able to proceed more or less independently?
> And shouldn't the client be done as soon as the journal is written? I.e.:
>
> copy1-journal-write -v- copy1-data-write
> copy2-journal-write -|- copy2-data-write
>                      +-> OK, client, you're done
>
> If so, shouldn't the effective latency be that of one operation, not four? Plus
> all the non-trivial overhead for scheduling, LevelDB, network latency, etc.

Yes, writing to all the copies does happen concurrently, but there is still a penalty to pay. When you perform a write you contact the primary OSD, which then informs the replica(s) to perform the write as well. Whilst the writes happen concurrently, one still starts slightly after the other, and so latency rises. You also need to consider that for each additional copy, the average time the disk heads have to wait for the spindle to be in the right position also goes up. It's the same with RAID1: write latency will always be slightly worse than a single disk's.

And yes, the client will receive acknowledgement as soon as the data is on the journal. But taking an average over a few seconds, the disk might also be flushing the journal to disk, so you effectively see half the performance.

Don't underestimate the latency in the network and in Ceph itself, which, whilst it makes a minimal impact compared to the disks themselves, is still relevant. Compare ping times with a 64-byte payload and a 64,000-byte payload, and then consider that you have 2 network hops (client->OSD1->OSD2)... it all adds up.
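To put rough numbers on it, here is a toy model of a single size=2 write with collocated journals. All the figures are assumptions for illustration (roughly 12ms for a journal write on a 7.2k spindle, roughly 0.2ms per network hop), not measurements, so don't take the exact outputs as gospel:

```python
# Toy model of one replicated write: size=2, collocated journals.
# Assumed figures: ~12 ms per journal write on a 7.2k disk, ~0.2 ms per hop.

JOURNAL_MS = 12.0   # assumed service time for one journal write on disk
NET_HOP_MS = 0.2    # assumed one-way network latency per hop

def replicated_write_ms():
    # The client contacts the primary; the primary forwards to the replica.
    # The two journal writes overlap, but the replica's starts one hop later,
    # and the ACK needs BOTH journals, so latency is set by the slower path.
    primary_path = NET_HOP_MS + JOURNAL_MS                # client -> OSD1
    replica_path = NET_HOP_MS + NET_HOP_MS + JOURNAL_MS   # client -> OSD1 -> OSD2
    return max(primary_path, replica_path)

def effective_iops_one_thread():
    # While the journal is also being flushed to the data area of the same
    # disk, the spindle does ~2 writes per client write, so sustained
    # single-threaded iops is roughly half the reciprocal of the latency.
    return 1000.0 / replicated_write_ms() / 2
```

So even with the replica writes overlapping, a single thread lands in the 25-40 iops region, which lines up with what I'd expect from a cluster like this.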
> For the "getting jackhammered by zillions of clients" case, your estimate
> probably holds more true, because even if writes aren't in the critical path
> they still happen and sooner or later the drive runs out of IOPs and things
> start getting in each others' way. But for a single client, single thread case
> where the cluster is *not* 100% utilized, shouldn't the effective latency be
> much less?

Don't take this as gospel; benchmarks will give you a more accurate answer. A single-threaded operation to a cluster with replica size of 2 and collocated journals will probably see an average latency somewhere in the region of 20-30ms, which is around 30-50 iops. However, these figures can and will change based on the IO profile presented to the cluster. If you search for the CERN presentation, they were using 5900RPM disks without SSD journals and were seeing something like 40ms average latency on writes, and they had thousands of disks.

> The other thing about this that I don't quite understand, and the thing that
> initially had me questioning whether there was something wrong on the Ceph
> side, is that your estimate is based primarily on the mechanical capabilities
> of the drives. Yet, in practice, when the Ceph cluster is tapped out for I/O
> in this situation, iostat says none of the physical drives are more than
> 10-20% busy and doing 10-20 IOPs to write a couple of MB/sec. And those are
> the "loaded" ones at any given time. Many are <10%. In fact, *none* of the
> hardware on the Ceph side is anywhere close to fully utilized. If the
> performance of this cluster is limited by its hardware, shouldn't there be
> some evidence of that somewhere?

OK, this is interesting. Can I just confirm: is this during the fio run with iodepth=64, or during the ZFS receives? Did you manage to get fio with the RBD engine installed? Performing benchmarks directly against Ceph may reveal something.
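For a single outstanding IO, latency and iops are simply reciprocals of each other, which is where those figures come from:

```python
def single_thread_iops(avg_latency_ms):
    # With one IO in flight at a time, iops is just the reciprocal of the
    # average latency (1000 ms per second / ms per IO).
    return 1000.0 / avg_latency_ms

# 20-30 ms average latency brackets the 30-50 iops range quoted above,
# and CERN's ~40 ms works out to ~25 iops per client thread.
```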
> To illustrate, I marked a physical drive out and waited for things to settle
> down, then ran fio on the physical drive (128KB randwrite
> numjobs=1 iodepth=1). It yields a very different picture of the drive's
> physical limits.
>
> The drive during "maxxed out" client writes:
>
> Device: rrqm/s wrqm/s   r/s    w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdl       0.00   0.20  4.80  13.40  23.60  2505.65   277.94     0.26 14.07   16.08   13.34  6.68 12.16
>
> The same drive under fio:
>
> Device: rrqm/s wrqm/s   r/s    w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdl       0.00   0.00  0.00 377.50   0.00 48320.00   256.00     0.99  2.62    0.00    2.62  2.62 98.72

Are you running the test against a file, a small region of the disk, or the whole device? If it's the latter, you have a very fast disk!!! If it's just a file, or you have run with "size=X", you are probably seeing the effects of "short stroking", where the heads are more or less stationary and not having to seek across the disk. If you run fio with libaio and direct=1 and point it at the disk device, you will get a truer picture of disk performance. But be warned: this will overwrite the disk's contents.

If it's useful, here are the fio random read results I got when I was building my cluster. The disk is a 3TB WD Red Pro 7.2k:

Raw disk performance (4k Random Read)    83 IOPs
Raw disk performance (64k Random Read)   81 IOPs
Raw disk performance (256k Random Read)  73 IOPs
Raw disk performance (1M Random Read)    52 IOPs
Raw disk performance (4M Random Read)    25 IOPs

> You could make the argument that we are seeing half the throughput on
> the same test because Ceph is write-doubling (journal+data) and the reason
> no drive is highly utilized is because the load is being spread out. So each of
> 28 drives actually is being maxed out, but only 3.5% of the time, leading to
> low apparent utilization because the measurement interval is too long. And
> maybe that is exactly what is happening.
> For that to be true, the two OSD
> writes would indeed have to happen in parallel, not sequentially. (Which is
> what it's supposed to do, I believe?)
>
> But why does a client have to wait for both writes? Isn't the journal enough?
> If it isn't, shouldn't it be? And if it isn't, wouldn't moving to even an infinitely
> fast SSD journal only double the performance, since the second write still has
> to happen?

It's not that the client has to wait for both writes; it's that there may be two writes happening on the same disk, which halves the performance potential of that disk.

Yes and no: the journal helps to coalesce the writes so that the HDD can write larger blocks at higher queue depths. In theory you can send in a single-threaded stream of 4kb IOs to the journal (SSD or collocated) and the journal then flushes more efficiently to the HDD. In the collocated case, though, the disk heads end up constantly flicking between the journal and the data part of the disk.

> In case they are of interest, the native drive fio results are below.
>
> testfile: (groupid=0, jobs=1): err= 0: pid=20562
>   write: io=30720MB, bw=47568KB/s, iops=371, runt=661312msec
>     slat (usec): min=13, max=4087, avg=34.08, stdev=25.36
>     clat (usec): min=2, max=736605, avg=2650.22, stdev=6368.02
>      lat (usec): min=379, max=736640, avg=2684.80, stdev=6368.00
>     clat percentiles (usec):
>      |  1.00th=[  466],  5.00th=[ 1576], 10.00th=[ 1800], 20.00th=[ 1992],
>      | 30.00th=[ 2128], 40.00th=[ 2224], 50.00th=[ 2320], 60.00th=[ 2416],
>      | 70.00th=[ 2512], 80.00th=[ 2640], 90.00th=[ 2864], 95.00th=[ 3152],
>      | 99.00th=[10688], 99.50th=[20352], 99.90th=[29056], 99.95th=[29568],
>      | 99.99th=[452608]
>     bw (KB/s)  : min= 1022, max=88910, per=100.00%, avg=47982.41, stdev=7115.74
>     lat (usec) : 4=0.01%, 500=1.52%, 750=1.23%, 1000=0.14%
>     lat (msec) : 2=17.32%, 4=76.47%, 10=1.41%, 20=1.40%, 50=0.49%
>     lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
>   cpu          : usr=0.56%, sys=1.21%, ctx=252044, majf=0, minf=21
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=245760/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
>   WRITE: io=30720MB, aggrb=47567KB/s, minb=47567KB/s, maxb=47567KB/s,
>   mint=661312msec, maxt=661312msec
>
> Disk stats (read/write):
>   sdl: ios=0/245789, merge=0/0, ticks=0/666944, in_queue=666556, util=98.28%
>
> Thanks!
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com