Re: Having trouble getting good performance

The write is ACKed to the client as soon as it is in the journal. I suspect that the primary OSD dispatches the write to all the secondary OSDs at the same time so that they happen in parallel, but I am not an authority on that.

The journal writes data sequentially even if it comes in randomly. Data is allowed to sit in the journal for some time before it has to be flushed to the data disk, and when it is flushed the OSD can reorder and consolidate the writes into batches so that the flush is as efficient as possible.
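
To make that concrete, here is a toy sketch in Python (not Ceph code, just an illustration of the idea; the ToyJournal class and the 4 KiB write sizes are made up): writes are appended to the journal in arrival order and later flushed to the data disk sorted by offset, with contiguous extents merged.

# Toy illustration of a write-ahead journal that absorbs random writes
# in arrival order and flushes them to the data device sorted by offset,
# merging contiguous extents.  Illustrative only; this is not Ceph code.
class ToyJournal:
    def __init__(self):
        self.entries = []                    # appended sequentially, like a journal

    def write(self, offset, data):
        self.entries.append((offset, data))  # the ACK could be sent at this point

    def flush(self):
        # Sort by offset and merge contiguous extents so the data disk sees
        # a few large, mostly sequential writes instead of many random ones.
        batched = []
        for offset, data in sorted(self.entries):
            if batched and batched[-1][0] + len(batched[-1][1]) == offset:
                prev_off, prev_data = batched[-1]
                batched[-1] = (prev_off, prev_data + data)
            else:
                batched.append((offset, data))
        self.entries.clear()
        return batched

j = ToyJournal()
for off in (4096, 0, 8192, 65536):           # random arrival order
    j.write(off, b"x" * 4096)
print([(off, len(data)) for off, data in j.flush()])
# -> [(0, 12288), (65536, 4096)] : three contiguous 4 KiB writes became one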

That is why SSD journals can offer large performance improvements for Ceph: the client is ACKed as soon as the journal write is done, and much of the random write traffic is absorbed by the SSDs. Remember that with an on-disk journal the spindle has to interrupt reads to service journal writes, and the head ends up travelling all over the disk; SSD journals buffer that a lot, so the spindles can spend more time servicing reads.
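
A rough back-of-envelope of what that buys you, assuming the ACK only has to wait for the replicas' journal writes and that those proceed in parallel (all of the millisecond figures below are assumptions for illustration, not measurements from this cluster):

# Back-of-envelope latency model for a 2-replica write where the client is
# ACKed once every replica's journal write completes and the data flush is
# deferred.  All figures are assumed, illustrative values.
NET_MS         = 0.5    # client -> primary -> secondary messaging (assumed)
HDD_JOURNAL_MS = 12.0   # journal write on a busy 7.2k spindle (assumed)
SSD_JOURNAL_MS = 0.5    # journal write on a SATA SSD (assumed)

def ack_latency_ms(journal_ms, replicas=2):
    # Replica journal writes happen in parallel, so the ACK waits for the
    # slowest one rather than the sum of all of them.
    return NET_MS + max([journal_ms] * replicas)

for label, j in (("HDD journal", HDD_JOURNAL_MS), ("SSD journal", SSD_JOURNAL_MS)):
    lat = ack_latency_ms(j)
    print("%s: ~%.1f ms -> ~%d IOPS per client thread" % (label, lat, 1000 / lat))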

On Fri, Apr 24, 2015 at 11:40 AM, J David <j.david.lists@xxxxxxxxx> wrote:
On Fri, Apr 24, 2015 at 10:58 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> 7.2k drives tend to do about 80 IOPS at 4kb IO sizes; as the IO size
> increases, the number of IOPS will start to fall. You will probably get
> around 70 IOPS for 128kb. But please benchmark your raw disks to get some
> accurate numbers if needed.
>
> Next, when you use on-disk journals you write first to the journal and then
> write the actual data. There is also a small LevelDB write which stores Ceph
> metadata, so depending on IO size you will get slightly less than half the
> native disk performance.
>
> You then have 2 copies, and since Ceph won't ACK until both copies have
> been written, the average latency will tend to creep upwards.

What is the purpose of the journal if Ceph waits for the actual write
to complete anyway?

I.e., with a hardware RAID card with a BBU, the RAID card tells the
host that the data is guaranteed safe as soon as it has been written
to the battery-backed cache.

Does this also mean that all the writing internal to ceph happens synchronously?

I.e. all these operations are serialized:

copy1-journal-write -> copy1-data-write -> copy2-journal-write ->
copy2-data-write -> OK, client, you're done.

Since copy1 and copy2 are on completely different physical hardware,
shouldn't those operations be able to proceed more or less
independently?  And shouldn't the client be done as soon as the
journal is written?  I.e.:

copy1-journal-write -v- copy1-data-write
copy2-journal-write -|- copy2-data-write
                             +-> OK, client, you're done

If so, shouldn't the effective latency be that of one operation, not
four?  Plus all the non-trivial overhead for scheduling, LevelDB,
network latency, etc.
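
To put rough numbers on those two diagrams, assuming ~12 ms per journal or data write on a 7.2k spindle (a number picked purely for illustration):

# Illustrative only: assumes ~12 ms per journal or data write and ignores
# network and scheduling overhead.
OP_MS = 12.0
fully_serialized     = 4 * OP_MS   # j1 -> d1 -> j2 -> d2, one after another
parallel_journal_ack = 1 * OP_MS   # replica journals in parallel, ACK on journal
print("serialized:           %.0f ms per write" % fully_serialized)      # 48 ms
print("parallel journal ACK: %.0f ms per write" % parallel_journal_ack)  # 12 ms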

For the "getting jackhammered by zillions of clients" case, your
estimate probably holds more true, because even if writes aren't in
the critical path they still happen and sooner or later the drive runs
out of IOPs and things start getting in each others' way.  But for a
single client, single thread case where the cluster is *not* 100%
utilized, shouldn't the effective latency be much less?

The other thing about this that I don't quite understand, and the
thing that initially had me questioning whether there was something
wrong on the Ceph side, is that your estimate is based primarily on
the mechanical capabilities of the drives.  Yet in practice, when the
Ceph cluster is tapped out for I/O in this situation, iostat says none
of the physical drives is more than 10-20% busy, doing 10-20 IOPS to
write a couple of MB/sec.  And those are the "loaded" ones at any
given time; many are below 10%.  In fact, *none* of the hardware on
the Ceph side is anywhere close to fully utilized.  If the performance
of this cluster is limited by its hardware, shouldn't there be some
evidence of that somewhere?

To illustrate, I marked a physical drive out and waited for things to
settle down, then ran fio on the physical drive (128KB randwrite
numjobs=1 iodepth=1).  It yields a very different picture of the
drive's physical limits.

The drive during "maxxed out" client writes:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdl               0.00     0.20    4.80   13.40    23.60  2505.65   277.94     0.26   14.07   16.08   13.34   6.68  12.16

The same drive under fio:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdl               0.00     0.00    0.00  377.50     0.00 48320.00   256.00     0.99    2.62    0.00    2.62   2.62  98.72

You could make the argument that we are seeing half the throughput on
the same test because Ceph is write-doubling (journal+data), and the
reason no drive is highly utilized is that the load is being spread
out.  So each of the 28 drives actually is being maxed out, but only
3.5% of the time, leading to low apparent utilization because the
measurement interval is too long.  And maybe that is exactly what is
happening.  For that to be true, the two OSD writes would indeed have
to happen in parallel, not sequentially.  (Which is what they are
supposed to do, I believe?)
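
As a rough sanity check on that argument, here is the arithmetic using the iostat figures above, assuming 2 replicas with on-disk journals (so roughly 4x write amplification) and the 28 drives mentioned above:

# Back-of-envelope check of what client throughput the per-drive iostat
# numbers imply.  Assumes 28 OSD drives, 2 replicas, and on-disk journals
# (journal + data on the same spindle), i.e. roughly 4x write amplification.
DRIVES        = 28
PER_DRIVE_KBS = 2505.65   # wkB/s for sdl in the "maxxed out" iostat sample
FIO_KBS       = 48320.0   # what the same drive sustained under fio
AMPLIFICATION = 2 * 2     # 2 replicas x (journal write + data write)

total_physical_kbs = DRIVES * PER_DRIVE_KBS
client_kbs         = total_physical_kbs / AMPLIFICATION
busy_fraction      = PER_DRIVE_KBS / FIO_KBS

print("implied client throughput: ~%.1f MB/s" % (client_kbs / 1024))
print("each drive at ~%.0f%% of its fio ceiling" % (busy_fraction * 100))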

But why does a client have to wait for both writes?  Isn't the journal
enough?  If it isn't, shouldn't it be?  And if it isn't, wouldn't
moving to even an infinitely fast SSD journal only double the
performance, since the second write still has to happen?

In case they are of interest, the native drive fio results are below.

testfile: (groupid=0, jobs=1): err= 0: pid=20562
  write: io=30720MB, bw=47568KB/s, iops=371 , runt=661312msec
    slat (usec): min=13 , max=4087 , avg=34.08, stdev=25.36
    clat (usec): min=2 , max=736605 , avg=2650.22, stdev=6368.02
     lat (usec): min=379 , max=736640 , avg=2684.80, stdev=6368.00
    clat percentiles (usec):
     |  1.00th=[  466],  5.00th=[ 1576], 10.00th=[ 1800], 20.00th=[ 1992],
     | 30.00th=[ 2128], 40.00th=[ 2224], 50.00th=[ 2320], 60.00th=[ 2416],
     | 70.00th=[ 2512], 80.00th=[ 2640], 90.00th=[ 2864], 95.00th=[ 3152],
     | 99.00th=[10688], 99.50th=[20352], 99.90th=[29056], 99.95th=[29568],
     | 99.99th=[452608]
    bw (KB/s)  : min= 1022, max=88910, per=100.00%, avg=47982.41, stdev=7115.74
    lat (usec) : 4=0.01%, 500=1.52%, 750=1.23%, 1000=0.14%
    lat (msec) : 2=17.32%, 4=76.47%, 10=1.41%, 20=1.40%, 50=0.49%
    lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  cpu          : usr=0.56%, sys=1.21%, ctx=252044, majf=0, minf=21
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=245760/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=30720MB, aggrb=47567KB/s, minb=47567KB/s, maxb=47567KB/s, mint=661312msec, maxt=661312msec

Disk stats (read/write):
  sdl: ios=0/245789, merge=0/0, ticks=0/666944, in_queue=666556, util=98.28%

Thanks!
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
