Re: Slow ceph io. High iops. Compared to hadoop.

Oops. It really is a buffer problem.
It is easy to check with ceph osd tell 4 bench:

bench: wrote 1024 MB in blocks of 15625 KB in 17.115538 sec at 61264 KB/sec
bench: wrote 1024 MB in blocks of 122 MB in 12.281531 sec at 85378 KB/sec
bench: wrote 1024 MB in blocks of 244 MB in 13.529501 sec at 77502 KB/sec
bench: wrote 3814 MB in blocks of 488 MB in 30.909198 sec at 123 MB/sec
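
(The block sizes above come from passing explicit arguments to bench.
As far as I can tell the command takes the block size first and the
total byte count second (defaults: 4 MB blocks, 1 GB total), so the
last run was something like:

ceph osd tell 4 bench 512000000 4000000000

That argument order is my reading of the output, so double-check it
against your ceph version.)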

and in the last case dstat shows 'iozone-like' iops:

 100: 100: 100: 100|  0   0  94   6   0   0|   0   538M|   0   238
 100: 100: 100: 100|  0   0  96   3   0   0|   0   525M|   0   133
 100: 100: 100: 100|  0   2  95   3   0   0|   0   497M|   0   128
18.0:39.0:30.0:27.0|  0   3  96   2   0   0|   0   144M|   0  40.0
75.0:74.0:72.0:67.0|  0   3  95   2   0   0|   0   103M|   0  2698
 100: 100: 100: 100|  0  13  83   4   0   0|   0   484M|   0   125
 100: 100: 100: 100|  0   0 100   0   0   0|   0   486M|   0   124
 100: 100: 100: 100|  0   3  88   9   0   0|   0   476M|   0   123
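
(For reference, this and the dstat output quoted below come from an
invocation roughly like

dstat --disk-util -c -d -r

i.e. per-disk utilization, total cpu usage, total disk throughput and
total io request counts. The exact flags are approximate.)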

Now the question arises: how can ceph be tuned to get this kind of
performance in normal operation, not just in bench?
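
For anyone who wants to experiment along, these are the
filestore/journal knobs I plan to start with. This is only a sketch;
the values are guesses for this hardware, not recommendations:

[osd]
        ; let the filestore coalesce more writes between syncs
        filestore min sync interval = 1
        filestore max sync interval = 30
        ; allow more data to queue in front of the disks (ops / bytes)
        filestore queue max ops = 500
        filestore queue max bytes = 209715200
        ; bigger journal (MB) and bigger journal writes (bytes),
        ; so large client writes don't get chopped up
        osd journal size = 1000
        journal max write bytes = 104857600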

2012/1/16 Andrey Stepachev <octo47@xxxxxxxxx>:
> Hi all.
>
> Last week I investigated the status of hadoop on ceph.
> I created some patches to fix some bugs and crashes.
> It looks like it works. Even hbase works on top of it.
>
> For reference all sources and patches are here
>
> https://github.com/octo47/hadoop-common/tree/branch-1.0-ceph
> https://github.com/octo47/ceph/tree/v0.40-hadoop
>
> After YCSB and TestDFSIO ran without crashes, I started investigating
> performance.
>
> I have a 5-node cluster with 4 SATA disks per node (btrfs, RAID) and
> 24 cores on each. iozone shows up to 520 MB/s.
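> (For reference, that number is from an iozone run roughly like
>
> iozone -i 0 -r 1024k -s 16g -t 4
>
> i.e. a sequential write test with 1 MB records over 4 threads; the
> exact parameters are approximate.)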
>
> Performance differs by 2-3x. After some tests I noticed a strange thing:
> hadoop uses the disks very much like iozone does: a small number of iops
> and high throughput (the same as iozone).
> ceph uses them very inefficiently: a huge number of iops and up to 3
> times less throughput (I think because of the high iops count).
>
> hadoop dstat output:
> sda--sdb--sdc--sdd- ----total-cpu-usage---- -dsk/total- --io/total-
> util:util:util:util|usr sys idl wai hiq siq| read  writ| read  writ
>  100: 100: 100: 100|  1   5  83  11   0   0|   0   529M|   0   247
>  100: 100: 100: 100|  1   0  83  16   0   0|   0   542M|   0   168
>  100: 100: 100: 100|  1   0  81  18   0   0|  28k  518M|6.00   149
>  100: 100: 100: 100|  1   4  77  17   0   0|   0   533M|   0   243
>  100: 100: 100: 100|  1   3  83  13   0   0|   0   523M|   0   264
>
> ceph dstat output:
> sda--sdb--sdc--sdd- ----total-cpu-usage---- -dsk/total- --io/total-
> util:util:util:util|usr sys idl wai hiq siq| read  writ| read  writ
> 68.0:70.0:79.0:76.0|  1   2  93   4   0   0|   0   195M|   0  1723
> 86.0:85.0:93.0:91.0|  1   2  91   5   0   0|   0   226M|   0  1816
> 85.0:85.0:85.0:84.0|  1   3  92   4   0   0|   0   235M|   0  2316
>
>
> So, my question is, can someone tell me:
> a) can it be because of an inefficient buffer size on the osd side?
> (I tried increasing the CephOutputStream buffer to 256 KB; it doesn't help.)
> b) what other problems could there be, and what options can I tune
> to find out what is going on (one idea sketched below)?
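>
> The idea for b) so far: turn up the osd-side debug logging in
> ceph.conf to see where the writes get split up, e.g.:
>
> [osd]
>         debug osd = 20
>         debug filestore = 20
>         debug journal = 20
>         debug ms = 1
>
> (20 is very verbose, so only for short runs.)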
>
> PS: I can't run iozone on a kernel-mounted fs. Something
> hangs in the kernel, and only a reboot helps.
> In /var/log/messages I see the attached kern.log.
>
>
>
> --
> Andrey.



-- 
Andrey.

