Re: Slow ceph io. High iops. Compared to hadoop.

2012/1/17 Sage Weil <sage@xxxxxxxxxxxx>:
> On Mon, 16 Jan 2012, Andrey Stepachev wrote:
>> Oops. It is really a buffer problem.
>> It can easily be checked with ceph osd tell 4 bench
>>
>> bench: wrote 1024 MB in blocks of 15625 KB in 17.115538 sec at 61264 KB/sec
>> bench: wrote 1024 MB in blocks of 122 MB in 12.281531 sec at 85378 KB/sec
>> bench: wrote 1024 MB in blocks of 244 MB in 13.529501 sec at 77502 KB/sec
>> bench: wrote 3814 MB in blocks of 488 MB in 30.909198 sec at 123 MB/sec
>>
>> and in the last case dstat shows 'iozone-like iops'
>>
>>  100: 100: 100: 100|  0   0  94   6   0   0|   0   538M|   0   238
>>  100: 100: 100: 100|  0   0  96   3   0   0|   0   525M|   0   133
>>  100: 100: 100: 100|  0   2  95   3   0   0|   0   497M|   0   128
>> 18.0:39.0:30.0:27.0|  0   3  96   2   0   0|   0   144M|   0  40.0
>> 75.0:74.0:72.0:67.0|  0   3  95   2   0   0|   0   103M|   0  2698
>>  100: 100: 100: 100|  0  13  83   4   0   0|   0   484M|   0   125
>>  100: 100: 100: 100|  0   0 100   0   0   0|   0   486M|   0   124
>>  100: 100: 100: 100|  0   3  88   9   0   0|   0   476M|   0   123
>>
>> Now the question arises: how can ceph be tuned to achieve such
>> performance in normal operation, not just in bench?
>
> This may be related to how your OSD journaling is configured.  I'm
> guessing it's set to a file inside the btrfs volume holding the data?

Yes. After some investigation I found that it is more related to the
fact that FileStore processes incoming data in transactions.
So we have many relatively small transactions (~1-4 MB).
By contrast, in hadoop we have no transactions, just a
simple stream of packets which are written to a file.

To achieve good performance we can:
a) of course move the journal away (but I can't test such a
configuration, my hardware is limited to one big raid); see the
ceph.conf sketch below.
b) think about the possibility of coalescing transactions
for the same object into one bigger transaction.
Even if we move the journal away we will still write data
in small pieces (as the test shows, instead of tens to hundreds of
megabytes we will write 1-4 MB per request).
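
What I mean in (a) would be a ceph.conf change along these lines; the
device path is only a placeholder, and I can't actually verify this on
my hardware since everything here sits on the one big raid:

    [osd]
        ; journal on a raw partition outside the btrfs data volume
        osd journal = /dev/sdX4
        ; journal size in MB (only used when the journal is a plain file)
        osd journal size = 1024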

I may be completely wrong, but what I see in the code and in the
logs suggests to me that I'm right.
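
If someone wants to check the same thing, the write pattern should be
visible in the osd log with filestore/journal debugging turned up,
something like the fragment below (I'm quoting the debug levels from
memory, so treat them as approximate):

    [osd]
        debug filestore = 20
        debug journal = 20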

(BTW, did you miss my hadoop-related patches,
or is something wrong with them?)

>
> sage
>
>>
>> 2012/1/16 Andrey Stepachev <octo47@xxxxxxxxx>:
>> > Hi all.
>> >
>> > Last week I investigated the status of hadoop on ceph.
>> > I created some patches to fix some bugs and crashes.
>> > It looks like it works. Even hbase works on top.
>> >
>> > For reference, all sources and patches are here:
>> >
>> > https://github.com/octo47/hadoop-common/tree/branch-1.0-ceph
>> > https://github.com/octo47/ceph/tree/v0.40-hadoop
>> >
>> > After YCSB and TestDFSIO ran without crashes I started investigating
>> > performance.
>> >
>> > I have a 5-node cluster with 4 sata disks per node (btrfs, raid)
>> > and 24 cores on each. iozone shows up to 520 MB/s.
>> >
>> > Performance differs by 2-3x. After some tests I see a strange thing:
>> > hadoop uses the disks much like iozone does: a small number of iops
>> > and high throughput (same as iozone).
>> > ceph uses them very inefficiently: a huge number of iops and up to 3
>> > times less throughput (I think because of the high number of iops).
>> >
>> > hadoop dstat output:
>> > sda--sdb--sdc--sdd- ----total-cpu-usage---- -dsk/total- --io/total-
>> > util:util:util:util|usr sys idl wai hiq siq| read  writ| read  writ
>> >  100: 100: 100: 100|  1   5  83  11   0   0|   0   529M|   0   247
>> >  100: 100: 100: 100|  1   0  83  16   0   0|   0   542M|   0   168
>> >  100: 100: 100: 100|  1   0  81  18   0   0|  28k  518M|6.00   149
>> >  100: 100: 100: 100|  1   4  77  17   0   0|   0   533M|   0   243
>> >  100: 100: 100: 100|  1   3  83  13   0   0|   0   523M|   0   264
>> >
>> > ceph dstat output:
>> > ===================================================
>> > sda--sdb--sdc--sdd- ----total-cpu-usage---- -dsk/total- --io/total-
>> > util:util:util:util|usr sys idl wai hiq siq| read  writ| read  writ
>> > 68.0:70.0:79.0:76.0|  1   2  93   4   0   0|   0   195M|   0  1723
>> > 86.0:85.0:93.0:91.0|  1   2  91   5   0   0|   0   226M|   0  1816
>> > 85.0:85.0:85.0:84.0|  1   3  92   4   0   0|   0   235M|   0  2316
>> >
>> >
>> > So, my question is, can someone point me to:
>> > a) can it be because of an inefficient buffer size on the osd side
>> > (I tried increasing the CephOutputStream buffer to 256 KB, it did not help)
>> > b) what other problems there could be and what options I can tune
>> > to find out what is going on.
>> >
>> > PS: I can't use iozone on a kernel-mounted fs; something
>> > hangs in the kernel and only a reboot helps.
>> > In /var/log/messages I see the attached kern.log.
>> >
>> >
>> >
>> > --
>> > Andrey.
>>
>>
>>
>> --
>> Andrey.



-- 
Andrey.