Re: Slow ceph io. High iops. Compared to hadoop.

2012/1/17 Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx>:
> On Tue, Jan 17, 2012 at 11:37 AM, Andrey Stepachev <octo47@xxxxxxxxx> wrote:
>> 2012/1/17 Sage Weil <sage@xxxxxxxxxxxx>:
>>> On Mon, 16 Jan 2012, Andrey Stepachev wrote:
>>>> Oops. It is really a buffer problem.
>>>> It can be checked easily with ceph osd tell 4 bench
>>>>
>>>> bench: wrote 1024 MB in blocks of 15625 KB in 17.115538 sec at 61264 KB/sec
>>>> bench: wrote 1024 MB in blocks of 122 MB in 12.281531 sec at 85378 KB/sec
>>>> bench: wrote 1024 MB in blocks of 244 MB in 13.529501 sec at 77502 KB/sec
>>>> bench: wrote 3814 MB in blocks of 488 MB in 30.909198 sec at 123 MB/sec
>>>>
>>>> and in the last case dstat shows 'iozone-like iops'
>>>>
>>>>  100: 100: 100: 100|  0   0  94   6   0   0|   0   538M|   0   238
>>>>  100: 100: 100: 100|  0   0  96   3   0   0|   0   525M|   0   133
>>>>  100: 100: 100: 100|  0   2  95   3   0   0|   0   497M|   0   128
>>>> 18.0:39.0:30.0:27.0|  0   3  96   2   0   0|   0   144M|   0  40.0
>>>> 75.0:74.0:72.0:67.0|  0   3  95   2   0   0|   0   103M|   0  2698
>>>>  100: 100: 100: 100|  0  13  83   4   0   0|   0   484M|   0   125
>>>>  100: 100: 100: 100|  0   0 100   0   0   0|   0   486M|   0   124
>>>>  100: 100: 100: 100|  0   3  88   9   0   0|   0   476M|   0   123
>>>>
>>>> Now the question arises: how can ceph be tuned to get such
>>>> performance in normal operations, not just in bench?
>>>
>>> This may be related to how your OSD journaling is configured.  I'm
>>> guessing it's set to a file inside the btrfs volume holding the data?
>>
>> Yes. After some investigation I found that it is more related to
>> the fact that FileStore processes incoming data in transactions.
>> So we have many relatively small transactions (~1-4 MB).
>> By contrast, in hadoop we have no transactions, just a
>> simple stream of packets which are written to a file.
> Ceph has a lot of design goals which mean it's not aimed quite as
> squarely at MapReduce as HDFS is, thus these sorts of differences.

Maybe in the future it will be possible to add a separate operation
like STREAM, specifically for hadoop-like workloads.
But I can't propose a concrete idea right now for how to implement
such a solution (given the existing transactional system).

>
> Can you make sure that your RADOS objects are 64MB in size? They
> should be set to that, but it's possible something in the path got
> broken. Otherwise we may need to set up configuration options for the
> size of things to send over the wire... :/

show_layout shows the expected layout.
If I disable buffering in java and rely only on ceph (sending
in 10MB chunks), I get logs like this:
000~10000000 = 10000000
2012-01-17 14:35:41.990987 7f4a6306c700 filestore(/data/osd.1) write
0.17b_head/10000003c90.00000000/head/2c91397b 20000000~10000000
2012-01-17 14:35:41.997599 7f4a6306c700 filestore(/data/osd.1)
queue_flusher ep 5 fd 40 20000000~10000000 qlen 1
2012-01-17 14:35:41.997642 7f4a6306c700 filestore(/data/osd.1) write
0.17b_head/10000003c90.00000000/head/2c91397b 20000000~10000000 =
10000000
2012-01-17 14:35:42.192107 7f4a6286b700 filestore(/data/osd.1) write
0.17b_head/10000003c90.00000000/head/2c91397b 30000000~10000000
2012-01-17 14:35:42.197386 7f4a6286b700 filestore(/data/osd.1)
queue_flusher ep 5 fd 39 30000000~10000000 qlen 1
2012-01-17 14:35:42.197415 7f4a6286b700 filestore(/data/osd.1) write
0.17b_head/10000003c90.00000000/head/2c91397b 30000000~10000000 =
10000000
2012-01-17 14:35:42.603251 7f4a6306c700 filestore(/data/osd.1) write
0.17b_head/10000003c90.00000000/head/2c91397b 40000000~10000000

Also, it is easy to see that the buffers arrive with very close timestamps
but are queued as separate ops. FileStore issues one write for each such op
(in some cases one do_transactions processes 2-3 ops for the same object).
It looks like we could optimize this by coalescing ops by object,
but my lack of knowledge of ceph internals doesn't allow me
to implement such a prototype right now.

I could not find a way to make ceph merge incoming packets into one op.
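
In the meantime the only workaround I see is to coalesce on the client
side again. A minimal sketch of what I mean (plain java.io against the
mounted fs, not the ceph hadoop client; the path, buffer size and packet
size are only an example): buffer the hadoop-style small packets and hand
the filesystem one large write per object-sized chunk.

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class CoalescedWriter {
    // 64 MB buffer: one big flush per RADOS-object-sized chunk instead
    // of many separate 1-4 MB writes reaching the filesystem.
    private static final int BUFFER_SIZE = 64 * 1024 * 1024;

    public static void main(String[] args) throws IOException {
        // path is only an example mount point of the ceph fs
        OutputStream out = new BufferedOutputStream(
                new FileOutputStream("/mnt/ceph/testfile"), BUFFER_SIZE);
        try {
            byte[] packet = new byte[1 << 20];   // hadoop-style 1 MB packet
            for (int i = 0; i < 256; i++) {
                // accumulates in the buffer; handed to the fs as one
                // large contiguous write when the buffer fills
                out.write(packet);
            }
        } finally {
            out.close();                         // flushes whatever remains
        }
    }
}

This obviously doesn't change how FileStore queues the ops, it only makes
each write that reaches ceph much larger.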

>
>
>> To achieve good performance we can:
>> a) of course move the journal away (but I can't test such a
>> configuration; my hardware is limited to one big raid).
>> b) think about the possibility of coalescing transactions
>> for the same object into one bigger transaction.
>> Even if we move the journal away we'll still write data
>> in small pieces (as the test shows, instead of 10s-100s of megabytes
>> we will write 1-4 MB per request).
> Yeah; Ceph likes 4MB writes; that's its default for many many things.
> I'm surprised this is causing trouble for your RAID controller though!
> Could you maybe set it up in a different mode and save one disk for journaling?

I've moved the journal to tmpfs (I have plenty of RAM), and there are no
significant changes. Still high iops, limiting overall throughput.
96.0:96.0:96.0:96.0|   0   140M|   0  3075
90.0:93.0:91.0:87.0|   0   287M|   0  1434
66.0:74.0:92.0:87.0|4096B  215M|1.00  1861
51.0:53.0:77.0:73.0|   0   194M|   0  2302
41.0:40.0:40.0:41.0|   0   206M|   0  2262
95.0:92.0:94.0:94.0|   0   129M|   0  3581
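
(For reference, roughly what I mean by moving the journal -- this
ceph.conf fragment is only a sketch, the tmpfs path and journal size
are illustrative; only "osd data" matches my actual setup:)

[osd.1]
    osd data = /data/osd.1
    ; journal placed on tmpfs instead of inside the btrfs volume
    osd journal = /dev/shm/osd.1.journal
    osd journal size = 1024        ; MB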

>
>> I may be completely wrong, but what I see in the code and in
>> the logs suggests to me that I'm right.
>
>> (BTW, did you miss my hadoop-related patches,
>> or is something wrong with them?)
> Sorry for the delay; we had a long weekend and I didn't get to them
> last week. I'm building them right now and will push once I've
> finished (after lunch probably, we've got a thing). :)
Bon appétit! :)

> -Greg



-- 
Andrey.