Re: Slow ceph io. High iops. Compared to hadoop.

On Tue, Jan 17, 2012 at 11:37 AM, Andrey Stepachev <octo47@xxxxxxxxx> wrote:
> 2012/1/17 Sage Weil <sage@xxxxxxxxxxxx>:
>> On Mon, 16 Jan 2012, Andrey Stepachev wrote:
>>> Oops. It really is a buffer problem.
>>> It can be checked easily with ceph osd tell 4 bench (invocations
>>> reconstructed below):
>>>
>>> bench: wrote 1024 MB in blocks of 15625 KB in 17.115538 sec at 61264 KB/sec
>>> bench: wrote 1024 MB in blocks of 122 MB in 12.281531 sec at 85378 KB/sec
>>> bench: wrote 1024 MB in blocks of 244 MB in 13.529501 sec at 77502 KB/sec
>>> bench: wrote 3814 MB in blocks of 488 MB in 30.909198 sec at 123 MB/sec
>>>
>>> and in the last case dstat shows 'iozone-like iops':
>>>
>>>  100: 100: 100: 100|  0   0  94   6   0   0|   0   538M|   0   238
>>>  100: 100: 100: 100|  0   0  96   3   0   0|   0   525M|   0   133
>>>  100: 100: 100: 100|  0   2  95   3   0   0|   0   497M|   0   128
>>> 18.0:39.0:30.0:27.0|  0   3  96   2   0   0|   0   144M|   0  40.0
>>> 75.0:74.0:72.0:67.0|  0   3  95   2   0   0|   0   103M|   0  2698
>>>  100: 100: 100: 100|  0  13  83   4   0   0|   0   484M|   0   125
>>>  100: 100: 100: 100|  0   0 100   0   0   0|   0   486M|   0   124
>>>  100: 100: 100: 100|  0   3  88   9   0   0|   0   476M|   0   123
>>>
>>> Now the question arises: how can Ceph be tuned to reach this
>>> performance in normal operation, not just in the bench?
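
(For reference: the varying block sizes above imply the bench was run
with its optional byte-count arguments. A reconstruction of what those
invocations may have looked like; the exact argument order is an
assumption here and varies between Ceph versions:

    ceph osd tell 4 bench 1073741824 16000000    # 1024 MB total, ~16 MB blocks
    ceph osd tell 4 bench 1073741824 128000000   # 1024 MB total, ~122 MB blocks
    ceph osd tell 4 bench 1073741824 256000000   # 1024 MB total, ~244 MB blocks
    ceph osd tell 4 bench 4000000000 512000000   # ~3814 MB total, ~488 MB blocks

These byte counts match the block sizes reported in the bench output.)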
>>
>> This may be related to how your OSD journaling is configured.  I'm
>> guessing it's set to a file inside the btrfs volume holding the data?
>
> Yes. After some investigation I found that it is more related to the
> fact that FileStore processes incoming data in transactions.
> So we have many relatively small transactions (~1-4 MB).
> By contrast, in Hadoop we have no transactions, just a
> simple stream of packets which is written to a file.
Ceph has a lot of design goals which mean it's not aimed quite as
squarely at MapReduce as HDFS is, hence these sorts of differences.

Can you make sure that your RADOS objects are 64MB in size? They
should be set to that, but it's possible something in the path got
broken. Otherwise we may need to set up configuration options for the
size of things to send over the wire... :/
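
One way to spot-check object sizes from the command line, assuming the
default 'data' pool of that era (an assumption; substitute your pool
and an actual object name):

    rados -p data ls | head -5        # list a few object names in the pool
    rados -p data stat OBJECT_NAME    # prints the object's mtime and size

If large files are showing objects well under 64 MB, the layout/striping
settings in the Hadoop client path are worth a look.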


> To achieve good performance we can:
> a) of course move the journal away (but I can't test such a
> configuration; my hardware is limited to one big RAID).
> b) think about the possibility of coalescing transactions for
> the same object into one bigger transaction (see the sketch below).
> Even if we move the journal away we'll still write data
> in small pieces (as the test shows, instead of tens to hundreds of
> megabytes we'll write 1-4 MB per request).
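
A minimal standalone sketch (C++17) of the coalescing idea in (b). It
is illustrative only, not Ceph's actual FileStore code; the
queue_write/flush names and the per-object map are invented for the
example:

    // Coalesce contiguous small writes to the same object into one
    // larger write, so one big transaction replaces many 1-4 MB ones.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    struct Write { uint64_t off; std::vector<char> data; };

    // Pending writes, keyed by object name; contiguous appends merge.
    std::map<std::string, std::vector<Write>> pending;

    void queue_write(const std::string& obj, uint64_t off,
                     std::vector<char> data) {
        auto& q = pending[obj];
        // If the new write starts exactly where the last one ends,
        // extend it in place instead of queueing another transaction.
        if (!q.empty() && q.back().off + q.back().data.size() == off) {
            q.back().data.insert(q.back().data.end(),
                                 data.begin(), data.end());
            return;
        }
        q.push_back({off, std::move(data)});
    }

    // Submit one (now larger) write per merged extent.
    void flush() {
        for (auto& [obj, writes] : pending)
            for (auto& w : writes)
                std::cout << obj << ": write " << w.data.size()
                          << " bytes at offset " << w.off << "\n";
        pending.clear();
    }

    int main() {
        // Four contiguous 4 MB requests collapse into one 16 MB write.
        for (int i = 0; i < 4; ++i)
            queue_write("obj.0001", uint64_t(i) * (4 << 20),
                        std::vector<char>(4 << 20, 'x'));
        flush();
    }

In a real store the flush would of course feed the journal/FileStore
transaction path rather than stdout.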
Yeah; Ceph likes 4 MB writes; that's its default for many, many things.
I'm surprised this is causing trouble for your RAID controller, though!
Could you maybe set it up in a different mode and save one disk for journaling?
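
If a spare disk or partition can be freed up, the journal move in (a)
is just a ceph.conf change; a hedged sketch, assuming the old-style
option names and $id substitution (the device path is a placeholder):

    [osd]
        osd journal = /dev/sdb          ; dedicated device for the journal
        ; or, if it must remain a file outside the data fs:
        ; osd journal = /srv/journal-$id
        osd journal size = 1024         ; MB; typically ignored for block devices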

> I may be completely wrong, but what I see in the code and in the
> logs suggests to me that I'm right.

> (BTW, did you miss my Hadoop-related patches,
> or is something wrong with them?)
Sorry for the delay; we had a long weekend and I didn't get to them
last week. I'm building them right now and will push once I've
finished (after lunch probably, we've got a thing). :)
-Greg