Re: radosgw performance

On Mon, Feb 18, 2013 at 2:53 AM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx> wrote:
> Answering in reverse order:
>
> We are using mod_fastcgi almost exactly in the documented way.
>
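For anyone comparing notes, the documented mod_fastcgi arrangement boils down
to roughly the following; the socket path, the s3gw.fcgi name and the
client.radosgw.gateway section name are just the usual examples from the docs,
not anything taken from this thread:

  # Apache: hand requests to radosgw over a FastCGI socket
  FastCgiExternalServer /var/www/s3gw.fcgi -socket /var/run/ceph/radosgw.sock
  RewriteEngine On
  RewriteRule ^/(.*) /s3gw.fcgi?%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]

  # /var/www/s3gw.fcgi: thin wrapper that execs the gateway
  #!/bin/sh
  exec /usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway
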
> I have deleted and recreated these pools with the following pg_nums
> .rgw.buckets 1024
> .rgw 64
> .rgw.gc 64
> .rgw.control 64
> .users.uid 64
> .users 64
> We currently have 32 OSDs but are planning to increase that to 100+ shortly.
>
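For reference, recreating the pools with those pg_nums boils down to commands
like the following (pgp_num set equal to pg_num). The usual rule of thumb is
on the order of ~100 PGs per OSD summed over all pools, so these numbers look
sane for 32 OSDs and still reasonable at 100+:

  ceph osd pool create .rgw.buckets 1024 1024
  ceph osd pool create .rgw 64 64
  ceph osd pool create .rgw.gc 64 64
  ceph osd pool create .rgw.control 64 64
  ceph osd pool create .users.uid 64 64
  ceph osd pool create .users 64 64
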
> The default 'admin socket' /var/run/ceph/$cluster-$name.asok does not exist
> for rgw, and 'lsof -p <pid_of_rgw>' does not show one either.
>
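The gateway usually only gets an admin socket if you ask for it explicitly in
its ceph.conf section and restart it; a minimal sketch, assuming the instance
runs as client.radosgw.gateway:

  [client.radosgw.gateway]
      admin socket = /var/run/ceph/$cluster-$name.asok

  # after restarting radosgw:
  ceph --admin-daemon /var/run/ceph/ceph-client.radosgw.gateway.asok help
  ceph --admin-daemon /var/run/ceph/ceph-client.radosgw.gateway.asok perf dump
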
> Correlating the logs revealed some interesting stuff though. With 'debug ms = 1',
> the rgw log shows a few more lines (summarizing for readability):
>
> 01:17:08.371741 TID 20 get_obj_state: rctx=0x7f8624002520 obj=b:key ...
> 01:17:08.371772 TID  1 -- rgw/1005379 --> osd.19/19507 -- osd_op(cli:316
> datafile [getxattrs,stat]...
> 01:17:08.373125 TID 20 prepare_atomic_for_write_impl: state is not atomic.
> state=0x7f8624010378
> 01:17:08.373241 TID  1 -- rgw/1005379 --> osd.3/7602 -- osd_op(cli:317
> .dir.4470.1 [call rgw.bucket_prepare_op]...
> 01:17:08.741306 TID  1 -- rgw/1005379 --> osd.19/19507 -- osd_op(cli:387
> datafile [create 0~0 ...
> 01:17:08.745096 TID  1 -- rgw/1005379 --> osd.3/7602 -- osd_op(cli:388
> .dir.4470.1 [call rgw.bucket_complete_op]...
> 01:17:08.745175 TID  2 req 58:0.373932:s3:PUT /b/key:put_obj:http status=200
>
> This shows that the 0.37 sec PUT operation is almost entirely the
> bucket_prepare_op for .dir.4470.1 on osd.3, issued as client.6515.0:317,
> which is shortened to cli:317 above. The relevant parts of the debug log on
> osd.3 go like this (again summarizing):
>
> 01:17:08.379578 TID1  1 -- osd.3/7602 <== client.6515 rgw/1005379 114 ====
> osd_op(cli:317 .dir.4470.1 [call rgw.bucket_prepare_op] ...
> 01:17:08.379666 TID1 15 osd.3 288 enqueue_op ... latency 0.000243
> osd_op(cli:317 .dir.4470.1 [call rgw.bucket_prepare_op] ...
> 01:17:08.675843 TID2 10 osd.3 288 dequeue_op ... latency 0.296420
> osd_op(cli:317 .dir.4470.1 [call rgw.bucket_prepare_op] .. pg[10.11...]
> 01:17:08.675987 TID2 10 osd.3 pg_epoch: 288 pg[10.11...]  taking
> ondisk_read_lock
> 01:17:08.736329 TID2 10 osd.3 pg_epoch: 288 pg[10.11...] do_osd_op
> d91dcc11/.dir.4470.1/head//10 [call rgw.bucket_prepare_op]
> 01:17:08.736884 TID2  1 -- osd.3/7602 --> osd.26 osd.26/3582 --
> osd_sub_op(cli:317 10.11 d91dcc11/.dir.4470.1/head//10 ...
> 01:17:08.738700 TID3  1 -- osd.3/7602 <== osd.26 osd.26/3582 134 ====
> osd_sub_op_reply(cli:317 10.11 ...
> 01:17:08.738783 TID3 15 osd.3 288 enqueue_op ... latency 0.000229
> osd_sub_op_reply(cli:317 10.11 d91dcc11/.dir.4470.1/head//10 ...
> 01:17:08.746120 TID2 10 osd.3 288 dequeue_op ... latency 0.007566
> osd_sub_op_reply(cli:317 10.11 d91dcc11/.dir.4470.1/head//10 ...
> 01:17:08.746417 TID4  1 -- osd.3/7602 --> rgw/1005379 -- osd_op_reply(317
> .dir.4470.1 [call rgw.bucket_prepare_op] ack = 0) ...
>
> This shows the OSD receiving the request from rgw and queuing it almost
> immediately. 0.29 secs pass before this operation gets a chance to actually
> run. It

Yeah, that's the culprit. That's around 120 leveldb write operations
per second. Sam, does that number make sense to you?
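
One way to cross-check that from the OSD side is its admin socket: 'perf dump'
includes op and journal latency counters (exact counter names vary a bit
between versions), and recent enough builds also have 'dump_ops_in_flight' to
catch requests sitting in the queue:

  ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok perf dump
  ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok dump_ops_in_flight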

> tries to get a read lock, which adds roughly a 0.06 second delay. Replication
> to the other OSD is super fast, and after a few milliseconds rgw is notified
> that the bucket_prepare_op is complete. All of this makes sense, since
> 'rados bench' does not use any bucket_prepare or bucket_complete and is hence
> faster.
>
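As a baseline for that comparison, a plain RADOS write bench with smallish
objects skips the index updates entirely; something like the following (the
pool name, runtime, concurrency and object size are arbitrary, and bench
writes benchmark_data objects into the pool, so a scratch pool with the same
pg_num is safer than .rgw.buckets itself):

  rados bench -p scratch 60 write -t 16 -b 4096
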
> What I read here at an architectural level is that all rgw operations must
> pass through the .dir.<bucket-id> object, which happens to be a scalability
> killer for large buckets, or buckets with a lot of activity for that matter.
> For a

Not all operations, only write operations.

> large number of operations, it won't matter how many OSDs you have, since
> this object will always sit in a single PG. Is that correct? Can changing
> (increasing) op threads or disk threads make any difference in that regard?
> Or is it possible to disable this operation in any way?

In theory yes, but then you won't be able to list buckets.
We have had some discussions about splitting the bucket index into multiple
objects. That would help with this issue, at the cost of bucket listing/stats
becoming slower.
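
On the op/disk threads question: they can be bumped at runtime to test, e.g.
(values purely illustrative)

  ceph tell osd.3 injectargs '--osd-op-threads 4 --osd-disk-threads 2'

but note that extra threads cannot remove the fundamental serialization: every
write to a bucket updates the same .dir.<bucket-id> object, and writes to a
single object in a single PG are applied in order, so the per-bucket ceiling
stays wherever that one object's OSD puts it.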

