Re: radosgw performance

Answering in reverse order:

We are using mod_fastcgi almost exactly in the documented way.

I have deleted and recreated these pools with the following pg_nums
.rgw.buckets 1024
.rgw 64
.rgw.gc 64
.rgw.control 64
.users.uid 64
.users 64
We currently have 32 OSDs but are planning to increase that to 100+ shortly.
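
For reference, a quick back-of-the-envelope on those numbers (assuming the then-default replication size of 2; your pool size may differ):

```python
# Rough sanity check on the pg_num choices above.
# Assumption (adjust to your cluster): replication size 2.

pg_nums = {
    ".rgw.buckets": 1024,
    ".rgw": 64,
    ".rgw.gc": 64,
    ".rgw.control": 64,
    ".users.uid": 64,
    ".users": 64,
}
replicas = 2
osds = 32

total_pgs = sum(pg_nums.values())               # PGs across the rgw pools
pg_copies_per_osd = total_pgs * replicas / osds  # replica PGs per OSD today

print(total_pgs, pg_copies_per_osd)
```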

The default 'admin socket' /var/run/ceph/$cluster-$name.asok does not exist for rgw, and 'lsof -p <pid_of_rgw>' does not show one either.
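
In case anyone is reproducing this: the socket only seems to appear once it is configured explicitly for the rgw client section. A sketch, assuming the instance is named client.radosgw.gateway (match your own section name):

```ini
; ceph.conf -- section name is hypothetical, use your rgw instance's
[client.radosgw.gateway]
    admin socket = /var/run/ceph/radosgw.asok
```

After restarting radosgw, something like 'ceph --admin-daemon /var/run/ceph/radosgw.asok perf dump' should then answer.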

Correlating logs revealed some interesting stuff though. With 'debug ms = 1' the rgw log shows a few more lines (summarized for readability):

01:17:08.371741 TID 20 get_obj_state: rctx=0x7f8624002520 obj=b:key ...
01:17:08.371772 TID  1 -- rgw/1005379 --> osd.19/19507 -- osd_op(cli:316 datafile [getxattrs,stat]...
01:17:08.373125 TID 20 prepare_atomic_for_write_impl: state is not atomic. state=0x7f8624010378
01:17:08.373241 TID  1 -- rgw/1005379 --> osd.3/7602 -- osd_op(cli:317 .dir.4470.1 [call rgw.bucket_prepare_op]...
01:17:08.741306 TID  1 -- rgw/1005379 --> osd.19/19507 -- osd_op(cli:387 datafile [create 0~0 ...
01:17:08.745096 TID  1 -- rgw/1005379 --> osd.3/7602 -- osd_op(cli:388 .dir.4470.1 [call rgw.bucket_complete_op]...
01:17:08.745175 TID  2 req 58:0.373932:s3:PUT /b/key:put_obj:http status=200
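
For picking slow requests out of a large log, the trailing req line can be parsed mechanically. A quick sketch; the id:elapsed:s3 field layout is inferred from the excerpt above, not from any documented format:

```python
import re

# Extract request id, elapsed seconds, and HTTP method from an rgw
# 'req' summary line (example line copied from the excerpt above).
line = ("01:17:08.745175 TID  2 req 58:0.373932:s3:PUT "
        "/b/key:put_obj:http status=200")

m = re.search(r"req (\d+):([\d.]+):s3:(\w+)", line)
req_id, elapsed, method = int(m.group(1)), float(m.group(2)), m.group(3)
print(req_id, elapsed, method)  # 58 0.373932 PUT
```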

This shows the 0.37 sec PUT operation is almost entirely the bucket_prepare_op for .dir.4470.1 on osd.3, appearing as client.6515.0:317, which is shortened to cli:317. The relevant parts of the debug log on osd.3 read (again summarized):

01:17:08.379578 TID1  1 -- osd.3/7602 <== client.6515 rgw/1005379 114 ==== osd_op(cli:317 .dir.4470.1 [call rgw.bucket_prepare_op] ...
01:17:08.379666 TID1 15 osd.3 288 enqueue_op ... latency 0.000243 osd_op(cli:317 .dir.4470.1 [call rgw.bucket_prepare_op] ...
01:17:08.675843 TID2 10 osd.3 288 dequeue_op ... latency 0.296420 osd_op(cli:317 .dir.4470.1 [call rgw.bucket_prepare_op] .. pg[10.11...]
01:17:08.675987 TID2 10 osd.3 pg_epoch: 288 pg[10.11...]  taking ondisk_read_lock
01:17:08.736329 TID2 10 osd.3 pg_epoch: 288 pg[10.11...] do_osd_op d91dcc11/.dir.4470.1/head//10 [call rgw.bucket_prepare_op]
01:17:08.736884 TID2  1 -- osd.3/7602 --> osd.26 osd.26/3582 -- osd_sub_op(cli:317 10.11 d91dcc11/.dir.4470.1/head//10 ...
01:17:08.738700 TID3  1 -- osd.3/7602 <== osd.26 osd.26/3582 134 ==== osd_sub_op_reply(cli:317 10.11 ...
01:17:08.738783 TID3 15 osd.3 288 enqueue_op ... latency 0.000229 osd_sub_op_reply(cli:317 10.11 d91dcc11/.dir.4470.1/head//10 ...
01:17:08.746120 TID2 10 osd.3 288 dequeue_op ... latency 0.007566 osd_sub_op_reply(cli:317 10.11 d91dcc11/.dir.4470.1/head//10 ...
01:17:08.746417 TID4  1 -- osd.3/7602 --> rgw/1005379 -- osd_op_reply(317 .dir.4470.1 [call rgw.bucket_prepare_op] ack = 0) ...

This shows the OSD receiving the request from rgw and queuing it almost immediately, then 0.29 secs passing before the operation gets a chance to actually run. It then takes an ondisk_read_lock, which adds about 0.06 secs. Replication to the other OSD is fast, and after a few milliseconds rgw is notified that the bucket_prepare_op is complete. All of this is consistent with 'rados bench' being faster, since it uses no bucket_prepare or bucket_complete at all.
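
The queue wait can be recomputed directly from the timestamps above:

```python
from datetime import datetime

# Recompute the wait between enqueue_op and dequeue_op using the
# timestamps copied from the osd.3 excerpt above.
enqueue = "01:17:08.379666"
dequeue = "01:17:08.675843"

fmt = "%H:%M:%S.%f"
wait = datetime.strptime(dequeue, fmt) - datetime.strptime(enqueue, fmt)
print(wait.total_seconds())  # close to the 0.296420 latency the OSD reports
```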

What I read here at an architectural level is that every rgw write must pass through the .dir.<bucket-id> object, which happens to be a scalability killer for large buckets, or for buckets with a lot of activity for that matter. For a large number of operations it won't matter how many OSDs you have, since this object will always sit in a single PG. Is that correct? Could changing (increasing) op threads or disk threads make any difference here? Or is it possible to disable this operation in any way?
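
A toy model of the hotspot I mean: data objects hash across many PGs, but every index update targets the single PG that holds .dir.<bucket-id>. Plain md5 stands in for Ceph's actual rjenkins+CRUSH mapping here, and the names and pg_num are made up for illustration:

```python
import hashlib

def pg_of(name, pg_num=1024):
    # Simplified stand-in for Ceph's object -> PG mapping.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % pg_num

# 1000 data objects spread over many PGs...
data_pgs = {pg_of("benchmark_object_%d" % i) for i in range(1000)}
# ...but 1000 index updates all land on the same PG.
index_pgs = {pg_of(".dir.4470.1") for _ in range(1000)}

print(len(data_pgs), len(index_pgs))  # many data PGs vs. exactly 1 index PG
```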

Thanks in advance



On Mon, Feb 18, 2013 at 6:52 AM, Yehuda Sadeh <yehuda@xxxxxxxxxxx> wrote:
On Sun, Feb 17, 2013 at 2:30 PM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx> wrote:
> Hi Yehuda,
>
> Thanks for fast reply.
> We are running 0.56.3. On ubuntu 12.04.
>
> I disabled ops log using
> 'rgw enable ops log = false'
> but i missed 'debug rgw = 0' so what i did was simply 'log file = /dev/null'
> :)
>
> Anyways. i tried 'debug rgw = 0' but it didn't change anything. To see
> what's going on in rgw i did 'debug rgw=2' and rerun the test.
>
> # cat radosgw.log |grep put_obj |grep status=200
> ...
> 2013-02-17 13:51:37.415981 7f54167f4700  2 req 613:0.327243:s3:PUT
> /mybucket/benchmark_data_ceph-10_31019_object611:put_obj:http status=200
> 2013-02-17 13:51:37.431779 7f54137ee700  2 req 614:0.318996:s3:PUT
> /mybucket/benchmark_data_ceph-10_31019_object612:put_obj:http status=200
> 2013-02-17 13:51:37.447688 7f53f37ae700  2 req 615:0.319085:s3:PUT
> /mybucket/benchmark_data_ceph-10_31019_object613:put_obj:http status=200
> 2013-02-17 13:51:37.460531 7f53fbfbf700  2 req 581:0.887859:s3:PUT
> /mybucket/benchmark_data_ceph-10_31019_object579:put_obj:http status=200
> 2013-02-17 13:51:37.468215 7f5411feb700  2 req 616:0.326575:s3:PUT
> /mybucket/benchmark_data_ceph-10_31019_object614:put_obj:http status=200
> 2013-02-17 13:51:37.480233 7f54267fc700  2 req 617:0.335292:s3:PUT
> /mybucket/benchmark_data_ceph-10_31019_object615:put_obj:http status=200
> 2013-02-17 13:51:37.503042 7f54147f0700  2 req 618:0.330277:s3:PUT
> /mybucket/benchmark_data_ceph-10_31019_object616:put_obj:http status=200
> 2013-02-17 13:51:37.519647 7f5413fef700  2 req 619:0.306762:s3:PUT
> /mybucket/benchmark_data_ceph-10_31019_object617:put_obj:http status=200
> 2013-02-17 13:51:37.520274 7f5427fff700  2 req 620:0.307374:s3:PUT
> /mybucket/benchmark_data_ceph-10_31019_object618:put_obj:http status=200
> ...
>
> If i read this correctly requests take 0.32 secs on average. Again, if i'm
> looking at things the right way, that would make 320 secs for like 1175
> requests, and if i divide that by 20 parallel requests that would give 18.8
> secs, which means for a 20 sec test most of the time is passed at radosgw
> <-> rados.

It would be interesting to look at the messenger log (debug ms = 1)
and correlate it with the rgw log (debug rgw = 20). You'll need to
look at a specific request that's slow and filter by the thread id
(the hex number in the 3rd column). It'll contain most of the
interesting stuff, except for the response messages.
>
> i'm not sure how to check apache <-> radosgw or if that's relevant. But i
> can say there is no problem between rest-client <-> apache as i am getting
> the same throughput with a bunch of other clients.
>
> i noticed that earlier but rgw does not have any admin socket, i guess i'm
> running defaults in this matter. Is there an option missing to have an admin
> socket?

'admin socket', by default points at /var/run/ceph/$cluster-$name.asok.

>
> Since i was able to increase rados pool performance but not the radosgw
> performance, the only thing that makes sense to me is that i forgot the
> delete some configuration or something while i delete the original .rgw.
> pools. All i did was 'ceph osd pool delete' is there something more to
> clean?

there are a bunch of pools there. 'rados lspools' will give you the
list, everything that starts with .rgw is relevant.

By the way, what fastcgi module are you using?

Yehuda



--
erdem agaoglu
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
