Re: poor radosgw performance

Hi Yehuda,

I did try bumping pg_num on .rgw, .rgw.buckets, and .rgw.buckets.index from 8 to 220 prior to writing to the list, but when I saw no difference in performance I set it back to 8 (by creating new pools, etc.).
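
For reference, the bump itself was along these lines (pgp_num has to be
raised along with pg_num for the change to take effect):

    ceph osd pool set .rgw.buckets pg_num 220
    ceph osd pool set .rgw.buckets pgp_num 220

(and the same for .rgw and .rgw.buckets.index; since pg_num can't be
decreased on an existing pool, going back to 8 meant recreating the pools.)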

One thing we have since noticed is that radosgw is validating tokens on each request; when we use ceph authentication instead we see much more promising results from swift-bench.
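
For what it's worth, the swift-bench runs below use a config roughly like
the following (the endpoint, user, and key are placeholders for our setup;
concurrency and object_size match the numbers further down):

    [bench]
    auth = http://proxy-host:8080/auth/v1.0
    user = test:tester
    key = testing
    concurrency = 10
    object_size = 1
    num_objects = 1000
    num_gets = 10000
    delete = yes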

Is there a known issue w/ keystone token caching in radosgw?  It's my understanding that 10,000 tokens should be cached by default; however, this doesn't appear to be the case.  I've explicitly set rgw_keystone_token_cache_size in /etc/ceph/ceph.conf on my radosgw node, yet radosgw continues to hit keystone on each request.
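
For completeness, the relevant section of my ceph.conf looks roughly like
this (host and token are placeholders; the last few lines are the tweaks
from my original mail, quoted below):

    [client.radosgw.gateway]
    rgw keystone url = http://keystone-host:35357
    rgw keystone admin token = <admin token>
    rgw keystone accepted roles = Member, admin
    rgw keystone token cache size = 10000
    rgw print continue = true
    rgw enable ops log = false
    rgw ops log rados = false
    debug rgw = 0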

Additionally, what does /var/lib/ceph/radosgw/ceph-radosgw.gateway get used for?  I see the docs mention that it needs to be created, yet it remains unpopulated on my nodes, and a quick scan of the ceph code shows no reference to it being used anywhere (though I may be missing something).

Thanks again for the help!

-Matt



On Thu, Sep 19, 2013 at 5:01 PM, Yehuda Sadeh <yehuda@xxxxxxxxxxx> wrote:
On Thu, Sep 19, 2013 at 8:52 AM, Matt Thompson <wateringcan@xxxxxxxxx> wrote:
> Hi All,
>
> We're trying to test swift API performance of swift itself (1.9.0) and
> ceph's radosgw (0.67.3) using the following hardware configuration:
>
> Shared servers:
>
> * 1 server running keystone for authentication
> * 1 server running swift-proxy, a single MON, and radosgw + Apache / FastCGI
>
> Ceph:
>
> * 4 storage servers, 5 storage disks / 5 OSDs on each (no separate disk(s)
> for journal)
>
> Swift:
>
> * 4 storage servers, 5 storage disks on each
>
> All 10 machines have identical hardware configurations (including drive type
> & speed).
>
> We deployed ceph w/ ceph-deploy and both swift and ceph have default
> configurations w/ the exception of the following:
>
> * custom Inktank packages for apache2 / libapache2-mod-fastcgi
> * rgw_print_continue enabled
> * rgw_enable_ops_log disabled
> * rgw_ops_log_rados disabled
> * debug_rgw disabled
>
> (actually, swift was deployed w/ a chef cookbook, so configurations may be
> slightly non-standard)
>
> On the ceph storage servers, filesystem type (XFS) and filesystem mount
> options, pg_nums on pools, etc. have all been left with the defaults (8 on
> the radosgw-related pools IIRC).

8 PGs per pool, especially for the data and index pools, is awfully low,
and is probably where your bottleneck is.
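
As a rough rule of thumb, aim for about (number of OSDs * 100) / replica
count PGs in total across the active pools; with your 20 OSDs and the
default replica count of 2 that's around (20 * 100) / 2 = 1000, so a few
hundred PGs each on the data and index pools would be a saner starting
point than 8.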

>
> Doing a preliminary test w/ swift-bench (concurrency = 10, object_size = 1),
> we're seeing the following:
>
> Ceph:
>
> 1000 PUTS **FINAL** [0 failures], 14.8/s
> 10000 GETS **FINAL** [0 failures], 40.9/s
> 1000 DEL **FINAL** [0 failures], 34.6/s
>
> Swift:
>
> 1000 PUTS **FINAL** [0 failures], 21.7/s
> 10000 GETS **FINAL** [0 failures], 139.5/s
> 1000 DEL **FINAL** [0 failures], 85.5/s
>
> That's a relatively significant difference.  Would we see any real
> difference in moving the journals to an SSD per server or separate partition
> per OSD disk?  These machines are not seeing any load short of what's being

Maybe, but I think at this point you're hitting the low-PG issue.

> generated by swift-bench.  Alternatively, would we see any quick wins
> standing up more MONs or moving the MON off the server running radosgw +
> Apache / FastCGI?

I don't think it's going to make much of a difference right now.

Yehuda

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
