Re: poor radosgw performance

On 09/20/2013 05:51 AM, Matt Thompson wrote:
> Hi Yehuda,
>
> I did try bumping up pg_num on .rgw, .rgw.buckets, and .rgw.buckets.index from 8 to 220 prior to writing to the list, but when I saw no difference in performance I set them back to 8 (by creating new pools, etc.).

Hi Matt,

You'll want to bump these back up.  They may not be hurting now if there is another bottleneck, but this could easily become the bottleneck under load.
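For reference, bumping them back up is just a couple of commands per pool; a minimal sketch using the pool names from your message, with 256 as an illustrative power-of-two value (pg_num can only be increased, which is presumably why recreating the pools was needed to get back to 8):

    # pgp_num must follow pg_num so data actually rebalances onto the new PGs
    ceph osd pool set .rgw pg_num 256
    ceph osd pool set .rgw pgp_num 256
    ceph osd pool set .rgw.buckets pg_num 256
    ceph osd pool set .rgw.buckets pgp_num 256
    ceph osd pool set .rgw.buckets.index pg_num 256
    ceph osd pool set .rgw.buckets.index pgp_num 256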


> One thing we have since noticed is that radosgw is validating tokens on each request; when we use ceph authentication instead, we see much more promising results from swift-bench.

Interesting!  I haven't looked into this at all.  Can you describe more about your test?
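(For context, in case "ceph authentication" means a native radosgw Swift user rather than keystone: setting one up is roughly the following, with placeholder names, and with swift-bench then pointed at the gateway's own auth endpoint instead of keystone.)

    # create a radosgw user plus a Swift subuser and key (names are placeholders)
    radosgw-admin user create --uid=bench --display-name="swift-bench user"
    radosgw-admin subuser create --uid=bench --subuser=bench:swift --access=full
    radosgw-admin key create --subuser=bench:swift --key-type=swift --gen-secret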


> Is there a known issue w/ keystone token caching in radosgw?  It's my understanding that 10,000 tokens should be cached by default; however, this doesn't appear to be the case.  I've explicitly set rgw_keystone_token_cache_size in /etc/ceph/ceph.conf on my radosgw node, yet radosgw continues to hit keystone on each request.
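(For reference, that option lives in the gateway's section of ceph.conf; a minimal sketch, assuming the instance is named client.radosgw.gateway and that the daemon is restarted for it to take effect:)

    [client.radosgw.gateway]
        # 10,000 is the expected default; set explicitly while debugging
        rgw keystone token cache size = 10000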

> Additionally, what does /var/lib/ceph/radosgw/ceph-radosgw.gateway get used for?  I see the docs mention that it needs to be created, yet it remains unpopulated on my nodes, and doing a quick scan of the ceph code I see no reference to it being used anywhere (though I may be missing something).

> Thanks again for the help!

Other things:

What version of swift-bench are you using?  Newer versions let you write into multiple containers, which may be worth trying to make sure you aren't getting hung up on container indexes.  Another thing to mention: 0.67.3 has a bug that dramatically slows down performance, which is fixed in wip-6286.  With small objects especially, you may see a dramatic difference.
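For what it's worth, a sketch of a swift-bench config that spreads writes across multiple containers (assuming a version new enough to have num_containers; the auth settings are placeholders, the counts match your run):

    [bench]
    auth = http://keystone-host:5000/v2.0/
    user = tenant:username
    key = password
    auth_version = 2.0
    concurrency = 10
    object_size = 1
    num_objects = 1000
    num_gets = 10000
    # spread objects over several containers instead of a single one
    num_containers = 20

Run it with: swift-bench /path/to/bench.conf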

Otherwise, we are also investigating the effect of our OSD directory splitting behaviour on RGW performance.  With low PG counts and high object counts, this may be causing performance issues as well.
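The knobs involved there are the filestore split/merge thresholds on the OSDs; a sketch with what I believe are the defaults of this era, shown for reference rather than as a tuning recommendation (a directory splits at roughly 16 * split multiple * merge threshold objects, i.e. ~320 with these values):

    [osd]
        filestore merge threshold = 10
        filestore split multiple = 2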


> -Matt



On Thu, Sep 19, 2013 at 5:01 PM, Yehuda Sadeh <yehuda@xxxxxxxxxxx> wrote:
On Thu, Sep 19, 2013 at 8:52 AM, Matt Thompson <wateringcan@xxxxxxxxx> wrote:
> Hi All,
>
> We're trying to test swift API performance of swift itself (1.9.0) and
> ceph's radosgw (0.67.3) using the following hardware configuration:
>
> Shared servers:
>
> * 1 server running keystone for authentication
> * 1 server running swift-proxy, a single MON, and radosgw + Apache / FastCGI
>
> Ceph:
>
> * 4 storage servers, 5 storage disks / 5 OSDs on each (no separate disk(s)
> for journal)
>
> Swift:
>
> * 4 storage servers, 5 storage disks on each
>
> All 10 machines have identical hardware configurations (including drive type
> & speed).
>
> We deployed ceph w/ ceph-deploy and both swift and ceph have default
> configurations w/ the exception of the following:
>
> * custom Inktank packages for apache2 / libapache2-mod-fastcgi
> * rgw_print_continue enabled
> * rgw_enable_ops_log disabled
> * rgw_ops_log_rados disabled
> * debug_rgw disabled
>
> (actually, swift was deployed w/ a chef cookbook, so configurations may be
> slightly non-standard)
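(For reference, the radosgw overrides listed above correspond to a ceph.conf fragment roughly like this; the section name is a guess at the usual gateway instance name:)

    [client.radosgw.gateway]
        rgw print continue = true
        rgw enable ops log = false
        rgw ops log rados = false
        debug rgw = 0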
>
> On the ceph storage servers, filesystem type (XFS) and filesystem mount
> options, pg_nums on pools, etc. have all been left with the defaults (8 on
> the radosgw-related pools IIRC).

8 PGs per pool, especially for the data / index pools, is awfully low,
and that is probably where your bottleneck is.
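As a rough rule of thumb (the usual guideline, not anything specific to this thread): total PGs ≈ (number of OSDs × 100) / replica count, rounded to a power of two and then spread across the pools that actually hold data.  With your 20 OSDs and, say, 2x replication that is 20 × 100 / 2 = 1000, so something on the order of 1024 PGs in total, with most of them on .rgw.buckets, rather than 8 per pool.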

>
> Doing a preliminary test w/ swift-bench (concurrency = 10, object_size = 1),
> we're seeing the following:
>
> Ceph:
>
> 1000 PUTS **FINAL** [0 failures], 14.8/s
> 10000 GETS **FINAL** [0 failures], 40.9/s
> 1000 DEL **FINAL** [0 failures], 34.6/s
>
> Swift:
>
> 1000 PUTS **FINAL** [0 failures], 21.7/s
> 10000 GETS **FINAL** [0 failures], 139.5/s
> 1000 DEL **FINAL** [0 failures], 85.5/s
>
> That's a relatively significant difference.  Would we see any real
> difference in moving the journals to an SSD per server or separate partition
> per OSD disk?  These machines are not seeing any load short of what's being

Maybe, but I think at this point you're hitting the low PG count issue.

> generated by swift-bench.  Alternatively, would we see any quick wins
> standing up more MONs or moving the MON off the server running radosgw +
> Apache / FastCGI?

I don't think it's going to make much of a difference right now.
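(If you do revisit journals later, a minimal ceph-deploy sketch with placeholder host/device names would look like this, one SSD partition per OSD journal; but as above, fix the PG counts first:)

    ceph-deploy osd create ceph-store1:sdb:/dev/sdf1
    ceph-deploy osd create ceph-store1:sdc:/dev/sdf2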

Yehuda



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

