Re: poor radosgw performance

On Fri, Sep 20, 2013 at 1:50 PM, Matt Thompson <wateringcan@xxxxxxxxx> wrote:
>
> Hi Yehuda / Mark,
>
> Thanks for the information!  We will try keystone authentication again when the next dumpling dot release is out.
>
> As for "ceph cache", are you referring to "rgw_cache_enabled"?  If so, we don't have that set in our ceph.conf, so in theory we should already be using it.
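>
> For reference, my understanding is that the cache is on by default, so
> setting it explicitly in ceph.conf would look something like this
> (a sketch only; section name per our setup):
>
>     [client.radosgw.gateway]
>     rgw cache enabled = true
>     rgw cache lru size = 10000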

I actually meant to say ceph authentication, not ceph cache.
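
In case it's useful, a rough sketch of setting up a native (non-keystone)
swift user for testing against radosgw (uid and subuser names are just
examples):

    radosgw-admin user create --uid=benchuser --display-name="Bench User"
    radosgw-admin subuser create --uid=benchuser --subuser=benchuser:swift --access=full
    radosgw-admin key create --subuser=benchuser:swift --key-type=swift --gen-secret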

Yehuda

>
>
>
> On Fri, Sep 20, 2013 at 3:57 PM, Yehuda Sadeh <yehuda@xxxxxxxxxxx> wrote:
>>
>> On Fri, Sep 20, 2013 at 3:51 AM, Matt Thompson <wateringcan@xxxxxxxxx> wrote:
>> > Hi Yehuda,
>> >
>> > I did try bumping up pg_num on .rgw, .rgw.buckets, and .rgw.buckets.index
>> > from 8 to 220 prior to writing to the list, but when I saw no difference
>> > in performance I set it back to 8 (by creating new pools etc.)
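>> >
>> > For anyone following along, the bump was along these lines (pgp_num
>> > needs to be raised alongside pg_num, and current values can be checked
>> > with "ceph osd dump | grep '^pool'"):
>> >
>> >     ceph osd pool set .rgw.buckets pg_num 220
>> >     ceph osd pool set .rgw.buckets pgp_num 220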
>> >
>> > One thing we have since noticed is that radosgw is validating tokens on each
>> > request; when we use ceph authentication instead we see much more promising
>> > results from swift-bench.
>> >
>> > Is there a known issue w/ keystone token caching in radosgw?  It's my
>> > understanding that 10,000 tokens should be cached by default, however this
>> > doesn't appear to be the case.  I've explicitly set
>> > rgw_keystone_token_cache_size in /etc/ceph/ceph.conf on my radosgw node yet
>> > radosgw continues to hit keystone on each request.
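>> >
>> > Specifically, what we set on the gateway node (10000 is also supposed
>> > to be the default):
>> >
>> >     [client.radosgw.gateway]
>> >     rgw keystone token cache size = 10000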
>> >
>>
>> Looking at the code now, I think I see the culprit. It's something
>> that was actually fixed in recent versions, but the fix isn't in
>> dumpling. I opened a ticket for it (6360) and I'll prepare a fix that
>> will hopefully make it into the next dumpling dot release. In the
>> meantime, the way to go would be to use the ceph cache.
>>
>> > Additionally, what does /var/lib/ceph/radosgw/ceph-radosgw.gateway get used
>> > for?  I see the docs mention that it needs to be created, yet it remains
>> > unpopulated on my nodes, and doing a quick scan of ceph code I see no
>> > reference to it being used anywhere (though I may be missing something).
>>
>> That looks like a generic ceph directory that can be used to hold your
>> specific user's keyring file (but I might be wrong).
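>>
>> E.g., a common pattern would be something like this (a guess at the
>> intended use, with illustrative caps):
>>
>>     ceph auth get-or-create client.radosgw.gateway \
>>         mon 'allow rw' osd 'allow rwx' \
>>         -o /var/lib/ceph/radosgw/ceph-radosgw.gateway/keyring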
>>
>>
>> >
>> > Thanks again for the help!
>> >
>> > -Matt
>> >
>> >
>> >
>> > On Thu, Sep 19, 2013 at 5:01 PM, Yehuda Sadeh <yehuda@xxxxxxxxxxx> wrote:
>> >>
>> >> On Thu, Sep 19, 2013 at 8:52 AM, Matt Thompson <wateringcan@xxxxxxxxx>
>> >> wrote:
>> >> > Hi All,
>> >> >
>> >> > We're trying to test swift API performance of swift itself (1.9.0) and
>> >> > ceph's radosgw (0.67.3) using the following hardware configuration:
>> >> >
>> >> > Shared servers:
>> >> >
>> >> > * 1 server running keystone for authentication
>> >> > * 1 server running swift-proxy, a single MON, and radosgw + Apache /
>> >> > FastCGI
>> >> >
>> >> > Ceph:
>> >> >
>> >> > * 4 storage servers, 5 storage disks / 5 OSDs on each (no separate
>> >> > disk(s)
>> >> > for journal)
>> >> >
>> >> > Swift:
>> >> >
>> >> > * 4 storage servers, 5 storage disks on each
>> >> >
>> >> > All 10 machines have identical hardware configurations (including drive
>> >> > type
>> >> > & speed).
>> >> >
>> >> > We deployed ceph w/ ceph-deploy, and both swift and ceph have default
>> >> > configurations w/ the exception of the following (ceph.conf sketch below):
>> >> >
>> >> > * custom Inktank packages for apache2 / libapache2-mod-fastcgi
>> >> > * rgw_print_continue enabled
>> >> > * rgw_enable_ops_log disabled
>> >> > * rgw_ops_log_rados disabled
>> >> > * debug_rgw disabled
>> >> >
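>> >> > As a rough sketch, the rgw side of those exceptions in ceph.conf looks
>> >> > something like this (section name per our setup):
>> >> >
>> >> >     [client.radosgw.gateway]
>> >> >     rgw print continue = true
>> >> >     rgw enable ops log = false
>> >> >     rgw ops log rados = false
>> >> >     debug rgw = 0
>> >> >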
>> >> > (actually, swift was deployed w/ a chef cookbook, so configurations may
>> >> > be
>> >> > slightly non-standard)
>> >> >
>> >> > On the ceph storage servers, filesystem type (XFS) and filesystem mount
>> >> > options, pg_nums on pools, etc. have all been left with the defaults (8
>> >> > on
>> >> > the radosgw-related pools IIRC).
>> >>
>> >> 8 PGs per pool, especially for the data / index pools, is awfully
>> >> low, and is probably where your bottleneck is.
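>> >>
>> >> A common rule of thumb (not a hard rule) is:
>> >>
>> >>     total PGs ~= (num OSDs * 100) / replica count
>> >>
>> >> so with your 20 OSDs and 2x replication that's (20 * 100) / 2 = 1000,
>> >> i.e. something like 1024 PGs spread across the pools that actually
>> >> take traffic.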
>> >>
>> >> >
>> >> > Doing a preliminary test w/ swift-bench (concurrency = 10,
>> >> > object_size = 1), we're seeing the following:
>> >> >
>> >> > Ceph:
>> >> >
>> >> > 1000 PUTS **FINAL** [0 failures], 14.8/s
>> >> > 10000 GETS **FINAL** [0 failures], 40.9/s
>> >> > 1000 DEL **FINAL** [0 failures], 34.6/s
>> >> >
>> >> > Swift:
>> >> >
>> >> > 1000 PUTS **FINAL** [0 failures], 21.7/s
>> >> > 10000 GETS **FINAL** [0 failures], 139.5/s
>> >> > 1000 DEL **FINAL** [0 failures], 85.5/s
>> >> >
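>> >> > (For completeness, the swift-bench config was along these lines;
>> >> > auth URL and credentials are placeholders:)
>> >> >
>> >> >     [bench]
>> >> >     auth = http://keystone-host:5000/v2.0/
>> >> >     user = tenant:user
>> >> >     key = secret
>> >> >     concurrency = 10
>> >> >     object_size = 1
>> >> >     num_objects = 1000
>> >> >     num_gets = 10000
>> >> >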
>> >> > That's a relatively significant difference.  Would we see any real
>> >> > difference in moving the journals to an SSD per server or a separate
>> >> > partition per OSD disk?  These machines are not seeing any load
>> >> > beyond what's being
>> >>
>> >> Maybe, but I think at this point you're hitting the low-PG issue.
>> >>
>> >> > generated by swift-bench.  Alternatively, would we see any quick wins
>> >> > standing up more MONs or moving the MON off the server running radosgw +
>> >> > Apache / FastCGI?
>> >>
>> >> I don't think it's going to make much of a difference right now.
>> >>
>> >> Yehuda
>> >
>> >
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com