Re: Performance degradation after upgrade to Hammer

Hi Mark

Yes, there are enough PGs, and there are no errors in the Apache logs.
We identified a bottleneck on the bucket index, with huge IOPS on one OSD (all the IOPS land on a single bucket).

With bucket index sharding configured (32 shards), write IOPS are now 5x better (after a bucket delete/create). But we don't yet reach Firefly performance.

A Red Hat case is in progress; I will share the outcome with the community later.
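
For anyone following along: in Hammer, bucket index sharding is set through a radosgw config option and only applies to newly created buckets (hence the bucket delete/create mentioned above). A minimal sketch; the section name is an assumption and must match your own rgw instance:

```ini
; ceph.conf on the radosgw host -- [client.radosgw.gateway] is a
; placeholder, use your actual rgw section name
[client.radosgw.gateway]
; spread each new bucket's index over 32 shard objects
rgw override bucket index max shards = 32
```

Restart the radosgw process and recreate the bucket for the setting to take effect.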

Sent from my iPhone

> On 22 Jul 2015, at 08:20, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> 
> Ok,
> 
> So it's good news that RADOS appears to be doing well.  I'd say the next step is to follow some of the recommendations here:
> 
> http://ceph.com/docs/master/radosgw/troubleshooting/
> 
> If you examine the objecter_requests and perfcounters during your Cosbench write test, it might help explain where the requests are backing up.  Another thing to look for (as noted in the above URL) is HTTP errors in the Apache logs (if relevant).
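
(For reference: the objecter_requests and perf counters mentioned above come from the radosgw admin socket. A sketch of the commands, assuming the default socket path; adjust the .asok name to your instance:

```shell
# In-flight RADOS requests from the gateway's objecter
# (shows where ops are queuing up):
ceph daemon /var/run/ceph/ceph-client.radosgw.gateway.asok objecter_requests

# Dump all perf counters from the same daemon:
ceph daemon /var/run/ceph/ceph-client.radosgw.gateway.asok perf dump
```

Both commands are run on the host where the radosgw process lives.)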
> 
> Other general thoughts:  When you upgraded to hammer did you change the RGW configuration at all?  Are you using civetweb now?  Does the rgw.buckets pool have enough PGs?
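
(A note for readers: in Hammer, civetweb is enabled via the rgw frontends option instead of Apache + FastCGI; the section name below is an assumption:

```ini
[client.radosgw.gateway]
; embedded civetweb listener instead of Apache + FastCGI
rgw frontends = civetweb port=7480
```
)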
> 
> 
> Mark
> 
>> On 07/21/2015 08:17 PM, Florent MONTHEL wrote:
>> Hi Mark
>> 
>> I get something like 600 write IOPS on the EC pool and 800 write IOPS on the replica-3 pool with rados bench.
>> 
>> With Radosgw I get 30-40 write IOPS with Cosbench (1 radosgw; the same with 2), and the servers are nearly idle:
>> - 0.005 core for the radosgw process
>> - 0.01 core for each osd process
>> 
>> I don't know whether we can hit .rgw* pool locking or something like that with Hammer (or whether this situation is specific to me).
>> 
>> On a 100% read profile, the Radosgw and Ceph servers work very well, with more than 6000 IOPS on one radosgw server:
>> - 7 cores for the radosgw process
>> - 1 core for each osd process
>> - 0.5 core for each Apache process
>> 
>> Thanks
>> 
>> Sent from my iPhone
>> 
>>> On 14 Jul 2015, at 21:03, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>>> 
>>> Hi Florent,
>>> 
>>> 10x degradation is definitely unusual!  A couple of things to look at:
>>> 
>>> Are 8K rados bench writes to the rgw.buckets pool slow?  You can check with something like:
>>> 
>>> rados -p rgw.buckets bench 30 write -t 256 -b 8192
>>> 
>>> You may also want to try targeting a specific RGW server to make sure the RR-DNS setup isn't interfering (at least while debugging).  It may also be worth creating a new replicated pool and trying writes to that pool to see whether there is much difference.
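
(The test-pool suggestion above might look like this; the pool name and PG count are illustrative:

```shell
# Create a throwaway replicated pool with 128 placement groups:
ceph osd pool create benchtest 128 128 replicated

# Same 8K / 256-thread write load as the rados bench line above:
rados -p benchtest bench 30 write -t 256 -b 8192
```
)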
>>> 
>>> Mark
>>> 
>>>> On 07/14/2015 07:17 PM, Florent MONTHEL wrote:
>>>> Yes, of course. Thanks Mark.
>>>> 
>>>> Infrastructure: 5 servers with 10 SATA disks each (50 OSDs in total), 10 Gb networking, EC 2+1 on the rgw.buckets pool, 2 RR-DNS-style radosgw instances installed on 2 of the cluster servers.
>>>> No SSD drives are used.
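
(A back-of-the-envelope check on the layout above: EC 2+1 writes k+m = 3 chunks per object, so each small PUT still touches 3 OSDs, just as replica-3 does, while usable capacity differs. A quick sketch:

```python
def ec_usable_fraction(k: int, m: int) -> float:
    """Fraction of raw capacity usable for data with an EC k+m profile."""
    return k / (k + m)

def replica_usable_fraction(copies: int) -> float:
    """Fraction of raw capacity usable with n-way replication."""
    return 1.0 / copies

# EC 2+1 keeps 2/3 of raw capacity; replica 3 keeps only 1/3,
# but both place 3 chunks/copies per client write.
print(ec_usable_fraction(2, 1))    # ~0.667
print(replica_usable_fraction(3))  # ~0.333
```
)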
>>>> 
>>>> We're using Cosbench to send:
>>>> - 8k object size, 100% read, 256 workers: better results with Hammer
>>>> - 8k object size, 80% read / 20% write, 256 workers: real degradation between Firefly and Hammer (divided by something like 10)
>>>> - 8k object size, 100% write, 256 workers: real degradation between Firefly and Hammer (divided by something like 10)
>>>> 
>>>> Thanks
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>>> On 14 Jul 2015, at 19:57, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>>>>> 
>>>>> On 07/14/2015 06:42 PM, Florent MONTHEL wrote:
>>>>>> Hi All,
>>>>>> 
>>>>>> I've just upgraded the Ceph cluster from Firefly 0.80.8 (Red Hat Ceph 1.2.3) to Hammer (Red Hat Ceph 1.3). Usage: radosgw with Apache 2.4.19 in MPM prefork mode.
>>>>>> I'm experiencing a huge write performance degradation right after the upgrade (Cosbench).
>>>>>> 
>>>>>> Have you already run performance tests comparing Hammer and Firefly?
>>>>>> 
>>>>>> No problem with read performance, which was amazing.
>>>>> 
>>>>> Hi Florent,
>>>>> 
>>>>> Can you talk a little bit about how your write tests are set up?  How many concurrent IOs and what size?  Also, do you see similar problems with rados bench?
>>>>> 
>>>>> We have done some testing and haven't seen significant performance degradation, except when switching to civetweb, which appears to perform deletes more slowly than what we saw with apache+fcgi.
>>>>> 
>>>>> Mark
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Sent from my iPhone
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


