Re: radosgw daemon stalls on download of some files

Sebastian <webmaster@xxxxxxxx> · Wed, 18 Dec 2013 09:39:46 +0100

Hi,

Thanks for the nginx config. We tried it and it works like a charm: no more stalling downloads. I wonder why Apache does this, though.

Sebastian

On 04.12.2013, at 10:53, Yann ROBIN wrote:

> Hi,
> 
> Our conf :
> server {
>        listen  80;
>        listen  [::]:80;
> 
>        server_name     radosgw-prod;
> 
>        client_max_body_size 1000m;
>        error_log   /var/log/nginx/radosgw-prod-error.log;
>        access_log  off;
> 
> 
>        location / {
>                fastcgi_pass_header     Authorization;
>                fastcgi_pass_request_headers on;
> 
>                if ($request_method  = PUT ) {
>                        rewrite ^       /PUT$request_uri;
>                }
> 
>                include fastcgi_params;
>                client_max_body_size    0;
> 
>                fastcgi_busy_buffers_size 512k;
>                fastcgi_buffer_size 512k;
>                fastcgi_buffers 16 512k;
>                fastcgi_read_timeout 2s;
>                fastcgi_send_timeout 1s;
>                fastcgi_connect_timeout 1s;
> 
> 
>                fastcgi_next_upstream error timeout http_500 http_503;
>                fastcgi_pass ceph-rgw;
>        }
> 
>        location /PUT/ {
>                internal;
>                fastcgi_pass_header     Authorization;
>                fastcgi_pass_request_headers on;
> 
>                include fastcgi_params;
>                client_max_body_size    0;
>                fastcgi_param  CONTENT_LENGTH   $content_length;
> 
>                fastcgi_busy_buffers_size 512k;
>                fastcgi_buffer_size 512k;
>                fastcgi_buffers 16 512k;
> 
>                fastcgi_pass ceph-rgw;
>        }
> }
> 
> 
> Content-Length is only sent with PUT requests because there was an issue with older versions of the RADOS gateway.
> 
> DON'T enable keep-alive: connections are not closed on the radosgw side when the keep-alive option is enabled, which leaves too many connections open on the rgw.
> We use this configuration with a TCP socket, not a local one.
> 
> -----Original Message-----
> From: ceph-users-bounces@xxxxxxxxxxxxxx [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Sebastian
> Sent: Wednesday, 4 December 2013 10:29
> To: ceph-users
> Subject: Re:  radosgw daemon stalls on download of some files
> 
> Hi,
> 
> We are currently using the patched fastcgi version (2.4.7-0910042141-6-gd4fffda). Updating to a more recent version is currently blocked by http://tracker.ceph.com/issues/6453
> 
> Is there any documentation for running radosgw with nginx? I can only find some mailing list posts with config snippets.
> 
> Sebastian
> 
> On 30.11.2013, at 20:46, Andrew Woodward wrote:
> 
>> Are you using the Inktank-patched FastCGI server?
>> http://gitbuilder.ceph.com
>> 
>> Alternatively, try another server such as nginx, as already suggested.
>> 
>> On Nov 29, 2013 12:23 PM, "German Anders" <ganders@xxxxxxxxxxxx> wrote:
>> Thanks a lot, Sebastian, I'm going to try that. I'm also having an issue while testing RBD image creation. I've installed the ceph client on the deploy server:
>> 
>> ceph@ceph-deploy01:/etc/ceph$ sudo rbd -n client.ceph-test -k /home/ceph/ceph-cluster/ceph.client.admin.keyring create --size 10240 cephdata
>> 2013-11-29 15:20:25.683930 7fcd9979c780  0 librados: client.ceph-openstack authentication error (1) Operation not permitted
>> rbd: couldn't connect to the cluster!
>> 
>> Does anyone know what the issue could be here? Maybe it has something to do with the keys, or maybe not...
>> 
>> Thanks in advance,
>> 
>> Best regards,
>> 
>> German Anders
>> 
>>> --- Original message ---
>>> Subject: Re:  radosgw daemon stalls on download of some files
>>> From: Sebastian <webmaster@xxxxxxxx>
>>> To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>>> Date: Friday, 29/11/2013 16:18
>>> 
>>> Hi Yehuda,
>>> 
>>> 
>>>> It's interesting, the responses are received but seems that they 
>>>> aren't being handled (hence the following pings). There are a few 
>>>> things that you could look at. First, try to connect to the admin 
>>>> socket and see if you get any useful information from there. This 
>>>> could include in-flight requests, look for other requests that have 
>>>> not completed. Also see if there's any indication of request throttling.
>>> 
>>> Are you referring to the methods mentioned here: http://ceph.com/docs/dumpling/radosgw/troubleshooting/ ?
>>> Unfortunately the socket file is not present. Do I have to activate it in the config somehow? I could not find any reference to that in the docs. Is it already included in my radosgw version?
>>> radosgw -v
>>> ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)
>>> 
>>>> Another thing to look at would be at the seemingly unrelated timeout 
>>>> messages. These should not happen and might indicate that there's 
>>>> something that is holding you up that shouldn't. Try searching for 
>>>> the same thread id that is specified in these messages (omit the 0x 
>>>> prefix), and see what's the last thing that it's doing.
>>> 
>>> I checked that:
>>> http://pastebin.com/Z23PWwjt
>>> I do not see anything unusual before the messages appear, but maybe you will spot something odd.
>>> 
>>> 
>>>> You could also try turning on 'debug objecter = 20' and see if it 
>>>> provides more info (it's very verbose though).
>>>> 
>>> 
>>> Did that, but that is way too verbose for me ;) I uploaded it here:
>>> http://pastebin.com/VBPAVP6z
>>> There might be some requests mixed into it, but the one for cdn/52974400c6dd6ca719000004/source.avi is the one that stalled. 
>>> 
>>>> How much are you loading the gateway before that happens? We've seen 
>>>> a similar issue in the past that was related to the fcgi library 
>>>> that is dynamically linked with the radosgw process (that is, not 
>>>> the apache mod_fastcgi module). This, however, would only happen 
>>>> when there's heavy load and the fd numbers handled by the radosgw 
>>>> surpassed 1024 (buggy library that was using select() instead of poll()).
>>> 
>>> There are not that many requests on the storage, maybe 10-20 req/min. The cluster serves as a source for a CDN, so once a resource is fetched it should not be fetched again soon. I checked the open files, and there are only about 10-20 open file handles for the radosgw process. So this is probably not the issue.
>>> 
>>> Sebastian
>>> 
>>> 
>>>> 
>>>> Yehuda
>>>> 
>>>> On Fri, Nov 29, 2013 at 7:28 AM, Sebastian <webmaster@xxxxxxxx> wrote:
>>>>> Hi,
>>>>> 
>>>>> Thanks for the hint. I tried this again and noticed that the timeout message does seem to be unrelated. Here is the log file for a stalling request with debug turned on:
>>>>> http://pastebin.com/DcQuc9wP
>>>>> 
>>>>> I cannot find a real "error" in the log. The download stalls at about 500 kB at that point, though. Restarting radosgw fixes it for one download only; the next one is broken again. But as I said, this does not happen for all files.
>>>>> 
>>>>> Sebastian
>>>>> 
>>>>> On 27.11.2013, at 21:53, Yehuda Sadeh wrote:
>>>>> 
>>>>>> On Wed, Nov 27, 2013 at 4:46 AM, Sebastian <webmaster@xxxxxxxx> wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>> We have a setup of 4 servers running Ceph and radosgw. We use it as an internal S3 service for our files. The servers run Debian Squeeze with Ceph 0.67.4.
>>>>>>> 
>>>>>>> The cluster has been running smoothly for quite a while, but we are currently experiencing issues with radosgw. For some files the HTTP download just stalls at around 500 kB.
>>>>>>> 
>>>>>>> The Apache error log just says:
>>>>>>> [error] [client ] FastCGI: comm with server "/var/www/s3gw.fcgi" aborted: idle timeout (30 sec)
>>>>>>> [error] [client ] Handler for fastcgi-script returned invalid result code 1
>>>>>>> 
>>>>>>> radosgw logging:
>>>>>>> 7f00bc66a700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f00934bb700' had timed out after 600
>>>>>>> 7f00bc66a700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f00ab4eb700' had timed out after 600
>>>>>>> 
>>>>>>> The interesting thing is that the cluster health is fine and only some files are not working properly. Most of them work just fine. A restart of radosgw fixes the issue. The other ceph logs are also clean.
>>>>>>> 
>>>>>>> Any idea why this happens?
>>>>>>> 
>>>>>> 
>>>>>> No, but you can turn on 'debug ms = 1' on your gateway ceph.conf, 
>>>>>> and that might give some better indication.
>>>>>> 
>>>>>> Yehuda
>>>>> 
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>> 
>> 
> 
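
Yehuda's suggestion earlier in the thread, searching the log for the thread id named in the heartbeat timeout messages (with the 0x prefix dropped), could be scripted roughly as follows. This is only a sketch in Python: the function names are made up for illustration, and it assumes that every log line carries the id of the thread that wrote it as a separate whitespace-delimited field, as in the excerpts quoted above.

```python
import re

# Thread pointer inside heartbeat timeout lines, e.g. the one quoted in this thread:
#   heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f00934bb700' had timed out after 600
TIMED_OUT = re.compile(r"thread 0x([0-9a-f]+)' had timed out")


def timed_out_thread_ids(lines):
    """Thread ids (0x prefix dropped) named in timeout lines, in first-seen order."""
    ids = []
    for line in lines:
        match = TIMED_OUT.search(line)
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return ids


def last_lines_for_thread(lines, thread_id, count=5):
    """Last `count` lines that carry `thread_id` as a whitespace-separated field.

    Assumption: each log line written by a thread contains its id as a
    standalone field, as in the excerpts quoted in this thread.
    """
    own = [line.rstrip("\n") for line in lines if thread_id in line.split()]
    return own[-count:]


def report(path, count=5):
    """Print the last activity of every thread reported as timed out in `path`."""
    with open(path, errors="replace") as handle:
        lines = handle.readlines()
    for thread_id in timed_out_thread_ids(lines):
        print(f"--- thread {thread_id} ---")
        for line in last_lines_for_thread(lines, thread_id, count):
            print(line)
```

Calling report() with the path of the radosgw log on the gateway prints the last few lines of each thread that was reported as timed out. The heartbeat lines themselves are skipped, because there the id only appears with the 0x prefix.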

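The select() limit Yehuda mentions applies to descriptor numbers rather than to the number of open files, so a quick check is whether any descriptor of the radosgw process is numbered 1024 or higher. A minimal sketch in Python, assuming Linux with /proc mounted and FD_SETSIZE equal to 1024 (the usual glibc value); both function names are hypothetical:

```python
import os

# select() can only watch descriptors numbered below FD_SETSIZE, which is
# assumed to be 1024 here; that is the limit mentioned in this thread.
FD_SETSIZE = 1024


def open_fd_numbers(pid):
    """Descriptor numbers currently open in process `pid` (Linux /proc only)."""
    return sorted(int(name) for name in os.listdir(f"/proc/{pid}/fd"))


def exceeds_select_limit(fd_numbers, limit=FD_SETSIZE):
    """True if any descriptor number is too high for a select()-based library."""
    return any(fd >= limit for fd in fd_numbers)
```

exceeds_select_limit(open_fd_numbers(pid)) with the pid of the radosgw process then answers the question directly. Reading /proc/<pid>/fd needs to run as the same user as radosgw or as root.
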