Re: radosgw daemon stalls on download of some files

Yehuda Sadeh <yehuda@xxxxxxxxxxx> · Wed, 18 Dec 2013 08:09:43 -0800

We were actually able to find the culprit yesterday. While the nginx
workaround might be a valid solution (it really depends on how nginx
reads from the fastcgi socket), it doesn't fix the underlying radosgw
issue. We'll prepare the fix and get it into a proper release soon.

Yehuda

On Wed, Dec 18, 2013 at 12:39 AM, Sebastian <webmaster@xxxxxxxx> wrote:
> Hi,
>
> thanks for the nginx config. We did try it and it works like a charm. No more stalling downloads. I wonder why apache does this though.
>
> Sebastian
>
> On 04.12.2013, at 10:53, Yann ROBIN wrote:
>
>> Hi,
>>
>> Our conf :
>> server {
>>        listen  80;
>>        listen  [::]:80;
>>
>>        server_name     radosgw-prod;
>>
>>        client_max_body_size 1000m;
>>        error_log   /var/log/nginx/radosgw-prod-error.log;
>>        access_log  off;
>>
>>
>>        location / {
>>                fastcgi_pass_header     Authorization;
>>                fastcgi_pass_request_headers on;
>>
>>                if ($request_method  = PUT ) {
>>                        rewrite ^       /PUT$request_uri;
>>                }
>>
>>                include fastcgi_params;
>>                client_max_body_size    0;
>>
>>                fastcgi_busy_buffers_size 512k;
>>                fastcgi_buffer_size 512k;
>>                fastcgi_buffers 16 512k;
>>                fastcgi_read_timeout 2s;
>>                fastcgi_send_timeout 1s;
>>                fastcgi_connect_timeout 1s;
>>
>>
>>                fastcgi_next_upstream error timeout http_500 http_503;
>>                fastcgi_pass ceph-rgw;
>>        }
>>
>>        location /PUT/ {
>>                internal;
>>                fastcgi_pass_header     Authorization;
>>                fastcgi_pass_request_headers on;
>>
>>                include fastcgi_params;
>>                client_max_body_size    0;
>>                fastcgi_param  CONTENT_LENGTH   $content_length;
>>
>>                fastcgi_busy_buffers_size 512k;
>>                fastcgi_buffer_size 512k;
>>                fastcgi_buffers 16 512k;
>>
>>                fastcgi_pass ceph-rgw;
>>        }
>> }
>>
>>
>> Content-Length is only sent with PUT requests because there was an issue with older versions of radosgw.
>>
>> DON'T activate keep-alive: connections are not closed on the radosgw side when the keep-alive option is activated, which leads to too many open connections on the rgw.
>> We use this configuration with a TCP socket, not a local one.
>>
>> -----Original Message-----
>> From: ceph-users-bounces@xxxxxxxxxxxxxx [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Sebastian
>> Sent: Wednesday, 4 December 2013 10:29
>> To: ceph-users
>> Subject: Re:  radosgw daemon stalls on download of some files
>>
>> Hi,
>>
>> we are currently using the patched fastcgi version (2.4.7-0910042141-6-gd4fffda). Updating to a more recent version is currently blocked by http://tracker.ceph.com/issues/6453
>>
>> Is there any documentation for running radosgw with nginx? I could only find some mailing list posts with config snippets.
>>
>> Sebastian
>>
>> On 30.11.2013, at 20:46, Andrew Woodward wrote:
>>
>>> Are you using the Inktank-patched FastCGI server?
>>> http://gitbuilder.ceph.com
>>>
>>> Alternatively, try another server such as nginx, as already suggested.
>>>
>>> On Nov 29, 2013 12:23 PM, "German Anders" <ganders@xxxxxxxxxxxx> wrote:
>>> Thanks a lot Sebastian, I'm going to try that. I'm also having an issue while trying to test an rbd creation. I've installed the ceph-client on the deploy server:
>>>
>>> ceph@ceph-deploy01:/etc/ceph$ sudo rbd -n client.ceph-test -k /home/ceph/ceph-cluster/ceph.client.admin.keyring create --size 10240 cephdata
>>> 2013-11-29 15:20:25.683930 7fcd9979c780  0 librados: client.ceph-openstack authentication error (1) Operation not permitted
>>> rbd: couldn't connect to the cluster!
>>>
>>> Anyone know what could be the issue here? maybe it has something to do with keys or maybe not...
>>>
>>> Thanks in advance,
>>>
>>> Best regards,
>>>
>>> German Anders
>>>
>>>> --- Original message ---
>>>> Subject: Re:  radosgw daemon stalls on download of some files
>>>> From: Sebastian <webmaster@xxxxxxxx>
>>>> To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>>>> Date: Friday, 29/11/2013 16:18
>>>>
>>>> Hi Yehuda,
>>>>
>>>>
>>>>> It's interesting, the responses are received but seems that they
>>>>> aren't being handled (hence the following pings). There are a few
>>>>> things that you could look at. First, try to connect to the admin
>>>>> socket and see if you get any useful information from there. This
>>>>> could include in-flight requests, look for other requests that have
>>>>> not completed. Also see if there's indication for requests throttling.
>>>>
>>>> Are you referring to the methods described at http://ceph.com/docs/dumpling/radosgw/troubleshooting/ ?
>>>> Unfortunately the socket file is not present. Do I have to activate it in the config somehow? I could not find any reference to that in the docs. Is it already included in my radosgw version?
>>>> radosgw -v
>>>> ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)
>>>>
>>>>> Another thing to look at would be at the seemingly unrelated timeout
>>>>> messages. These should not happen and might indicate that there's
>>>>> something that is holding you up that shouldn't. Try searching for
>>>>> the same thread id that is specified in these messages (omit the 0x
>>>>> prefix), and see what's the last thing that it's doing.
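
The thread-id search suggested above can be sketched in Python. This is only a minimal sketch under assumptions, not a radosgw tool: it assumes every log line carries the id of the logging thread as a bare hex field (as in the heartbeat_map lines quoted in this thread), the function names are hypothetical, and the sample lines other than the heartbeat message are made up for illustration.

```python
import re

# Matches the thread address inside a heartbeat timeout message, e.g.
# "... 'RGWProcess::m_tp thread 0x7f00934bb700' had timed out after 600"
TIMEOUT_RE = re.compile(r"thread 0x([0-9a-f]+)' had timed out")


def timed_out_threads(lines):
    """Return the thread ids (0x prefix omitted) named in timeout messages."""
    seen = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def last_activity(lines, thread_id):
    """Return the last line logged by thread_id itself, or None."""
    last = None
    for line in lines:
        # Assumption: the logging thread's id is a standalone
        # whitespace-separated field, so the quoted "0x..." mention inside
        # a heartbeat message does not count as a match.
        if thread_id in line.split():
            last = line
    return last


# Made-up sample; only the heartbeat line follows the format seen in the thread.
SAMPLE_LOG = [
    "7f00934bb700 20 example: request started",
    "7f00ab4eb700 20 example: unrelated request",
    "7f00934bb700 20 example: waiting on read",
    "7f00bc66a700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread "
    "0x7f00934bb700' had timed out after 600",
]

for tid in timed_out_threads(SAMPLE_LOG):
    print(tid, "->", last_activity(SAMPLE_LOG, tid))
```

With a real log, the list of lines would come from reading the radosgw log file; the field layout should be checked against the actual log before relying on the match.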
>>>>
>>>> I checked that:
>>>> http://pastebin.com/Z23PWwjt
>>>> I do not see anything unusual before the messages appear, but maybe you will spot something odd.
>>>>
>>>>
>>>>> You could also try turning on 'debug objecter = 20' and see if it
>>>>> provides more info (it's very verbose though).
>>>>>
>>>>
>>>> Did that, but that is way too verbose for me ;) I uploaded it here:
>>>> http://pastebin.com/VBPAVP6z
>>>> There might be some requests mixed into it, but the one for cdn/52974400c6dd6ca719000004/source.avi is the one that stalled.
>>>>
>>>>> How much are you loading the gateway before that happens? We've seen
>>>>> a similar issue in the past that was related to the fcgi library
>>>>> that is dynamically linked with the radosgw process (that is, not
>>>>> the apache mod_fastcgi module). This, however, would only happen
>>>>> when there's heavy load and the fd numbers handled by the radosgw
>>>>> surpassed 1024 (buggy library that was using select() instead of poll()).
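
The select() limit mentioned above can be illustrated from Python. This is a sketch assuming a typical Linux build where FD_SETSIZE is 1024; the helper name is hypothetical, and it demonstrates the general limitation rather than the fcgi library's actual code.

```python
import os
import select


def select_can_watch(fd):
    """Return False when select() cannot represent fd in its fd_set."""
    try:
        # CPython checks the descriptor against FD_SETSIZE before calling
        # select(2) and raises ValueError when it is out of range.
        select.select([fd], [], [], 0)
    except ValueError:
        return False
    return True


read_end, write_end = os.pipe()
print(select_can_watch(read_end))  # a low-numbered, open descriptor
print(select_can_watch(4096))      # beyond the assumed FD_SETSIZE of 1024
os.close(read_end)
os.close(write_end)
```

poll() takes an array of descriptors instead of a fixed-size bitmap, which is why it is not subject to the same ceiling.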
>>>>
>>>> There are not that many requests on the Storage, maybe 10-20 req/min. The cluster serves as a source for a CDN, so once the resource is fetched it should not be fetched again soon. I checked for the open files, and there are only about 10-20 open file handles for the radosgw process. So this probably is not the issue.
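
A check like the one described above can be sketched in Python. It assumes Linux with /proc mounted and the usual select() limit of 1024; the function names are hypothetical, and the pid of the radosgw process would have to be supplied by hand.

```python
import os

FD_SETSIZE = 1024  # usual select() limit on Linux; an assumption here


def open_fd_numbers(pid):
    """Return the sorted descriptor numbers listed in /proc/<pid>/fd."""
    return sorted(int(name) for name in os.listdir(f"/proc/{pid}/fd"))


def exceeds_select_limit(fds, limit=FD_SETSIZE):
    """True if any descriptor number is too large for select()."""
    return any(fd >= limit for fd in fds)


if os.path.isdir("/proc/self/fd"):
    fds = open_fd_numbers(os.getpid())  # inspect this process as a demo
    print(len(fds), "open descriptors, highest is", fds[-1])
    print("beyond select() limit:", exceeds_select_limit(fds))
```

The highest descriptor number matters more than the count, since the problem described earlier depends on fd numbers passing 1024 rather than on how many files are open.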
>>>>
>>>> Sebastian
>>>>
>>>>
>>>>>
>>>>> Yehuda
>>>>>
>>>>> On Fri, Nov 29, 2013 at 7:28 AM, Sebastian <webmaster@xxxxxxxx> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> thanks for the hint. I tried this again and noticed that the timeout message does seem to be unrelated. Here is the log file for a stalling request with debug turned on:
>>>>>> http://pastebin.com/DcQuc9wP
>>>>>>
>>>>>> I cannot find an actual "error" in the log. The download stalls at about 500kb at that point though. Restarting radosgw fixes it for 1 download only; the next one is broken again. But as I said, this does not happen for all files.
>>>>>>
>>>>>> Sebastian
>>>>>>
>>>>>> On 27.11.2013, at 21:53, Yehuda Sadeh wrote:
>>>>>>
>>>>>>> On Wed, Nov 27, 2013 at 4:46 AM, Sebastian <webmaster@xxxxxxxx> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> we have a setup of 4 Servers running ceph and radosgw. We use it as an internal S3 service for our files. The Servers run Debian Squeeze with Ceph 0.67.4.
>>>>>>>>
>>>>>>>> The cluster has been running smoothly for quite a while, but we are currently experiencing issues with the radosgw. For some files the HTTP Download just stalls at around 500kb.
>>>>>>>>
>>>>>>>> The Apache error log just says:
>>>>>>>> [error] [client ] FastCGI: comm with server "/var/www/s3gw.fcgi" aborted: idle timeout (30 sec)
>>>>>>>> [error] [client ] Handler for fastcgi-script returned invalid result code 1
>>>>>>>>
>>>>>>>> radosgw logging:
>>>>>>>> 7f00bc66a700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f00934bb700' had timed out after 600
>>>>>>>> 7f00bc66a700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f00ab4eb700' had timed out after 600
>>>>>>>>
>>>>>>>> The interesting thing is that the cluster health is fine and only some files are not working properly. Most of them work just fine. A restart of radosgw fixes the issue. The other ceph logs are also clean.
>>>>>>>>
>>>>>>>> Any idea why this happens?
>>>>>>>>
>>>>>>>
>>>>>>> No, but you can turn on 'debug ms = 1' on your gateway ceph.conf,
>>>>>>> and that might give some better indication.
>>>>>>>
>>>>>>> Yehuda
>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com