We were actually able to find the culprit yesterday. While the nginx workaround might be a valid solution (it really depends on how nginx reads from the fastcgi socket), it doesn't fix the underlying radosgw issue. We'll prepare the fix and get it into a proper release soon.

Yehuda

On Wed, Dec 18, 2013 at 12:39 AM, Sebastian <webmaster@xxxxxxxx> wrote:
> Hi,
>
> thanks for the nginx config. We did try it and it works like a charm. No more stalling downloads. I wonder why Apache does this, though.
>
> Sebastian
>
> On 04.12.2013, at 10:53, Yann ROBIN wrote:
>
>> Hi,
>>
>> Our conf:
>> server {
>>     listen 80;
>>     listen [::]:80;
>>
>>     server_name radosgw-prod;
>>
>>     client_max_body_size 1000m;
>>     error_log /var/log/nginx/radosgw-prod-error.log;
>>     access_log off;
>>
>>     location / {
>>         fastcgi_pass_header Authorization;
>>         fastcgi_pass_request_headers on;
>>
>>         if ($request_method = PUT) {
>>             rewrite ^ /PUT$request_uri;
>>         }
>>
>>         include fastcgi_params;
>>         client_max_body_size 0;
>>
>>         fastcgi_busy_buffers_size 512k;
>>         fastcgi_buffer_size 512k;
>>         fastcgi_buffers 16 512k;
>>         fastcgi_read_timeout 2s;
>>         fastcgi_send_timeout 1s;
>>         fastcgi_connect_timeout 1s;
>>
>>         fastcgi_next_upstream error timeout http_500 http_503;
>>         fastcgi_pass ceph-rgw;
>>     }
>>
>>     location /PUT/ {
>>         internal;
>>         fastcgi_pass_header Authorization;
>>         fastcgi_pass_request_headers on;
>>
>>         include fastcgi_params;
>>         client_max_body_size 0;
>>         fastcgi_param CONTENT_LENGTH $content_length;
>>
>>         fastcgi_busy_buffers_size 512k;
>>         fastcgi_buffer_size 512k;
>>         fastcgi_buffers 16 512k;
>>
>>         fastcgi_pass ceph-rgw;
>>     }
>> }
>>
>> Content-Length is only sent with PUT requests because there was an issue with older versions of the radosgw.
>>
>> DON'T activate keep-alive: connections are not closed on the radosgw side when keep-alive is enabled, leading to too many open connections on the rgw.
>> We use this configuration with a TCP socket, not a local (Unix) one.
>>
>> -----Original Message-----
>> From: ceph-users-bounces@xxxxxxxxxxxxxx [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Sebastian
>> Sent: Wednesday, December 4, 2013 10:29
>> To: ceph-users
>> Subject: Re: radosgw daemon stalls on download of some files
>>
>> Hi,
>>
>> we are currently using the patched fastcgi version (2.4.7-0910042141-6-gd4fffda). Updating to a more recent version is currently blocked by http://tracker.ceph.com/issues/6453
>>
>> Is there documentation for running radosgw with nginx? I only find some mailing list posts with config snippets.
>>
>> Sebastian
>>
>> On 30.11.2013, at 20:46, Andrew Woodward wrote:
>>
>>> Are you using the Inktank-patched FastCGI server?
>>> http://gitbuilder.ceph.com
>>>
>>> Alternatively, try another script server like nginx, as already suggested.
>>>
>>> On Nov 29, 2013 12:23 PM, "German Anders" <ganders@xxxxxxxxxxxx> wrote:
>>> Thanks a lot Sebastian, I'm going to try that. I'm also having an issue while trying to test an rbd creation; I've installed the ceph client on the deploy server:
>>>
>>> ceph@ceph-deploy01:/etc/ceph$ sudo rbd -n client.ceph-test -k /home/ceph/ceph-cluster/ceph.client.admin.keyring create --size 10240 cephdata
>>> 2013-11-29 15:20:25.683930 7fcd9979c780 0 librados: client.ceph-openstack authentication error (1) Operation not permitted
>>> rbd: couldn't connect to the cluster!
>>>
>>> Anyone know what could be the issue here? Maybe it has something to do with keys, or maybe not...
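>>>
>>> Just guessing on my side, but since the error mentions client.ceph-openstack even though I passed -n client.ceph-test, I suspect the keyring I pointed -k at simply doesn't contain a key for that user (or something in ceph.conf overrides the name). This is roughly what I plan to check next, untested:
>>>
>>> # which users does the keyring file actually contain?
>>> cat /home/ceph/ceph-cluster/ceph.client.admin.keyring
>>>
>>> # does client.ceph-test exist in the cluster, and what caps does it have?
>>> sudo ceph auth get client.ceph-test
>>>
>>> # if it exists, export its own keyring and point rbd at that file instead
>>> sudo ceph auth get client.ceph-test -o /home/ceph/ceph-cluster/ceph.client.ceph-test.keyring
>>> sudo rbd -n client.ceph-test -k /home/ceph/ceph-cluster/ceph.client.ceph-test.keyring create --size 10240 cephdata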
>>>
>>> Thanks in advance,
>>>
>>> Best regards,
>>>
>>> German Anders
>>>
>>>> --- Original message ---
>>>> Subject: Re: radosgw daemon stalls on download of some files
>>>> From: Sebastian <webmaster@xxxxxxxx>
>>>> To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>>>> Date: Friday, 29/11/2013 16:18
>>>>
>>>> Hi Yehuda,
>>>>
>>>>> It's interesting: the responses are received, but it seems that they
>>>>> aren't being handled (hence the following pings). There are a few
>>>>> things that you could look at. First, try to connect to the admin
>>>>> socket and see if you get any useful information from there. This
>>>>> could include in-flight requests; look for other requests that have
>>>>> not completed. Also see if there's any indication of request throttling.
>>>>
>>>> Do you refer to the methods mentioned here? http://ceph.com/docs/dumpling/radosgw/troubleshooting/
>>>> Unfortunately the socket file is not present. Do I have to activate it in the config somehow? I could not find any reference to that in the docs. Is it already included in my radosgw version?
>>>> radosgw -v
>>>> ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)
>>>>
>>>>> Another thing to look at would be the seemingly unrelated timeout
>>>>> messages. These should not happen and might indicate that there's
>>>>> something holding you up that shouldn't. Try searching for the same
>>>>> thread id that is specified in these messages (omit the 0x prefix),
>>>>> and see what the last thing it's doing is.
>>>>
>>>> I checked that:
>>>> http://pastebin.com/Z23PWwjt
>>>> I do not see anything unusual before the messages happen, but maybe you see something odd.
>>>>
>>>>> You could also try turning on 'debug objecter = 20' and see if it
>>>>> provides more info (it's very verbose though).
>>>>
>>>> Did that, but that is way too verbose for me ;) I uploaded it here:
>>>> http://pastebin.com/VBPAVP6z
>>>> There might be some other requests mixed into it, but the one for cdn/52974400c6dd6ca719000004/source.avi is the one that stalled.
>>>>
>>>>> How much are you loading the gateway before that happens? We've seen
>>>>> a similar issue in the past that was related to the fcgi library
>>>>> that is dynamically linked with the radosgw process (that is, not
>>>>> the apache mod_fastcgi module). This, however, would only happen
>>>>> under heavy load, when the fd numbers handled by the radosgw
>>>>> surpassed 1024 (a buggy library that was using select() instead of poll()).
>>>>
>>>> There are not that many requests on the storage, maybe 10-20 req/min. The cluster serves as a source for a CDN, so once a resource is fetched it should not be fetched again soon. I checked the open files, and there are only about 10-20 open file handles for the radosgw process. So this is probably not the issue.
>>>>
>>>> Sebastian
>>>>
>>>>> Yehuda
>>>>>
>>>>> On Fri, Nov 29, 2013 at 7:28 AM, Sebastian <webmaster@xxxxxxxx> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> thanks for the hint. I tried this again and noticed that the timeout message does seem to be unrelated. Here is the log file for a stalling request with debug turned on:
>>>>>> http://pastebin.com/DcQuc9wP
>>>>>>
>>>>>> I really cannot find a real "error" in the log. The download stalls at about 500 kB at that point, though. Restarting radosgw fixes it for one download only; the next one is broken again. But as I said, this does not happen for all files.
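>>>>>>
>>>>>> In case it matters, the debug settings I turned on in the gateway's ceph.conf were roughly the following (the section name is just what our gateway instance happens to use, and "debug rgw" is something extra I enabled on top of your "debug ms" suggestion):
>>>>>>
>>>>>> [client.radosgw.gateway]
>>>>>>     # messenger debugging as suggested, plus verbose rgw logging
>>>>>>     debug ms = 1
>>>>>>     debug rgw = 20
>>>>>>     log file = /var/log/ceph/client.radosgw.log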
>>>>>>
>>>>>> Sebastian
>>>>>>
>>>>>> On 27.11.2013, at 21:53, Yehuda Sadeh wrote:
>>>>>>
>>>>>>> On Wed, Nov 27, 2013 at 4:46 AM, Sebastian <webmaster@xxxxxxxx> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> we have a setup of 4 servers running Ceph and radosgw. We use it as an internal S3 service for our files. The servers run Debian Squeeze with Ceph 0.67.4.
>>>>>>>>
>>>>>>>> The cluster has been running smoothly for quite a while, but we are currently experiencing issues with the radosgw. For some files the HTTP download just stalls at around 500 kB.
>>>>>>>>
>>>>>>>> The Apache error log just says:
>>>>>>>> [error] [client ] FastCGI: comm with server "/var/www/s3gw.fcgi" aborted: idle timeout (30 sec)
>>>>>>>> [error] [client ] Handler for fastcgi-script returned invalid result code 1
>>>>>>>>
>>>>>>>> radosgw logging:
>>>>>>>> 7f00bc66a700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f00934bb700' had timed out after 600
>>>>>>>> 7f00bc66a700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f00ab4eb700' had timed out after 600
>>>>>>>>
>>>>>>>> The interesting thing is that the cluster health is fine and only some files are not working properly. Most of them just work fine. A restart of radosgw fixes the issue. The other Ceph logs are also clean.
>>>>>>>>
>>>>>>>> Any idea why this happens?
>>>>>>>
>>>>>>> No, but you can turn on 'debug ms = 1' in your gateway ceph.conf,
>>>>>>> and that might give some better indication.
>>>>>>>
>>>>>>> Yehuda

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com