Re: radosgw daemon stalls on download of some files

Sebastian <webmaster@xxxxxxxx> · Fri, 29 Nov 2013 16:52:52 +0100

Hi,

our ceph -w is clean:

  cluster e54d66c5-5191-4296-a217-f818e1f92830
   health HEALTH_OK
   monmap e1: 4 mons at {a=5.9.67.9:6789/0,b=5.9.67.8:6789/0,c=5.9.67.7:6789/0,d=5.9.67.6:6789/0}, election epoch 19724, quorum 0,1,2,3 a,b,c,d
   osdmap e1629: 4 osds: 4 up, 4 in
    pgmap v4896303: 1824 pgs: 1824 active+clean; 3203 GB data, 6410 GB used, 3584 GB / 9995 GB avail; 2006KB/s rd, 0op/s
   mdsmap e9529: 1/1/1 up {0=c=up:active}, 3 up:standby

2013-11-29 16:50:34.353780 mon.0 [INF] pgmap v4896303: 1824 pgs: 1824 active+clean; 3203 GB data, 6410 GB used, 3584 GB / 9995 GB avail; 2006KB/s rd, 0op/s
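
In case anyone wants to script this check instead of eyeballing it, here is a minimal sketch that parses a pgmap line like the one above. It is not a Ceph tool: the regex is an assumption based only on the line format shown here, and the function names are made up for illustration.

```python
import re

# Assumed format, taken from the single pgmap line shown above:
#   "pgmap v<N>: <total> pgs: <count> <state>[, <count> <state>...]; ..."
PGMAP_RE = re.compile(r"pgmap v\d+: (?P<total>\d+) pgs: (?P<states>[^;]+);")


def pg_states(line):
    """Return (total_pgs, {state: count}) or None if the line has no pgmap."""
    match = PGMAP_RE.search(line)
    if match is None:
        return None
    states = {}
    for part in match.group("states").split(","):
        count, state = part.strip().split(" ", 1)
        states[state] = int(count)
    return int(match.group("total")), states


def all_clean(line):
    """True only if every PG is reported as exactly 'active+clean'."""
    parsed = pg_states(line)
    if parsed is None:
        return False
    total, states = parsed
    return states.get("active+clean", 0) == total
```

Anything other than plain active+clean (a scrubbing state, for example) makes all_clean() return False, which seemed like the useful behaviour given Artem's remark about scrubbing.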

We already tried restarting the whole stack once but that did not help (at least not for long). So no luck there.

Is 0.72 good for production yet? Might that fix it ;)?

Sebastian

On 29.11.2013, at 16:46, Artem Silenkov wrote:

> Good day!
> We've noticed this kind of thing recently during OSD recovery activity such as scrubbing. Restarting the OSD did the trick. We even had 404 errors until deep scrubbing ended.
> Any noise in ceph -w?
> Regards, Artem S.
> On 29 Nov 2013 at 22:28, "Sebastian" <webmaster@xxxxxxxx> wrote:
> >
> > Hi,
> >
> > thanks for the hint. I tried this again and noticed that the timeout message seems to be unrelated. Here is the log file for a stalling request with debug turned on:
> > http://pastebin.com/DcQuc9wP
> >
> > I cannot find an actual "error" in the log. The download still stalls at about 500kb, though. Restarting radosgw fixes it for one download only; the next one is broken again. As I said, this does not happen for all files.
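
To pin down exactly where a given object stalls, something like the following could be run repeatedly against the same file. This is only a sketch under assumptions: the URL must be readable without S3 request signing (public or pre-signed), the function names are hypothetical, and chunked responses are not handled specially.

```python
import socket
import urllib.request


def bytes_before_stall(stream, chunk_size=64 * 1024):
    """Read until EOF or a read timeout; return (bytes_received, timed_out)."""
    received = 0
    while True:
        try:
            chunk = stream.read(chunk_size)
        except (socket.timeout, TimeoutError):
            # No data arrived within the socket timeout: treat as a stall.
            return received, True
        if not chunk:
            # EOF; may still be short of Content-Length if the server gave up.
            return received, False
        received += len(chunk)


def probe(url, timeout=10):
    """Download url; return (received, timed_out, expected_length_or_None)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        expected = resp.headers.get("Content-Length")
        received, timed_out = bytes_before_stall(resp)
    return received, timed_out, int(expected) if expected is not None else None
```

A result where received is well below the expected length, with or without timed_out set, would point at the stall described above. The 10 second default is an arbitrary choice, kept below Apache's 30 second idle timeout so the client notices the stall first.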
> >
> > Sebastian
> >
> > On 27.11.2013, at 21:53, Yehuda Sadeh wrote:
> >
> > > On Wed, Nov 27, 2013 at 4:46 AM, Sebastian <webmaster@xxxxxxxx> wrote:
> > >> Hi,
> > >>
> > >> we have a setup of 4 servers running ceph and radosgw. We use it as an internal S3 service for our files. The servers run Debian Squeeze with Ceph 0.67.4.
> > >>
> > >> The cluster has been running smoothly for quite a while, but we are currently experiencing issues with radosgw. For some files the HTTP download just stalls at around 500kb.
> > >>
> > >> The Apache error log just says:
> > >> [error] [client ] FastCGI: comm with server "/var/www/s3gw.fcgi" aborted: idle timeout (30 sec)
> > >> [error] [client ] Handler for fastcgi-script returned invalid result code 1
> > >>
> > >> radosgw logging:
> > >> 7f00bc66a700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f00934bb700' had timed out after 600
> > >> 7f00bc66a700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f00ab4eb700' had timed out after 600
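
For counting how many worker threads are affected, a small sketch that tallies heartbeat_map timeouts per thread from the radosgw log. The regex is an assumption based only on the two lines quoted above, and stuck_threads is a made-up helper name.

```python
import re
from collections import Counter

# Assumed format, taken from the two log lines quoted above.
HEARTBEAT_RE = re.compile(
    r"heartbeat_map is_healthy '(?P<name>.+?) thread (?P<thread>0x[0-9a-f]+)' "
    r"had timed out after (?P<seconds>\d+)"
)


def stuck_threads(lines):
    """Count timeout reports per (thread pool name, thread id)."""
    counts = Counter()
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            counts[(match.group("name"), match.group("thread"))] += 1
    return counts
```

It takes any iterable of lines, so an open log file can be passed directly; the log path depends on the local ceph.conf and is not assumed here.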
> > >>
> > >> The interesting thing is that the cluster health is fine and only some files are not working properly. Most of them work just fine. A restart of radosgw fixes the issue. The other ceph logs are also clean.
> > >>
> > >> Any idea why this happens?
> > >>
> > >
> > > No, but you can turn on 'debug ms = 1' in your gateway's ceph.conf, and
> > > that might give some better indication.
> > >
> > > Yehuda
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
