Re: RADOS Gateway Issues

Hi Yehuda,

I'm wondering if part of the problem is disk I/O. Running "iotop -o" on the three nodes, I see 20 MB/s to 100 MB/s of reads on two of them but less than 1 MB/s on the third. All of them have two OSDs, one on each disk, and all are running ceph-mon.

There shouldn't be anything different between them, but does that level of disk read seem rather high to you?
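
If it would help to compare the nodes more directly than per-node iotop, I was thinking of something along these lines (the device names are just examples, and "ceph osd perf" assumes a release recent enough to have it):

    # Per-OSD commit/apply latency as reported by the cluster
    ceph osd perf

    # Map OSDs to hosts/disks to see which ones sit on the slow node
    ceph osd tree

    # Raw per-disk load, run locally on each node (device names are examples)
    iostat -x 2 /dev/sdb /dev/sdc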

Best regards

Graeme  

 
On 22/01/14 16:55, Graeme Lambert wrote:
Hi Yehuda,

Regarding the health status of the cluster: it isn't healthy, but I haven't found any way of fixing the placement group errors. Looking at "ceph health detail", it's also showing blocked requests:

HEALTH_WARN 1 pgs down; 3 pgs incomplete; 3 pgs stuck inactive; 3 pgs stuck unclean; 7 requests are blocked > 32 sec; 3 osds have slow requests; pool cloudstack has too few pgs; pool .rgw.buckets has too few pgs
pg 14.0 is stuck inactive since forever, current state incomplete, last acting [5,0]
pg 14.2 is stuck inactive since forever, current state incomplete, last acting [0,5]
pg 14.6 is stuck inactive since forever, current state down+incomplete, last acting [4,2]
pg 14.0 is stuck unclean since forever, current state incomplete, last acting [5,0]
pg 14.2 is stuck unclean since forever, current state incomplete, last acting [0,5]
pg 14.6 is stuck unclean since forever, current state down+incomplete, last acting [4,2]
pg 14.0 is incomplete, acting [5,0]
pg 14.2 is incomplete, acting [0,5]
pg 14.6 is down+incomplete, acting [4,2]
3 ops are blocked > 8388.61 sec
3 ops are blocked > 4194.3 sec
1 ops are blocked > 2097.15 sec
1 ops are blocked > 8388.61 sec on osd.0
1 ops are blocked > 4194.3 sec on osd.0
2 ops are blocked > 8388.61 sec on osd.4
2 ops are blocked > 4194.3 sec on osd.5
1 ops are blocked > 2097.15 sec on osd.5
3 osds have slow requests
pool cloudstack objects per pg (37316) is more than 27.1587 times cluster average (1374)
pool .rgw.buckets objects per pg (76219) is more than 55.4723 times cluster average (1374)


Ignore the cloudstack pool: I was using CloudStack but am not any more, so it's an inactive pool.
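
For reference, these are the sorts of commands I've been looking at for digging into the stuck PGs (PG IDs taken from the output above; the pg_num value is purely illustrative and I haven't applied it):

    # Peering history and detailed state for one of the incomplete PGs
    ceph pg 14.6 query

    # List everything stuck inactive or unclean in one go
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean

    # The "too few pgs" warnings would normally mean raising pg_num/pgp_num
    # on the pool, e.g. (value for illustration only):
    ceph osd pool set .rgw.buckets pg_num 256
    ceph osd pool set .rgw.buckets pgp_num 256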

Best regards

Graeme


 
On 22/01/14 16:38, Graeme Lambert wrote:
Hi,

Following discussions with people on IRC, I set debug_ms, and this is what loops over and over when one of the uploads gets stuck: http://pastebin.com/KVcpAeYT
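
(For anyone wanting to reproduce this: debug_ms can be raised either in ceph.conf or on the running gateway. A rough sketch, assuming the conventional client.radosgw.gateway section name and the default admin socket path, both of which may differ:)

    # In ceph.conf, then restart radosgw:
    [client.radosgw.gateway]
        debug ms = 1

    # Or on the running daemon via its admin socket:
    ceph --admin-daemon /var/run/ceph/ceph-client.radosgw.gateway.asok \
        config set debug_ms 1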

Regarding the modules, the Apache version is 2.2.22-2precise.ceph and the fastcgi module version is 2.4.7~0910052141-2~bpo70+1.ceph.

Best regards

Graeme  

 
On 22/01/14 16:28, Yehuda Sadeh wrote:
On Wed, Jan 22, 2014 at 8:05 AM, Graeme Lambert <glambert@xxxxxxxxxxx> wrote:
Hi,

I'm using the aws-sdk-for-php classes with the Ceph RADOS gateway, but I'm
getting an intermittent issue when uploading files.

I'm attempting to upload an array of objects to Ceph one by one using the
create_object() function. The loop appears to stop at random: it could stop
at the first object, somewhere in the middle, or at the last one; there's no
pattern to it that I can see.
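
For reference, the loop is roughly along these lines; this is a simplified sketch rather than the exact code, and the endpoint, bucket, keys and credentials are placeholders:

    <?php
    // Include path depends on how the aws-sdk-for-php (1.x) is installed
    require_once 'AWSSDKforPHP/sdk.class.php';

    // Point the S3 client at the radosgw endpoint (placeholder values)
    $s3 = new AmazonS3(array('key' => 'ACCESS_KEY', 'secret' => 'SECRET_KEY'));
    $s3->set_hostname('objects.example.com');
    $s3->enable_path_style();

    // Placeholder list of object key => local file path
    $objects = array('photo1.jpg' => '/tmp/photo1.jpg');

    foreach ($objects as $key => $path) {
        // create_object() blocks until radosgw answers; this is where it hangs
        $response = $s3->create_object('my-bucket', $key, array(
            'fileUpload' => $path,
            'acl'        => AmazonS3::ACL_PRIVATE,
        ));

        if (!$response->isOK()) {
            error_log("Upload of $key failed with HTTP " . $response->status);
            break;
        }
    }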

I'm not getting any PHP errors that indicate an issue, and no exceptions are
being caught either.

In the radosgw log file, at the time it appears stuck, I get:

2014-01-22 15:39:21.656763 7fac44fe1700  1 ====== starting new request
req=0x2417c30 =====

And then sometimes I see:

2014-01-22 15:40:42.490485 7fac99ff9700  1 heartbeat_map is_healthy
'RGWProcess::m_tp thread 0x7fac51ffb700' had timed out after 600

repeated over and over again.

When those messages are appearing, Apache's error log shows:

[Wed Jan 22 15:43:11 2014] [error] [client 172.16.2.149] FastCGI: comm with
server "/var/www/s3gw.fcgi" aborted: idle timeout (30 sec), referer:
https://[server]/[path]

also repeated over and over again.
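
For what it's worth, the 30 seconds looks like mod_fastcgi's default idle timeout on the FastCgiExternalServer line. I could presumably raise it along the lines below (the paths and socket are the usual example values from the docs), but I assume that would only hide whatever radosgw is actually waiting on:

    # In the Apache config for the radosgw virtual host
    FastCgiExternalServer /var/www/s3gw.fcgi -socket /var/run/ceph/radosgw.sock -idle-timeout 600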

I have restarted Apache, radosgw, and all of the Ceph OSD and ceph-mon
processes, and still no joy with this.

Can anyone advise on where I'm going wrong with this?

Which fastcgi module are you using? Can you provide a log with 'debug
ms = 1' for a failing request? Usually that kind of message means that
it's waiting for the OSD to respond, which might point at an
unhealthy cluster.
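
(If you're not sure which module build is in use, something like this should show what is loaded and installed; the package names assume the Debian/Ubuntu packaging:)

    apache2ctl -M | grep -i fastcgi
    dpkg -l apache2 libapache2-mod-fastcgi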

Yehuda



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
