Re: RADOS Gateway Issues

Hi Yehuda,

I'm wondering if part of the problem is disk I/O. Running "iotop -o" on the three nodes, I see 20 MB/s to 100 MB/s of reads on two of them but less than 1 MB/s on the third. All of them have two OSDs, one on each disk, and all are running ceph-mon.

There shouldn't be anything different between them, but does that level of disk read seem rather high to you?
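
If it would help to compare the nodes more directly than per-node iotop, I was thinking of something along these lines (the device names are just examples, and "ceph osd perf" assumes a release recent enough to have it):

    # Per-OSD commit/apply latency as reported by the cluster
    ceph osd perf

    # Map OSDs to hosts/disks to see which ones sit on the slow node
    ceph osd tree

    # Raw per-disk load, run locally on each node (device names are examples)
    iostat -x 2 /dev/sdb /dev/sdc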

Best regards

Graeme  

 
On 22/01/14 16:55, Graeme Lambert wrote:
Hi Yehuda,

Regarding the health status of the cluster: it isn't healthy, but I haven't found any way of fixing the placement group errors. Looking at "ceph health detail", it's also showing blocked requests:

HEALTH_WARN 1 pgs down; 3 pgs incomplete; 3 pgs stuck inactive; 3 pgs stuck unclean; 7 requests are blocked > 32 sec; 3 osds have slow requests; pool cloudstack has too few pgs; pool .rgw.buckets has too few pgs
pg 14.0 is stuck inactive since forever, current state incomplete, last acting [5,0]
pg 14.2 is stuck inactive since forever, current state incomplete, last acting [0,5]
pg 14.6 is stuck inactive since forever, current state down+incomplete, last acting [4,2]
pg 14.0 is stuck unclean since forever, current state incomplete, last acting [5,0]
pg 14.2 is stuck unclean since forever, current state incomplete, last acting [0,5]
pg 14.6 is stuck unclean since forever, current state down+incomplete, last acting [4,2]
pg 14.0 is incomplete, acting [5,0]
pg 14.2 is incomplete, acting [0,5]
pg 14.6 is down+incomplete, acting [4,2]
3 ops are blocked > 8388.61 sec
3 ops are blocked > 4194.3 sec
1 ops are blocked > 2097.15 sec
1 ops are blocked > 8388.61 sec on osd.0
1 ops are blocked > 4194.3 sec on osd.0
2 ops are blocked > 8388.61 sec on osd.4
2 ops are blocked > 4194.3 sec on osd.5
1 ops are blocked > 2097.15 sec on osd.5
3 osds have slow requests
pool cloudstack objects per pg (37316) is more than 27.1587 times cluster average (1374)
pool .rgw.buckets objects per pg (76219) is more than 55.4723 times cluster average (1374)


Ignore the cloudstack pool: I was using CloudStack but am not any more, so it's an inactive pool.
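
For reference, these are the sorts of commands I've been looking at for digging into the stuck PGs (PG IDs taken from the output above; the pg_num value is purely illustrative and I haven't applied it):

    # Peering history and detailed state for one of the incomplete PGs
    ceph pg 14.6 query

    # List everything stuck inactive or unclean in one go
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean

    # The "too few pgs" warnings would normally mean raising pg_num/pgp_num
    # on the pool, e.g. (value for illustration only):
    ceph osd pool set .rgw.buckets pg_num 256
    ceph osd pool set .rgw.buckets pgp_num 256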

Best regards

Graeme


 
On 22/01/14 16:38, Graeme Lambert wrote:
Hi,

Following discussions with people on IRC, I set debug_ms, and this is what loops over and over when one of the uploads gets stuck: http://pastebin.com/KVcpAeYT
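
(For anyone wanting to reproduce this: debug_ms can be raised either in ceph.conf or on the running gateway. A rough sketch, assuming the conventional client.radosgw.gateway section name and the default admin socket path, both of which may differ:)

    # In ceph.conf, then restart radosgw:
    [client.radosgw.gateway]
        debug ms = 1

    # Or on the running daemon via its admin socket:
    ceph --admin-daemon /var/run/ceph/ceph-client.radosgw.gateway.asok \
        config set debug_ms 1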

Regarding the modules, the Apache version is 2.2.22-2precise.ceph and the fastcgi module version is 2.4.7~0910052141-2~bpo70+1.ceph.

Best regards

Graeme  

 
On 22/01/14 16:28, Yehuda Sadeh wrote:
On Wed, Jan 22, 2014 at 8:05 AM, Graeme Lambert <glambert@xxxxxxxxxxx> wrote:
Hi,

I'm using the aws-sdk-for-php classes with the Ceph RADOS gateway, but I'm
getting an intermittent issue when uploading files.

I'm attempting to upload an array of objects to Ceph one by one using the
create_object() function. The loop appears to stop at random: it could stop
at the first object, somewhere in the middle, or at the last one; there's no
pattern to it that I can see.
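
For reference, the loop is roughly along these lines; this is a simplified sketch rather than the exact code, and the endpoint, bucket, keys and credentials are placeholders:

    <?php
    // Include path depends on how the aws-sdk-for-php (1.x) is installed
    require_once 'AWSSDKforPHP/sdk.class.php';

    // Point the S3 client at the radosgw endpoint (placeholder values)
    $s3 = new AmazonS3(array('key' => 'ACCESS_KEY', 'secret' => 'SECRET_KEY'));
    $s3->set_hostname('objects.example.com');
    $s3->enable_path_style();

    // Placeholder list of object key => local file path
    $objects = array('photo1.jpg' => '/tmp/photo1.jpg');

    foreach ($objects as $key => $path) {
        // create_object() blocks until radosgw answers; this is where it hangs
        $response = $s3->create_object('my-bucket', $key, array(
            'fileUpload' => $path,
            'acl'        => AmazonS3::ACL_PRIVATE,
        ));

        if (!$response->isOK()) {
            error_log("Upload of $key failed with HTTP " . $response->status);
            break;
        }
    }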

I'm not getting any PHP errors that indicate an issue, and no exceptions are
being caught either.

In the radosgw log file, at the time it appears stuck, I get:

2014-01-22 15:39:21.656763 7fac44fe1700  1 ====== starting new request
req=0x2417c30 =====

And then sometimes I see:

2014-01-22 15:40:42.490485 7fac99ff9700  1 heartbeat_map is_healthy
'RGWProcess::m_tp thread 0x7fac51ffb700' had timed out after 600

repeated over and over again.

When those messages are appearing, Apache's error log shows:

[Wed Jan 22 15:43:11 2014] [error] [client 172.16.2.149] FastCGI: comm with
server "/var/www/s3gw.fcgi" aborted: idle timeout (30 sec), referer:
https://[server]/[path]

also repeated over and over again.
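
For what it's worth, the 30 seconds looks like mod_fastcgi's default idle timeout on the FastCgiExternalServer line. I could presumably raise it along the lines below (the paths and socket are the usual example values from the docs), but I assume that would only hide whatever radosgw is actually waiting on:

    # In the Apache config for the radosgw virtual host
    FastCgiExternalServer /var/www/s3gw.fcgi -socket /var/run/ceph/radosgw.sock -idle-timeout 600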

I have restarted Apache, radosgw, and all of the Ceph OSD and ceph-mon
processes, and still no joy with this.

Can anyone advise on where I'm going wrong with this?

Which fastcgi module are you using? Can you provide a log with 'debug
ms = 1' for a failing request? Usually that kind of message means that
it's waiting for the OSD to respond, which might point at an
unhealthy cluster.
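
(If you're not sure which module build is in use, something like this should show what is loaded and installed; the package names assume the Debian/Ubuntu packaging:)

    apache2ctl -M | grep -i fastcgi
    dpkg -l apache2 libapache2-mod-fastcgi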

Yehuda



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
