RE: Issue on RGW 500 error: flush_read_list(): d->client_c->handle_data() returned -5


 



Jens, Ivan and Kyle, thanks for the comments!

We do have HAProxy sitting in front of two RGW servers. We did a quick test without HAProxy, with the clients talking to only one RGW directly, and the issue still exists.

Actually, from the log it seems the request is not that big (len=20135).

This is a flash-based cluster, so all the bucket indexes should be on an SSD root already.

So it looks like the EC policy we are using is a suspected cause. We will try to debug more on the OSD side.
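Per Jens's suggestion I'll probably start with something like the commands below (the rgw section name and pool names here are only examples and depend on our setup):

    # ceph.conf on the RGW host, then restart radosgw
    [client.rgw.gateway]
        debug ms = 1

    # double-check where the bucket index pool actually lives
    $ ceph osd pool get default.rgw.buckets.index crush_ruleset
    $ ceph osd crush rule dump

    # inspect the EC profile used by the data pool
    $ ceph osd erasure-code-profile ls
    $ ceph osd erasure-code-profile get default

    # look for read errors on the OSDs around the failure time
    $ grep -iE 'error|eio' /var/log/ceph/ceph-osd.*.log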


Thanks, -yuan  

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Kyle Bader
Sent: Thursday, July 20, 2017 3:35 PM
To: yuxiang fang <abcdeffyx@xxxxxxxxx>
Cc: Jens Harbott <j.rosenboom@xxxxxxxx>; Zhou, Yuan <yuan.zhou@xxxxxxxxx>; Ceph Development <ceph-devel@xxxxxxxxxxxxxxx>
Subject: Re: Issue on RGW 500 error: flush_read_list(): d->client_c->handle_data() returned -5

Check your nginx/haproxy logs if one of those is in place; they can be a source of timeouts, as was previously mentioned.
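If haproxy is in the path, its client/server timeouts are worth a quick look; values here are only illustrative:

    defaults
        timeout connect 10s
        timeout client  5m
        timeout server  5m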

If the transfer is large, you may also want to check TCP timeouts at the kernel level on both ends.
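On Linux hosts that usually means the keepalive/retry sysctls on both the client and the RGW side, e.g.:

    $ sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
    $ sysctl net.ipv4.tcp_retries2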

Are your OSDs spinning disks or SSDs? If the former, do you have your index pools on SSD by way of a custom CRUSH rule pointing to an all-SSD root?
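On Jewel that would look roughly like the following (rule and root names are just placeholders, and the pool name assumes the default zone):

    rule rgw-index-ssd {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take ssd-root
        step chooseleaf firstn 0 type host
        step emit
    }

    $ ceph osd pool set default.rgw.buckets.index crush_ruleset 1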

On Wed, Jul 19, 2017 at 7:39 PM, yuxiang fang <abcdeffyx@xxxxxxxxx> wrote:
> Hi Zhou Yuan,
>
> Getting a large object with a high compression rate can cause EIO in Jewel,
> but it seems that you didn't enable compression.
>
> You are right, there is something wrong with the connection between
> rgw and the client. Is there any middleware between the real client
> and rgw? (e.g. nginx, which may cause this kind of error when
> downloading a big object)
>
> thanks
> ivan from eisoo
>
>
> On Wed, Jul 19, 2017 at 11:48 PM, Jens Harbott <j.rosenboom@xxxxxxxx> wrote:
>> 2017-07-19 14:45 GMT+00:00 Zhou, Yuan <yuan.zhou@xxxxxxxxx>:
>>> Hello,
>>>
>>> Trying to do some tests with big data + S3a + RGW here and ran into a weird 500 error. I checked the log and found the error was due to:
>>>
>>> ERROR: flush_read_list(): d->client_c->handle_data() returned -5
>>>
>>> After checking the code, this function seems to be trying to send data back to the client, but something goes wrong. Was this because the connection between the client and RGW was closed before the data transfer finished?
>>>
>>> Is there a way to do some debug on this?
>>>
>>> I'm testing with Ceph 10.2.7; the data pool is using EC (3+2). Any comments would be appreciated!
>>
>> Error 5 is EIO, so it seems like your OSDs couldn't read some of the
>> data properly. You could either check all of your OSD logs from that
>> time or add "debug ms=1" to your RGW config and rerun your test;
>> you will then have more information about the OSD transactions in
>> your radosgw.log.
>>
>> In addition, radosgw is known to be bad at giving feedback to the
>> client about such errors, see http://tracker.ceph.com/issues/20166.



-- 

Kyle Bader



