Re: Radosgw (civetweb) hangs once around 850 established connections

Hi Ben!

I'm using Ubuntu 14.04.
I have restarted the gateways with the num_threads line you suggested. I hope this helps. I would have expected some kind of throttling message in the logs if a limit like that were being hit, though.

The 500s seem really strange as well. Do you have a thread for that? RGW still has a weird race condition with multipart uploads where it garbage-collects the parts, but I think I get a 404 for those, which makes sense. I hope you're not seeing something similar.

Thanks for the tip and good luck! I'll bump this thread when it happens again. 

Sent from my pocket typo cannon. 

On March 16, 2016 8:30:46 PM Ben Hines <bhines@xxxxxxxxx> wrote:

What OS are you using?

I have a lot more open connections than that (though I have some other issues: rgw sometimes returns 500 errors, but it doesn't stop responding like yours).

You might try tuning civetweb's num_threads and 'rgw num rados handles':

rgw frontends = civetweb num_threads=125 error_log_file=/var/log/radosgw/civetweb.error.log access_log_file=/var/log/radosgw/civetweb.access.log
rgw num rados handles = 32

You can also raise civetweb's log level:

debug civetweb = 20

-Ben

On Wed, Mar 16, 2016 at 5:03 PM, seapasulli@xxxxxxxxxxxx <seapasulli@xxxxxxxxxxxx> wrote:
I have a cluster of around 630 OSDs with 3 dedicated monitors and 2 dedicated gateways. The entire cluster is running hammer (0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)).

Both of my gateways have stopped responding to curl right now:
root@host:~# timeout 5 curl localhost ; echo $?
124
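
The exit codes that show up in this thread separate two failure modes: timeout(1) exits with 124 when it has to kill the command, while curl itself exits with 7 when the connection is refused (as in the "Connection refused" output further down). A minimal sketch of a probe that tells these apart, assuming the gateway answers on http://localhost/ as in the commands here. The function names are made up for illustration:

```shell
# Map an exit status from `timeout N curl ...` to a rough diagnosis.
classify_status() {
    case "$1" in
        0)   echo "ok" ;;
        7)   echo "refused" ;;    # curl: failed to connect to host
        124) echo "hung" ;;       # timeout(1): command ran past the limit
        *)   echo "other($1)" ;;
    esac
}

# Probe a URL (default http://localhost/) with a time limit in seconds (default 5).
probe_rgw() {
    url="${1:-http://localhost/}"
    limit="${2:-5}"
    status=0
    timeout "$limit" curl -s -o /dev/null "$url" || status=$?
    classify_status "$status"
}
```

Running probe_rgw from cron or a monitoring check would at least record when a gateway moves from "ok" to "hung", which may help correlate the hang with connection counts.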

From here I checked and it looks like radosgw has over 1 million open files:
root@host:~# grep -i rados whatisopen.files.list | wc -l
1151753
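
A count taken from a saved listing can overstate descriptor usage if the tool that produced it lists entries per thread or includes memory-mapped files (how whatisopen.files.list was generated is not stated here). A hedged cross-check is to count the entries under /proc/<pid>/fd and compare them with the process's own "Max open files" limit. This assumes a Linux /proc filesystem, and fd_usage is a hypothetical helper name:

```shell
# Print the number of file descriptors a process holds and its soft limit.
fd_usage() {
    pid="$1"
    # One entry per open descriptor; needs root for another user's process.
    used=$(ls "/proc/$pid/fd" | wc -l | tr -d ' ')
    # Line format: "Max open files  <soft>  <hard>  files"
    limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    echo "pid=$pid open_fds=$used soft_limit=$limit"
}
```

Called with the PID of the radosgw process, this shows directly whether the descriptor count is anywhere near the limit, even when nothing about it appears in the log.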

And around 750 open connections:
root@host:~# netstat -planet | grep radosgw | wc -l
752
root@host:~# ss -tnlap | grep rados | wc -l
752
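
Counting every matching line lumps the listening socket and connections in all TCP states together. A small sketch that tallies the first column of ss output (the state) instead, so that ESTAB, CLOSE-WAIT and the rest can be told apart. tally_states is a made-up helper name:

```shell
# Read `ss -tan`-style lines on stdin and print a count per TCP state.
tally_states() {
    awk '{ count[$1]++ } END { for (s in count) printf "%s %d\n", s, count[s] }' | sort
}

# Usage (as root, so that -p can show process names):
#   ss -tanp | grep radosgw | tally_states
```

If most of the sockets turned out to be in CLOSE-WAIT rather than ESTAB, that would suggest the gateway is not closing connections that clients have already closed, which is a different problem from simply running out of worker threads.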

I don't think that the backend storage is hanging based on the following dump:

root@host:~# ceph daemon /var/run/ceph/ceph-client.rgw.kh11-9.asok objecter_requests | grep -i mtime
            "mtime": "0.000000",
            "mtime": "0.000000",
            "mtime": "0.000000",
            "mtime": "0.000000",
            "mtime": "0.000000",
            "mtime": "0.000000",
            [...]
            "mtime": "0.000000",

The radosgw log is still showing lots of activity, and so does strace, which makes me think this is a config issue or a limit of some kind that is not triggering a log message. Which limit, I am not sure: the log doesn't seem to show any open-file limit being hit, and I don't see any big errors in the logs.
(last 500 lines of /var/log/radosgw/client.radosgw.log)
http://pastebin.com/jmM1GFSA

Perf dump of radosgw
http://pastebin.com/rjfqkxzE

Radosgw objecter requests:
http://pastebin.com/skDJiyHb

After restarting the gateway with '/etc/init.d/radosgw restart', the old process remains, no error is reported, and I then get connection refused via curl or netcat:
root@kh11-9:~# curl localhost
curl: (7) Failed to connect to localhost port 80: Connection refused

Once I kill the old radosgw with SIGKILL, the new radosgw instance starts automatically and begins responding:
root@kh11-9:~# curl localhost
<?xml version="1.0" encoding="UTF-8"?><ListAllMyBucketsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>anonymous</ID><DisplayName></DisplayName></Owner><Buckets></Buckets></ListAllMyB
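
The manual workaround above (the restart leaves the old process behind, then SIGKILL) can be sketched as a generic "SIGTERM first, SIGKILL after a grace period" helper. This is only an illustration of that pattern, not something the init script is known to do. The function names are hypothetical and the liveness check assumes Linux /proc:

```shell
# Return 0 if the PID exists and is not a zombie.
is_running() {
    [ -r "/proc/$1/status" ] || return 1
    state=$(awk '/^State:/ { print $2 }' "/proc/$1/status" 2>/dev/null)
    [ -n "$state" ] && [ "$state" != "Z" ]
}

# Send SIGTERM, wait up to $2 seconds (default 10), then fall back to SIGKILL.
stop_with_escalation() {
    pid="$1"
    grace="${2:-10}"
    kill -TERM "$pid" 2>/dev/null
    waited=0
    while is_running "$pid" && [ "$waited" -lt "$grace" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    if is_running "$pid"; then
        echo "pid $pid still running after ${grace}s, sending SIGKILL" >&2
        kill -KILL "$pid" 2>/dev/null
    fi
    return 0
}
```

SIGKILL gives the process no chance to shut down cleanly, so the grace period is worth keeping generous; the stderr message also leaves a record of how often the escalation was actually needed.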

What is going on here?




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
