Re: Radosgw (civetweb) hangs once around 850 established connections

"seapasulli@xxxxxxxxxxxx" <seapasulli@xxxxxxxxxxxx> · Thu, 31 Mar 2016 01:02:32 -0500

Thanks Dan!

Thanks for this.  I didn't know /proc/<pid>/limits existed! Super useful!!

Here are my limits::
root@kh11-9:~# cat /proc/419990/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             515007               515007               processes
Max open files            1048576              1048576              files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       515007               515007               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

root@kh11-9:~# lsof -p 419990 | wc -l
600

root@kh11-9:~# ps -o nlwp 419990
NLWP
1251

root@kh11-9:~# ps -eo nlwp | tail -n +2 | awk '{ sum += $1 } END { print sum }'
1585

root@kh11-9:~# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 515007
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 515007
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

root@kh11-9:~# ls /proc/419990/fd/ | wc -l
536
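
For anyone who wants to script the comparison above (limits vs. FDs and
threads in use), here is a minimal Python sketch. It is not part of any Ceph
tooling; the function names are made up for illustration, and it assumes the
columns in /proc/<pid>/limits are separated by at least two spaces, as in the
listing above.

```python
import os
import re


def parse_limits(text):
    """Parse the text of /proc/<pid>/limits into {name: (soft, hard)}.

    "unlimited" is mapped to None. Assumes columns are separated by runs
    of two or more spaces (limit names contain single spaces only).
    """
    def to_int(value):
        return None if value == "unlimited" else int(value)

    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) >= 3:
            limits[parts[0]] = (to_int(parts[1]), to_int(parts[2]))
    return limits


def current_usage(pid):
    """Count open file descriptors and threads for a PID via /proc."""
    fds = len(os.listdir(f"/proc/{pid}/fd"))
    threads = len(os.listdir(f"/proc/{pid}/task"))
    return {"Max open files": fds, "Max processes": threads}


def headroom(limits, usage):
    """Return {limit name: fraction of the soft limit currently in use}."""
    report = {}
    for name, used in usage.items():
        soft = limits.get(name, (None, None))[0]
        # None means there is no usable soft limit (unlimited, missing, or 0).
        report[name] = None if not soft else used / soft
    return report


# Example (Linux only; replace "self" with the radosgw PID):
# with open("/proc/self/limits") as fh:
#     print(headroom(parse_limits(fh.read()), current_usage("self")))
```

With the figures from this thread (536 FDs against 1048576, 1251 threads
against 515007), both ratios come out well under 1 percent. Note that the
"Max processes" limit is counted per user rather than per process, so the
thread comparison is only a rough one.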

I think this is a system-wide config issue: even after I restart
radosgw, the problem doesn't go away entirely and seems to linger. I just
have no idea what it could be.

Before this behavior starts, I can nearly saturate my network link at
close to 10 Gbps. Once it starts, I cannot even wget a 100 MB bin file; it
ends up taking hours. Small wgets still complete, and I can curl a plain
<html><body>test</body></html> web page without any issue, though the
speed is greatly reduced.

The rest of the server seems to behave fine (sans the newly discovered
download issue).

On March 30, 2016 5:34:25 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:

Hi Sean,

Did you check whether the process is hitting some ulimits? Run cat
/proc/`pidof radosgw`/limits and compare with the number of
processes/FDs in use.

Cheers, Dan

On Tue, Mar 29, 2016 at 8:35 PM, seapasulli@xxxxxxxxxxxx
<seapasulli@xxxxxxxxxxxx> wrote:
So, an update for anyone else having this issue: it looks like radosgw either
has a memory leak or spools the whole object into RAM or something.

root@kh11-9:/etc/apt/sources.list.d# free -m
             total       used       free     shared    buffers     cached
Mem:         64397      63775        621          0          3         46
-/+ buffers/cache:      63725        671
Swap:        65499      17630      47869

root@kh11-9:/etc/apt/sources.list.d# ps faux | grep -iE "USE[R]|radosg[w]"
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      269910  134 95.2 90622120 62819128 ?   Ssl  12:31  79:37 /usr/bin/radosgw --cluster=ceph --id rgw.kh11-9 -f
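
To tell a steady leak from per-request buffering, it may help to sample the
process's resident and swapped memory over time. A minimal Python sketch,
assuming a Linux /proc/<pid>/status with VmRSS and VmSwap lines reported in
kB; the function names are hypothetical, not from any Ceph tool.

```python
import re


def parse_status_kib(text, fields=("VmRSS", "VmSwap")):
    """Extract selected kB-valued fields from /proc/<pid>/status text."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"(\w+):\s+(\d+)\s+kB", line)
        if match and match.group(1) in fields:
            values[match.group(1)] = int(match.group(2))
    return values


def rss_fraction(rss_kib, mem_total_mib):
    """Fraction of physical memory held by a process (RSS in KiB, RAM in MiB)."""
    return rss_kib / 1024 / mem_total_mib


# Example (Linux only; replace "self" with the radosgw PID and call it
# repeatedly, e.g. once a minute, while the wget requests are running):
# with open("/proc/self/status") as fh:
#     print(parse_status_kib(fh.read()))
```

Plugging in the numbers above, rss_fraction(62819128, 64397) is roughly
0.95, which agrees with the %MEM column in the ps output.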

The odd things are: 1) the disk is fine, and 2) the rest of the server seems
very responsive. I can ssh into the server without any problems, curl out,
wget, etc., but radosgw is stuck in the mud.

This happens after 150-300 wget requests to public objects; 2 radosgws freeze
like this.  The cluster reports HEALTH_OK as well::

root@kh11-9:~# grep -iE "health" ceph_report.json
    "health": {
        "health": {
            "health_services": [
                            "health": "HEALTH_OK"
                            "health": "HEALTH_OK"
                            "health": "HEALTH_OK"
                    "health": "HEALTH_OK"
                    "health": "HEALTH_OK"
                    "health": "HEALTH_OK"
        "overall_status": "HEALTH_OK",
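
The grep above shows the matching lines but not which part of the report each
one belongs to. A small Python sketch that walks the JSON and prints the path
of every match; it assumes ceph_report.json is valid JSON, and find_key is a
hypothetical helper, not a Ceph API.

```python
import json


def find_key(node, key, path=""):
    """Yield (path, value) for every scalar value stored under `key`."""
    if isinstance(node, dict):
        for name, child in node.items():
            child_path = f"{path}/{name}"
            if name == key and not isinstance(child, (dict, list)):
                yield child_path, child
            else:
                # Nested containers (including ones named `key`) are searched further.
                yield from find_key(child, key, child_path)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from find_key(child, key, f"{path}[{index}]")


# Example (assumes ceph_report.json holds valid JSON):
# with open("ceph_report.json") as fh:
#     for path, value in find_key(json.load(fh), "health"):
#         print(path, value)
```

Any entry whose value is not HEALTH_OK then shows up together with its
location in the report.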

Has anyone seen this behavior before?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
