ceph radosgw - 500 errors -- odd

I am sorry for posting this if it has been addressed already; I am not sure how to search old ceph-users mailing list posts. I used to use gmane.org, but that seems to be down.

My setup::

I have a moderate-size ceph cluster (hammer 0.94.9 - fe6d859066244b97b24f09d46552afc2071e6f90). The cluster nodes run Ubuntu, but the gateways run CentOS 7 because of an odd memory issue we hit across all of our gateways.

Outside of that, the cluster is pretty standard and healthy:

[root@kh11-9 ~]# ceph -s
    cluster XXX-XXX-XXX-XXX
     health HEALTH_OK
     monmap e4: 3 mons at {kh11-8=X.X.X.X:6789/0,kh12-8=X.X.X.X:6789/0,kh13-8=X.X.X.X:6789/0}
            election epoch 150, quorum 0,1,2 kh11-8,kh12-8,kh13-8
     osdmap e69678: 627 osds: 627 up, 627 in

Here is the radosgw section of my ceph.conf::

[client.rgw.kh09-10]
log_file = /var/log/radosgw/client.radosgw.log
rgw_frontends = "civetweb port=80 access_log_file=/var/log/radosgw/rgw.access  error_log_file=/var/log/radosgw/rgw.error"
rgw_enable_ops_log = true
rgw_ops_log_rados = true
rgw_thread_pool_size = 1000
rgw_override_bucket_index_max_shards = 23
error_log_file = /var/log/radosgw/civetweb.error.log
access_log_file = /var/log/radosgw/civetweb.access.log
objecter_inflight_op_bytes = 1073741824
objecter_inflight_ops = 20480
ms_dispatch_throttle_bytes = 209715200
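
For completeness, this is how I verify the running daemon actually picked these settings up. The socket path is my assumption based on the client name and the default run directory; adjust it if your admin_socket setting differs:

# Ask the live radosgw for its effective config over the admin socket.
ceph --admin-daemon /var/run/ceph/ceph-client.rgw.kh09-10.asok config show \
    | grep -E 'rgw_thread_pool_size|objecter_inflight|rgw_override'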


The gateways sit behind haproxy for SSL termination. Here is my haproxy config:

global
        log /dev/log    local0
        log /dev/log    local1 notice
        chroot /var/lib/haproxy
        stats socket /var/lib/haproxy/admin.sock mode 660 level admin
        stats timeout 30s
        user haproxy
        group haproxy
        daemon
        ca-base /etc/ssl/certs
        crt-base /etc/ssl/private
        tune.ssl.default-dh-param 2048
        tune.ssl.maxrecord 2048
        
        ssl-default-bind-ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256
        ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets
        ssl-default-server-ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256
        ssl-default-server-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets

        

defaults
        log     global
        mode    http
        option  httplog
        option  dontlognull
        timeout connect 5000
        timeout client  50000
        timeout server  50000
        errorfile 400 /etc/haproxy/errors/400.http
        errorfile 403 /etc/haproxy/errors/403.http
        errorfile 408 /etc/haproxy/errors/408.http
        errorfile 500 /etc/haproxy/errors/500.http
        errorfile 502 /etc/haproxy/errors/502.http
        errorfile 503 /etc/haproxy/errors/503.http
        errorfile 504 /etc/haproxy/errors/504.http
        option forwardfor
        option http-server-close

frontend fourfourthree
   bind :443 ssl crt /etc/ssl/STAR.opensciencedatacloud.org.pem
   reqadd X-Forwarded-Proto:\ https
   default_backend radosgw

backend radosgw
   cookie RADOSGWLB insert indirect nocache
   server primary 127.0.0.1:80 check cookie primary
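
One sanity check on this setup: with option httplog, haproxy logs the status code it relayed from the backend, so comparing its log against the civetweb access log shows whether the 500s are minted by radosgw/civetweb or by haproxy itself. The haproxy log path below is an assumption (the usual rsyslog local0 destination); the civetweb path matches access_log_file above:

# Rough comparison of 500s seen on each side of the proxy.
grep -c ' 500 ' /var/log/haproxy.log
grep -c '" 500 ' /var/log/radosgw/rgw.access
# Eyeball the most recent matches to line up timestamps.
grep ' 500 ' /var/log/haproxy.log | tail -5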


--------------------

I am seeing sporadic 500 errors in my access logs on all of my radosgws:

/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.635645 7feacf6c6700  0 RGWObjManifest::operator++(): result: ofs=12607029248 stripe_ofs=12607029248 part_ofs=12598640640 rule->part_size=15728640
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.637559 7feacf6c6700  0 RGWObjManifest::operator++(): result: ofs=12611223552 stripe_ofs=12611223552 part_ofs=12598640640 rule->part_size=15728640
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.642630 7feacf6c6700  0 RGWObjManifest::operator++(): result: ofs=12614369280 stripe_ofs=12614369280 part_ofs=12614369280 rule->part_size=15728640
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.644368 7feadf6e6700  1 ====== req done req=0x7fed00053a50 http_status=500 ======
/var/log/radosgw/client.radosgw.log:2017-01-13 11:30:41.644475 7feadf6e6700  1 civetweb: 0x7fed00009340: 10.64.0.124 - - [13/Jan/2017:11:28:24 -0600] "GET /BUCKET/306d4fe1-1515-44e0-b527-eee0e83412bf/306d4fe1-1515-44e0-b527-eee0e83412bf_gdc_realn_rehead.bam HTTP/1.1" 500 0 - Boto/2.36.0 Python/2.7.6 Linux/3.13.0-95-generic
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.645611 7feacf6c6700  0 RGWObjManifest::operator++(): result: ofs=12618563584 stripe_ofs=12618563584 part_ofs=12614369280 rule->part_size=15728640
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.647998 7feacf6c6700  0 RGWObjManifest::operator++(): result: ofs=12622757888 stripe_ofs=12622757888 part_ofs=12614369280 rule->part_size=15728640
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.650262 7feacf6c6700  0 RGWObjManifest::operator++(): result: ofs=12626952192 stripe_ofs=12626952192 part_ofs=12614369280 rule->part_size=15728640
/var/log/radosgw/client.radosgw.log-2017-01-13 11:30:41.656394 7feacf6c6700  0 RGWObjManifest::operator++(): result: ofs=12630097920 stripe_ofs=12630097920 part_ofs=12630097920 rule->part_size=15728640
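
To catch the next failure with more context, I plan to temporarily raise the rgw log level on one gateway and then drop it back; something like this should work on hammer (same assumed socket path as above):

# Temporarily crank up rgw (and messenger) verbosity via the admin socket.
ceph --admin-daemon /var/run/ceph/ceph-client.rgw.kh09-10.asok config set debug_rgw 20
ceph --admin-daemon /var/run/ceph/ceph-client.rgw.kh09-10.asok config set debug_ms 1
# ...wait for another 500, then restore the defaults:
ceph --admin-daemon /var/run/ceph/ceph-client.rgw.kh09-10.asok config set debug_rgw 1
ceph --admin-daemon /var/run/ceph/ceph-client.rgw.kh09-10.asok config set debug_ms 0
# Pull the lines leading up to the failing request out of the log.
grep -B 40 'http_status=500' /var/log/radosgw/client.radosgw.log | tail -60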


I am able to download that file just fine locally using boto, but I have heard from some users that the download occasionally hangs indefinitely. As far as I can tell the cluster has been healthy the whole time (Graphite shows HEALTH_OK throughout). Obviously rgw is throwing a 500, which to me means an underlying issue with ceph or the rgw server, yet all of my own boto downloads complete, so I am not sure what is wrong or how this is happening. One thing I do notice in the civetweb line above is that the failing GET started at 11:28:24 and was logged done at 11:30:41, well past my 50-second haproxy server timeout, though I have not confirmed the two are related. Is there anything I can do to figure out where the 500 is coming from and troubleshoot further?
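
In case it helps, here is a quick client-side loop I can run to try to catch one of the sporadic 500s in the act. PRESIGNED is a placeholder for a presigned GET URL for the object (generated with boto, for example); pointing it at port 80 bypasses haproxy, while port 443 goes through it:

# Fetch the object once a second and record only timestamp + HTTP status.
# PRESIGNED is a hypothetical variable holding a presigned GET URL;
# -k is only needed when testing through the TLS side on 443.
while sleep 1; do
    code=$(curl -sk -o /dev/null -w '%{http_code}' "$PRESIGNED")
    echo "$(date '+%F %T') $code"
done | tee curl_status.log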

--
- Sean:  I wrote this. - 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
