I found a couple OSDs that were seeing medium errors and marked them out of the cluster. Once all the PGs were moved off those OSDs all the buffer overflows went away. So there must be some kind of bug that's being triggered when an OSD is misbehaving.

Bryan

From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Bryan Stillwell <bstillwell@xxxxxxxxxxx>
Date: Friday, September 8, 2017 at 9:26 AM
To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: radosgw crashing after buffer overflows detected

For about a week we've been seeing a decent number of buffer overflows detected across all our RGW nodes in one of our clusters. This started happening a day after we started weighing in some new OSD nodes, so we're thinking it's probably related to that. Could someone help us determine the root cause of this?

Cluster details:
Distro: CentOS 7.2
Release: 0.94.10-0.el7.x86_64
OSDs: 1120
RGW nodes: 10

See log messages below. If you know how to improve the call trace below I would like to hear that too. I tried installing the ceph-debuginfo-0.94.10-0.el7.x86_64 package, but that didn't seem to help.

Thanks,
Bryan

# From /var/log/messages:

Sep 7 20:06:11 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated
Sep 7 21:01:55 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated
Sep 7 21:37:00 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated
Sep 7 23:14:54 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated
Sep 7 23:17:08 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated
Sep 8 00:12:39 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated
Sep 8 07:04:07 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated
Sep 8 07:17:49 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated
Sep 8 07:41:39 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated
Sep 8 07:59:29 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated

# From /var/log/ceph/client.radosgw.p3cephrgw003.log:

     0> 2017-09-08 07:59:29.696615 7f7b296a2700 -1 *** Caught signal (Aborted) **
 in thread 7f7b296a2700

 ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
 1: /bin/radosgw() [0x6d3d92]
 2: (()+0xf100) [0x7f7f425e9100]
 3: (gsignal()+0x37) [0x7f7f4141d5f7]
 4: (abort()+0x148) [0x7f7f4141ece8]
 5: (()+0x75317) [0x7f7f4145d317]
 6: (__fortify_fail()+0x37) [0x7f7f414f5ac7]
 7: (()+0x10bc80) [0x7f7f414f3c80]
 8: (()+0x10da37) [0x7f7f414f5a37]
 9: (OS_Accept()+0xc1) [0x7f7f435bd8b1]
 10: (FCGX_Accept_r()+0x9c) [0x7f7f435bb91c]
 11: (RGWFCGXProcess::run()+0x7bf) [0x58136f]
 12: (RGWProcessControlThread::entry()+0xe) [0x5821fe]
 13: (()+0x7dc5) [0x7f7f425e1dc5]
 14: (clone()+0x6d) [0x7f7f414de21d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

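For anyone who wants to line these crashes up against OSD events (for example, when the failing OSDs were marked out), here is a minimal sketch, not part of the original thread, that tallies the "buffer overflow detected" syslog entries per host and hour. It assumes lines in the same format as the /var/log/messages excerpt above. The names `LINE_RE` and `count_overflows_per_hour` are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Matches syslog lines of the form shown in the excerpt above, e.g.
# "Sep 7 20:06:11 p3cephrgw003 radosgw: *** buffer overflow detected ***: ..."
# (assumption: month abbreviation, day, HH:MM:SS, host, then the radosgw tag)
LINE_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<hour>\d{2}):\d{2}:\d{2}\s+(?P<host>\S+)\s+"
    r"radosgw: \*\*\* buffer overflow detected \*\*\*"
)


def count_overflows_per_hour(lines):
    """Return a Counter keyed by (host, month, day, hour) for matching lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a radosgw buffer-overflow line
        key = (
            match.group("host"),
            match.group("month"),
            int(match.group("day")),
            int(match.group("hour")),
        )
        counts[key] += 1
    return counts
```

Passing it the lines of /var/log/messages from each RGW node (for instance via `open(path)`) gives per-hour counts that can be compared with the times OSDs were weighted in or marked out.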
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com