-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

We set the debugging to 0/0, but are you talking about lines like:

   -12> 2015-11-20 20:59:47.138746 7f70067de700 -1 osd.177 103793 heartbeat_check: no reply from osd.133 since back 2015-11-20 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20 20:59:27.138720)
   -11> 2015-11-20 20:59:47.138749 7f70067de700 -1 osd.177 103793 heartbeat_check: no reply from osd.136 since back 2015-11-20 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20 20:59:27.138720)
   -10> 2015-11-20 20:59:47.138751 7f70067de700 -1 osd.177 103793 heartbeat_check: no reply from osd.139 since back 2015-11-20 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20 20:59:27.138720)
    -9> 2015-11-20 20:59:47.138758 7f70067de700 -1 osd.177 103793 heartbeat_check: no reply from osd.147 since back 2015-11-20 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20 20:59:27.138720)
    -8> 2015-11-20 20:59:47.138761 7f70067de700 -1 osd.177 103793 heartbeat_check: no reply from osd.159 since back 2015-11-20 20:58:51.427880 front 2015-11-20 20:58:51.427880 (cutoff 2015-11-20 20:59:27.138720)
    -7> 2015-11-20 20:59:47.138789 7f70067de700 -1 osd.177 103793 heartbeat_check: no reply from osd.170 since back 2015-11-20 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20 20:59:27.138720)
    -6> 2015-11-20 20:59:47.138794 7f70067de700 -1 osd.177 103793 heartbeat_check: no reply from osd.175 since back 2015-11-20 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20 20:59:27.138720)

There are 10,000 of those lines in the OSD log which shows all the
logs up to the crash. Unless setting the value to 0/0 is eliminating
what you are looking for. I've been wondering if setting it to 0/1 or
0/5 or even 0/20 has any runtime performance penalty? It seems like
more detailed info on crashes would be helpful, but we don't want to
write too much to the SATADOMs.

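With that many heartbeat_check lines, a tally per peer OSD may be easier to read than the raw dump. The following is only a rough sketch: the regular expression is an assumption based on the line format quoted above, and the function name summarize_heartbeat_failures is made up for illustration (it is not part of Ceph).

```python
import re
from collections import Counter
from datetime import datetime

# Assumed format, taken from the heartbeat_check lines quoted above:
#   "... osd.177 103793 heartbeat_check: no reply from osd.133 since back <timestamp> ..."
HEARTBEAT_RE = re.compile(
    r"(?P<reporter>osd\.\d+) \d+ heartbeat_check: no reply from "
    r"(?P<peer>osd\.\d+) since back "
    r"(?P<back>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
)


def summarize_heartbeat_failures(lines):
    """Count 'no reply' lines per peer OSD and keep the latest 'back' time seen.

    Returns a tuple (counts, last_back) where counts is a Counter keyed by
    peer OSD name and last_back maps each peer to a datetime.
    """
    counts = Counter()
    last_back = {}
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match is None:
            continue  # not a heartbeat_check line; ignore it
        peer = match.group("peer")
        back = datetime.strptime(match.group("back"), "%Y-%m-%d %H:%M:%S.%f")
        counts[peer] += 1
        if peer not in last_back or back > last_back[peer]:
            last_back[peer] = back
    return counts, last_back
```

Passing it an open log file (for example `summarize_heartbeat_failures(open(path))`, where `path` is wherever the OSD log lives) would show which peers stopped answering and when each was last heard from, which might make it easier to see whether the silent OSDs share a host or a link.
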
We do have the NICs bonded all across our environment.
- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1


On Mon, Nov 23, 2015 at 11:14 AM, Gregory Farnum wrote:
> On Mon, Nov 23, 2015 at 12:03 PM, Robert LeBlanc wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> This is one of our production clusters which is dual 40 Gb Ethernet
>> using VLANs for cluster and public networks. I don't think this is
>> unusual, not like my dev cluster which runs Infiniband and IPoIB. The
>> client nodes are connected at 10 Gb Ethernet.
>>
>> I wonder if you are talking about the system logs, not the Ceph OSD
>> logs. I'm attaching a snippet that includes the hour before and after.
>
> Nope, I meant the OSD logs. Whenever they crash, it should dump out
> the last 10000 in-memory log entries; the one you sent along didn't
> have a crash included at all. The exact system which timed out will
> certainly be in those log entries (it's output at level 1, so unless
> you manually turned everything to 0, it'll show up on a crash.)
>
> Anyway, I wouldn't expect that cluster config to have any issues with
> a client dying since it's TCP over ethernet, but I have seen some
> weird behaviors out of bonded NICs when one of them dies, so maybe.
> -Greg
>
>> - ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWU2LkCRDmVDuy+mK58QAA2EUP/22eOBNzAYDV5lGI4J9Z
wnSZE39UycEfo8e6v8cfikLdAUT7fbY8HBq+VPylLo7OtxA+sGwgjrcz3hzu
azRi9QuCeWNm+squPQpgISzXWnpDtSjlsA+7iQb+HJGW7/kcR+opixzMX/W5
AE0Z/hrRwImw3r7Ze3Avl/j+l7iamUznfZAnaBdeWyle7Nge/D8kV+QJSeHe
/zXDoWW8wPNiRwU/puJrH/GEzyYVZFZ4F9aPUKf9rXsp0chK5k55yysI8ABL
CfBLtZ1yXPbD20knMdEyuQrDXWMGQplQ+7Z2qFAKsbp+qMFGNqeIbtA6xmbM
+8RIXT5hTLmgH6lVLYFbk6wgiSphxTVFrkR4Bm6NzFHnloxZ3KuU1pqOZf2k
iJZ8eDPfUxuforHO2L8TWMDWAsrqTm5A2u0GFtvm7uPWvxWo6sv08sq5IICD
C75mnCRUIDGl/bQLxt06qvq7WwAtezwnNcwCth3kDFFS85WTgZGEtPgpFizt
IpBQI4ustiT6lNmYQr6V2cj4HT1G8YBT1ykKwSYmsbRnT2PWGQc7IJ11DxgC
E7i0c6UYcOMpWT18t+RTOzvv8AZGpna2X/xTJSPL2H10zIkiuXAwO/gZQ5oa
mgN/3fdhcki8q7uWbZaBCNtv814sZIoTzQy7C7kApQdxFu+kbe5LHRhHZJbZ
CExf
=cjG0
-----END PGP SIGNATURE-----
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html