On Mon, 23 Nov 2015, Robert LeBlanc wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> We set the debugging to 0/0, but are you talking about lines like:
>
> -12> 2015-11-20 20:59:47.138746 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.133 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
> -11> 2015-11-20 20:59:47.138749 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.136 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
> -10> 2015-11-20 20:59:47.138751 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.139 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
> -9> 2015-11-20 20:59:47.138758 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.147 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
> -8> 2015-11-20 20:59:47.138761 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.159 since back 2015-11-20
> 20:58:51.427880 front 2015-11-20 20:58:51.427880 (cutoff 2015-11-20
> 20:59:27.138720)
> -7> 2015-11-20 20:59:47.138789 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.170 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
> -6> 2015-11-20 20:59:47.138794 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.175 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
>
> There are 10,000 of those lines in the OSD log which shows all the
> logs up to the crash. Unless setting the value to 0/0 is eliminating
> what you are looking for. I've been wondering if setting it to 0/1 or
> 0/5 or even 0/20 has any runtime performance penalty?
> It seems like
> more detailed info on crashes would be helpful, but we don't want to
> write too much to the SATADOMs.

There is a performance impact but no disk IO (logs are accumulated in
memory and only flushed out on a crash).

sage

>
> We do have the NICs bonded all across our environment.
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> On Mon, Nov 23, 2015 at 11:14 AM, Gregory Farnum wrote:
> > On Mon, Nov 23, 2015 at 12:03 PM, Robert LeBlanc wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA256
> >>
> >> This is one of our production clusters which is dual 40 Gb Ethernet
> >> using VLANs for cluster and public networks. I don't think this is
> >> unusual, not like my dev cluster which runs Infiniband and IPoIB. The
> >> client nodes are connected at 10 Gb Ethernet.
> >>
> >> I wonder if you are talking about the system logs, not the Ceph OSD
> >> logs. I'm attaching a snippet that includes the hour before and after.
> >
> > Nope, I meant the OSD logs. Whenever they crash, it should dump out
> > the last 10000 in-memory log entries -- the one you sent along didn't
> > have a crash included at all. The exact system which timed out will
> > certainly be in those log entries (it's output at level 1, so unless
> > you manually turned everything to 0, it'll show up on a crash.)
> >
> > Anyway, I wouldn't expect that cluster config to have any issues with
> > a client dying since it's TCP over ethernet, but I have seen some
> > weird behaviors out of bonded NICs when one of them dies, so maybe.
> > -Greg
> >
> >> - ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWU2LkCRDmVDuy+mK58QAA2EUP/22eOBNzAYDV5lGI4J9Z
> wnSZE39UycEfo8e6v8cfikLdAUT7fbY8HBq+VPylLo7OtxA+sGwgjrcz3hzu
> azRi9QuCeWNm+squPQpgISzXWnpDtSjlsA+7iQb+HJGW7/kcR+opixzMX/W5
> AE0Z/hrRwImw3r7Ze3Avl/j+l7iamUznfZAnaBdeWyle7Nge/D8kV+QJSeHe
> /zXDoWW8wPNiRwU/puJrH/GEzyYVZFZ4F9aPUKf9rXsp0chK5k55yysI8ABL
> CfBLtZ1yXPbD20knMdEyuQrDXWMGQplQ+7Z2qFAKsbp+qMFGNqeIbtA6xmbM
> +8RIXT5hTLmgH6lVLYFbk6wgiSphxTVFrkR4Bm6NzFHnloxZ3KuU1pqOZf2k
> iJZ8eDPfUxuforHO2L8TWMDWAsrqTm5A2u0GFtvm7uPWvxWo6sv08sq5IICD
> C75mnCRUIDGl/bQLxt06qvq7WwAtezwnNcwCth3kDFFS85WTgZGEtPgpFizt
> IpBQI4ustiT6lNmYQr6V2cj4HT1G8YBT1ykKwSYmsbRnT2PWGQc7IJ11DxgC
> E7i0c6UYcOMpWT18t+RTOzvv8AZGpna2X/xTJSPL2H10zIkiuXAwO/gZQ5oa
> mgN/3fdhcki8q7uWbZaBCNtv814sZIoTzQy7C7kApQdxFu+kbe5LHRhHZJbZ
> CExf
> =cjG0
> -----END PGP SIGNATURE-----
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
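
For anyone reading this thread later: the point above, that log entries are
accumulated in memory and only flushed out on a crash, can be pictured with a
toy model. This is only a rough sketch under my own assumptions, not Ceph's
actual code. The class name CrashBufferedLog, its methods, and the "written"
list standing in for the on-disk log are all made up for illustration. A
setting such as 0/5 is modelled as log_level=0, mem_level=5.

```python
from collections import deque


class CrashBufferedLog:
    """Toy logger with two thresholds, written "log_level/mem_level".

    Hypothetical illustration only; this is not Ceph's implementation.
    """

    def __init__(self, log_level, mem_level, max_recent=10000):
        self.log_level = log_level
        self.mem_level = mem_level
        # Bounded buffer: once full, the oldest entry is discarded.
        self.recent = deque(maxlen=max_recent)
        # Stands in for the log file on disk.
        self.written = []

    def log(self, level, message):
        # Entries above both thresholds are dropped outright.
        if level > max(self.log_level, self.mem_level):
            return
        self.recent.append(message)
        # Only entries at or below log_level cause a write right away.
        if level <= self.log_level:
            self.written.append(message)

    def dump_recent(self):
        # Meant to be called from a crash handler: flush the buffer "to disk".
        self.written.extend(self.recent)
        self.recent.clear()


quiet = CrashBufferedLog(log_level=0, mem_level=0)
quiet.log(1, "heartbeat_check: no reply")    # level 1 > 0: dropped

verbose = CrashBufferedLog(log_level=0, mem_level=5)
verbose.log(1, "heartbeat_check: no reply")  # kept in memory, not written
print(len(quiet.recent), len(verbose.recent), verbose.written)  # 0 1 []
verbose.dump_recent()
print(verbose.written)  # ['heartbeat_check: no reply']
```

In this model, raising only the second number costs some CPU and memory on
every log call but adds no writes until dump_recent() runs, which matches the
trade-off described in the reply.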