Re: Multiple OSDs suicide because of client issues?

On Mon, 23 Nov 2015, Robert LeBlanc wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> We set the debugging to 0/0, but are you talking about lines like:
> 
>    -12> 2015-11-20 20:59:47.138746 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.133 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
>    -11> 2015-11-20 20:59:47.138749 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.136 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
>    -10> 2015-11-20 20:59:47.138751 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.139 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
>     -9> 2015-11-20 20:59:47.138758 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.147 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
>     -8> 2015-11-20 20:59:47.138761 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.159 since back 2015-11-20
> 20:58:51.427880 front 2015-11-20 20:58:51.427880 (cutoff 2015-11-20
> 20:59:27.138720)
>     -7> 2015-11-20 20:59:47.138789 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.170 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
>     -6> 2015-11-20 20:59:47.138794 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.175 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
> 
> There are 10,000 of those lines in the OSD log, which shows all the
> log entries up to the crash, unless setting the value to 0/0 is
> eliminating what you are looking for. I've been wondering whether
> setting it to 0/1, 0/5, or even 0/20 has any runtime performance
> penalty. It seems like more detailed info on crashes would be helpful,
> but we don't want to write too much to the SATADOMs.

There is a performance impact but no disk IO (logs are accumulated in 
memory and only flushed out on a crash).

sage
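
To illustrate the point about a setting such as 0/5, here is a minimal
Python sketch of a logger with separate "log" and "memory" levels. It is
a toy model, not Ceph's implementation: the class name RingLog, its
methods, and the list standing in for the log file are all hypothetical.
The 10000-entry limit is taken from Greg's description of the crash dump.

```python
from collections import deque


class RingLog:
    """Toy model of a 'log/memory' debug setting such as 0/5 (not Ceph's code)."""

    def __init__(self, log_level, memory_level, max_recent=10000):
        self.log_level = log_level        # entries at or below this are written at once
        self.memory_level = memory_level  # entries at or below this are kept in memory
        self.recent = deque(maxlen=max_recent)  # oldest entries drop off when full
        self.disk = []                    # hypothetical stand-in for the log file

    def log(self, level, message):
        if level <= self.log_level:
            self.disk.append(message)
        if level <= self.memory_level:
            self.recent.append((level, message))

    def dump_recent(self):
        """Write out the in-memory entries, as would happen on a crash."""
        count = len(self.recent)
        for offset, (_level, message) in enumerate(self.recent):
            # Number entries with negative offsets, loosely like the quoted dump.
            self.disk.append(f"{offset - count:>6}> {message}")
        self.recent.clear()
```

In this model, RingLog(0, 5) writes nothing for a level-1 message until
dump_recent() is called, so the cost is the work of building and holding
the entries rather than disk IO. RingLog(0, 0) discards the same message
entirely, so it cannot show up in a crash dump.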



> 
> We do have the NICs bonded all across our environment.
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Mon, Nov 23, 2015 at 11:14 AM, Gregory Farnum wrote:
> > On Mon, Nov 23, 2015 at 12:03 PM, Robert LeBlanc wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA256
> >>
> >> This is one of our production clusters which is dual 40 Gb Ethernet
> >> using VLANs for cluster and public networks. I don't think this is
> >> unusual, not like my dev cluster which runs Infiniband and IPoIB. The
> >> client nodes are connected at 10 Gb Ethernet.
> >>
> >> I wonder if you are talking about the system logs, not the Ceph OSD
> >> logs. I'm attaching a snippet that includes the hour before and after.
> >
> > Nope, I meant the OSD logs. Whenever they crash, it should dump out
> > the last 10000 in-memory log entries; the one you sent along didn't
> > have a crash included at all. The exact system which timed out will
> > certainly be in those log entries (it's output at level 1, so unless
> > you manually turned everything to 0, it'll show up on a crash.)
> >
> > Anyway, I wouldn't expect that cluster config to have any issues with
> > a client dying since it's TCP over ethernet, but I have seen some
> > weird behaviors out of bonded NICs when one of them dies, so maybe.
> > -Greg
> >
> >> - ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWU2LkCRDmVDuy+mK58QAA2EUP/22eOBNzAYDV5lGI4J9Z
> wnSZE39UycEfo8e6v8cfikLdAUT7fbY8HBq+VPylLo7OtxA+sGwgjrcz3hzu
> azRi9QuCeWNm+squPQpgISzXWnpDtSjlsA+7iQb+HJGW7/kcR+opixzMX/W5
> AE0Z/hrRwImw3r7Ze3Avl/j+l7iamUznfZAnaBdeWyle7Nge/D8kV+QJSeHe
> /zXDoWW8wPNiRwU/puJrH/GEzyYVZFZ4F9aPUKf9rXsp0chK5k55yysI8ABL
> CfBLtZ1yXPbD20knMdEyuQrDXWMGQplQ+7Z2qFAKsbp+qMFGNqeIbtA6xmbM
> +8RIXT5hTLmgH6lVLYFbk6wgiSphxTVFrkR4Bm6NzFHnloxZ3KuU1pqOZf2k
> iJZ8eDPfUxuforHO2L8TWMDWAsrqTm5A2u0GFtvm7uPWvxWo6sv08sq5IICD
> C75mnCRUIDGl/bQLxt06qvq7WwAtezwnNcwCth3kDFFS85WTgZGEtPgpFizt
> IpBQI4ustiT6lNmYQr6V2cj4HT1G8YBT1ykKwSYmsbRnT2PWGQc7IJ11DxgC
> E7i0c6UYcOMpWT18t+RTOzvv8AZGpna2X/xTJSPL2H10zIkiuXAwO/gZQ5oa
> mgN/3fdhcki8q7uWbZaBCNtv814sZIoTzQy7C7kApQdxFu+kbe5LHRhHZJbZ
> CExf
> =cjG0
> -----END PGP SIGNATURE-----
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 