Re: syslog broke my cluster

"Sergio A. de Carvalho Jr." <scarvalhojr@xxxxxxxxx> · Wed, 27 Jul 2016 12:05:17 +0100

I guess the point I was trying to make is that, ideally, Ceph would isolate its logging system so that a problem writing the logs wouldn't affect the operation of the core Ceph services.
In my case, all other services running on the machine (ssh, ntp, cron, etc.) are operating normally, even though the logs might not be getting pushed out to the central syslog servers.
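
To illustrate the kind of isolation I mean, here is a minimal Python sketch. It is not how Ceph's logging works; it only shows the general pattern of handing log records to a background thread through a bounded queue and dropping records when the sink cannot keep up. The names DroppingQueueHandler and make_isolated_logger are made up for this example.

```python
import logging
import logging.handlers
import queue


class DroppingQueueHandler(logging.handlers.QueueHandler):
    """Hand records to a background thread; drop them if the queue is full."""

    def __init__(self, q):
        super().__init__(q)
        self.dropped = 0  # number of records discarded because the sink was slow

    def enqueue(self, record):
        try:
            self.queue.put_nowait(record)
        except queue.Full:
            # Never block the calling (service) thread on a slow log sink.
            self.dropped += 1


def make_isolated_logger(name, sink, maxsize=1000):
    """Return (logger, handler, listener); `sink` is any logging.Handler."""
    q = queue.Queue(maxsize)
    handler = DroppingQueueHandler(q)
    # The listener thread is the only code that ever calls sink.emit().
    listener = logging.handlers.QueueListener(q, sink)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    listener.start()
    return logger, handler, listener
```

The sink could be, for example, logging.handlers.SysLogHandler(address="/dev/log"). If that sink stalls, calls such as logger.info(...) still return immediately, and handler.dropped counts what was lost. The trade-off is that messages are discarded rather than delayed, which may or may not be acceptable.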

On Wed, Jul 27, 2016 at 4:49 AM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
On Tue, Jul 26, 2016 at 03:48:33PM +0100, Sergio A. de Carvalho Jr. wrote:

> As per my previous messages on the list, I was having a strange problem in
> my test cluster (Hammer 0.94.6, CentOS 6.5) where my monitors were
> literally grinding to a halt, preventing them from ever reaching quorum and
> causing all sorts of problems. As it turned out, to my surprise, everything
> went back to normal as soon as I turned off syslog -- special thanks to
> Sean!
>
> The slowdown with syslog on was so severe that logs were being written with
> a timestamp that was several minutes (and eventually up to hours) behind
> the system clock. The logs from my 4 monitors can be seen in the links
> below:
>
> https://gist.github.com/anonymous/85213467f701c5a69c7fdb4e54bc7406
> https://gist.github.com/anonymous/f30a8903e701423825fd4d5aaa651e6a
> https://gist.github.com/anonymous/42a1856cc819de5b110d9f887e9859d2
> https://gist.github.com/anonymous/652bc41197e83a9d76cf5b2e6a211aa2
>
> I'm still trying to understand what is going on with my syslog servers, but
> I was wondering... is this a known/documented issue?

If it is, it would be known/documented by the syslog community, right?

> Luckily this was a test cluster, but I'm worried I could hit this on a
> production cluster any time soon, and I'm wondering how I could detect it
> before my support engineers lose their minds.

This does not appear to be a Ceph-specific issue and would likely affect any
daemon that logs to syslog, right?

One thing you could try is running strace against the MON to see which system
calls are taking a long time and extrapolate from there. The procedure would
be the same if things were being held up by a slow disk (for whatever reason),
a slow filesystem, etc. This is just a standard performance problem and not a
Ceph-specific issue.
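
As a rough sketch of that approach, assuming the trace was captured with something like "strace -f -T -p <mon pid> -o mon.strace" (the output file name is just a placeholder), the Python below picks out the calls whose elapsed time, which -T prints in angle brackets at the end of each line, exceeds a threshold. The function name slow_syscalls and the 0.5 second default are arbitrary choices for illustration.

```python
import re

# Matches a completed call as printed by `strace -T`, for example:
#   sendto(7, "...", 64, MSG_NOSIGNAL, NULL, 0) = 64 <2.501332>
# Leading pid or timestamp prefixes (from -f, -o, -tt) are skipped by search().
# Lines split into "<unfinished ...>" / "<... resumed>" pairs are ignored here.
LINE_RE = re.compile(r"(?P<name>\w+)\(.*\)\s+=\s+.*<(?P<secs>\d+\.\d+)>\s*$")


def slow_syscalls(lines, threshold=0.5):
    """Return (syscall, seconds, line) for calls that took >= threshold seconds."""
    slow = []
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue
        secs = float(m.group("secs"))
        if secs >= threshold:
            slow.append((m.group("name"), secs, line.rstrip()))
    return slow


def report(path, threshold=0.5):
    """Print the slow calls found in a saved strace output file, slowest first."""
    with open(path) as f:
        hits = slow_syscalls(f, threshold)
    for name, secs, line in sorted(hits, key=lambda h: h[1], reverse=True):
        print(f"{secs:10.6f}s  {name}  {line}")
```

Calling report("mon.strace") would then list the slowest calls first. If calls that write to the syslog socket dominate that list, that points at the logging path rather than at the disk or the network between monitors.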

> Thanks,
>
> Sergio
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Cheers,
Brad
