On Mon, 21 Nov 2011, Mark Kampe wrote:
> The bugs we most dread are situations that only happen rarely,
> and are only detected long after the damage has been done.
> Given the business we are in, we will face many of them.
> We apparently have such bugs open at this very moment.
>
> In most cases, the primary debugging tools one has are
> audit and diagnostic logs ... which WE do not have because
> they are too expensive (because they are synchronously
> written with C++ streams) to leave enabled all the time.
>
> I think it is a mistake to think of audit and diagnostic
> logs as a tool to be turned on when we have a problem to
> debug. There should be a basic level of logging that is
> always enabled (so we will have data after the first
> instance of the bug) ... which can be cranked up from
> verbose to bombastic when we find a problem that won't
> yield to more moderate interrogation:
>
>    (a) after the problem happens is too late to
>        start collecting data.
>
>    (b) these logs are gold mines of information for
>        a myriad of purposes we cannot yet even imagine.
>
> This can only be done if the logging mechanism is
> sufficiently inexpensive that we are not afraid to
> use it:
>
>      low execution overhead from the logging operations
>      reasonable memory costs for buffering
>      small enough on disk that we can keep them for months
>
> Not having such a mechanism is (if I correctly
> understand) already hurting us for internal debugging,
> and will quickly cripple us when we have customer
> (i.e. people who cannot diagnose problems for
> themselves) problems to debug.
>
> There are many tricks to make logging cheap, and the
> sizes acceptable. There are probably a dozen open-source
> implementations that already do what we need, and if they
> don't, something basic can be built in a two-digit number
> of hours. The real cost is not in the mechanism but in
> adapting existing code to use it. This cost can be
> mitigated by making the changes opportunistically ...
> one component at a time, as dictated by need/fear.
>
> But we cannot make that change-over until we have a
> mechanism. Because the greatest cost is not the
> mechanism but the change-over, we should give more
> than passing thought to what mechanism to choose ...
> so that the decision we make remains a good one for
> the next few years.
>
> This may be something that we need to do sooner
> rather than later.

Yep.  I see two main issues with the slowness of the current logs:

- all of the string rendering in the operator<<()'s is slow.  things
  like prefixing every line with a dump of the pg state are great for
  debugging, but make the overhead very high.  we could scale all of
  that back, but it'd be a big project.  (a level-check sketch is
  below)

- the logging always goes to a file, synchronously.  we could write
  to a ring buffer and either write it out only on crash, or (at the
  very least) write it async.  (a ring-buffer sketch is below)

I wonder, though, if something different might work.  gcc lets you
arbitrarily instrument function calls with -finstrument-functions.
Something that logs function calls and arguments to an in-memory ring
buffer and dumps that out on crash could potentially have a low
overhead (if we restrict it to certain code) and would give us lots
of insight into what happened leading up to the crash.  (an
instrumentation sketch is below)

sage
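For the rendering cost, the usual trick is to test the debug level
before any of the operator<<()'s run, so a disabled line costs a
single integer compare.  An untested sketch of that trick follows;
g_log_level, LDEBUG, and LogLine are illustrative names, not anything
in the tree:

// Untested sketch: skip all rendering unless the level is enabled.
#include <iostream>
#include <sstream>

static int g_log_level = 1;    // assumed global, set from the config

// Accumulates one message and emits it when the statement ends.
struct LogLine {
  std::ostringstream ss;
  ~LogLine() { std::cerr << ss.str() << "\n"; }
  template <typename T>
  LogLine& operator<<(const T &v) { ss << v; return *this; }
};

// Dangling-else trick: if the level test fails, the LogLine
// temporary (and every << after it) is never evaluated at all.
#define LDEBUG(lvl) \
  if ((lvl) > g_log_level) ; else LogLine()

// usage:   LDEBUG(10) << "pg " << 1 << " state dump";
// costs one integer compare when level 10 is disabled.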
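The ring buffer itself can be dirt simple.  A rough, untested sketch
under these assumptions: fixed-size records, one global buffer, and a
dump function wired into the existing assert/segfault handlers.  The
ringlog_* names and sizes are made up for illustration:

// Every log call drops a fixed-size record into a circular in-memory
// buffer; the buffer is only written out (oldest record first) when
// we crash.
#include <cstdint>
#include <cstdio>
#include <cstring>

static const uint64_t NSLOTS = 4096;   // ~512KB of history
static const size_t   SLOT   = 128;    // fixed-size records
static char slots[NSLOTS][SLOT];
static uint64_t head;                  // increases monotonically

void ringlog_write(const char *msg)
{
  uint64_t i = __sync_fetch_and_add(&head, 1);  // gcc atomic builtin
  strncpy(slots[i % NSLOTS], msg, SLOT - 1);
  slots[i % NSLOTS][SLOT - 1] = '\0';
}

// Call this from the crash handler.  (fopen/fprintf are not
// async-signal-safe; a hardened version would use write(2).)
void ringlog_dump(const char *path)
{
  FILE *f = fopen(path, "w");
  if (!f)
    return;
  uint64_t h = head;
  uint64_t start = h > NSLOTS ? h - NSLOTS : 0;
  for (uint64_t i = start; i < h; i++)
    fprintf(f, "%s\n", slots[i % NSLOTS]);
  fclose(f);
}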
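And a sketch of the -finstrument-functions hooks.  One caveat: the
stock gcc hooks only see the function and call-site addresses, not
the arguments, so capturing arguments would take extra work; the
addresses can at least be mapped back to names with addr2line after
the fact.  Again untested, with made-up buffer names:

// gcc emits calls to these two hooks on entry/exit of every function
// compiled with -finstrument-functions, e.g.:
//   g++ -finstrument-functions -c osd.cc
// The hooks themselves must not be instrumented, hence the attribute.
#include <cstdint>

#define NO_INSTRUMENT __attribute__((no_instrument_function))

static const unsigned TRACE_SLOTS = 1 << 16;
static void *trace_buf[TRACE_SLOTS][2];   // {function, call site}
static unsigned trace_head;

extern "C" {

void NO_INSTRUMENT
__cyg_profile_func_enter(void *func, void *callsite)
{
  unsigned i = __sync_fetch_and_add(&trace_head, 1);
  trace_buf[i % TRACE_SLOTS][0] = func;
  trace_buf[i % TRACE_SLOTS][1] = callsite;
}

void NO_INSTRUMENT
__cyg_profile_func_exit(void *func, void *callsite)
{
  // exits could be recorded too; omitted here to halve the overhead
  (void)func;
  (void)callsite;
}

}  // extern "C"

// On crash, walk trace_buf from the crash handler (or pull it out of
// the core) and resolve the addresses with addr2line -f.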