RE: [RFC PATCH] ceph: switch libceph on ratelimited messages

Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx> · Thu, 16 Jan 2025 01:38:40 +0000

Hi Ilya,

On Thu, 2025-01-16 at 00:21 +0100, Ilya Dryomov wrote:
> On Wed, Jan 15, 2025 at 9:53 PM Viacheslav Dubeyko
> <Slava.Dubeyko@xxxxxxx> wrote:
> > 
> > Hello,
> > 
> > The libceph subsystem can generate enourmous amount of
> > messages in the case of error. As a result, system log
> > can be unreasonably big because of such messaging
> > policy. This patch switches on ratelimited version of
> 
> Hi Slava,
> 
> Do you have an example (which is not caused by a programming error)?
> 

Frankly speaking, there is no stable ground for definition
what is the programming error. :) And if end-user can see
some messages in the system log, then it's not always clear
what is the reason of it (faulty hardware, wrong configuration,
network issue, or programming error).

Currently, I can see during running xfstests some sporadically
triggered issues (and I am going to investigate this). For example,
today I can reproduce it for generic/127 (but it passed successfully
multiple times before). The output of this issue is the infinite
sequence of messages in the system log:

Jan 15 16:39:06 ceph-testing-0001 kernel: [ 4345.164299] libceph: mon2
(2)127.0.0.1:40902 socket error on write
Jan 15 16:39:06 ceph-testing-0001 kernel: [ 4345.164321] libceph: mon1
(2)127.0.0.1:40900 socket error on write
Jan 15 16:39:06 ceph-testing-0001 kernel: [ 4345.668314] libceph: mon1
(2)127.0.0.1:40900 socket error on write
Jan 15 16:39:06 ceph-testing-0001 kernel: [ 4345.668337] libceph: mon2
(2)127.0.0.1:40902 socket error on write
Jan 15 16:39:07 ceph-testing-0001 kernel: [ 4346.660371] libceph: mon2
(2)127.0.0.1:40902 socket error on write

<skipped>

Jan 15 17:16:30 ceph-testing-0001 kernel: [ 6589.691303] libceph: mon2
(2)127.0.0.1:40902 socket error on write
Jan 15 17:16:31 ceph-testing-0001 kernel: [ 6590.907396] libceph: osd1
(2)127.0.0.1:6810 socket error on write
Jan 15 17:16:34 ceph-testing-0001 kernel: [ 6593.659370] libceph: mon2
(2)127.0.0.1:40902 socket error on write
Jan 15 17:16:37 ceph-testing-0001 kernel: [ 6597.051461] libceph: mon2
(2)127.0.0.1:40902 socket error on write

<continue to spam system log until the system restart>

> > pr_notice(), pr_info(), pr_warn(), and pr_err()
> > methods by means of introducing libceph_notice(),
> > libceph_info(), libceph_warn(), and libceph_err()
> > methods.
> 
> Some of libceph messages are already ratelimited and standard
> pr_*_ratelimited macros are used for that.  They are few apart, so
> if there is a particular message that is too spammy, switching it to
> a ratelimited version shouldn't be a problem, but we won't take
> a blanket conversion like this.
> 

Yes, I agree that even ratelimited version of messaging cannot
solve the problem of spamming the system log by info, warning, or
error messages. As far as I can see, we have infinite cycle in
libceph core library that generates this never ending sequence of
messages. I believe that it's not user-friendly behavior and
we need to rework it somehow. I still don't quite follow why libceph
core library's logic is trying to repeat the same action and
reports the error if we already failed. Could we rework it somehow?
I believe that we have some wrong logic in current implementation.

I am not going to insist on this patch. But, for example,
pr_err_ratelimited() is slightly long function name and
libceph_err() can be shorter name. Also, to have libceph_<message>
family of methods implemented in one place gives opportunity
of easy modification methods in one place, for example, with the goal
of adding more useful output. Finally, I believe (but I could be wrong)
to have the ratelimited version of messages could slightly prevent from
spamming the system log for cases that we cannot see right now, but
some end-user can reproduce it in the production.

Thanks,
Slava.