Hi Ilya,

On Thu, 2025-01-16 at 00:21 +0100, Ilya Dryomov wrote:
> On Wed, Jan 15, 2025 at 9:53 PM Viacheslav Dubeyko
> <Slava.Dubeyko@xxxxxxx> wrote:
> >
> > Hello,
> >
> > The libceph subsystem can generate enormous amount of
> > messages in the case of error. As a result, system log
> > can be unreasonably big because of such messaging
> > policy. This patch switches on ratelimited version of
>
> Hi Slava,
>
> Do you have an example (which is not caused by a programming error)?
>

Frankly speaking, there is no firm definition of what counts as
a programming error. :) And when an end user sees messages in the
system log, the cause is not always clear (faulty hardware, wrong
configuration, a network issue, or a programming error).

Currently, while running xfstests, I see some sporadically triggered
issues (and I am going to investigate them). For example, today I was
able to reproduce one with generic/127 (which had passed successfully
multiple times before). The result is an endless sequence of messages
in the system log:

Jan 15 16:39:06 ceph-testing-0001 kernel: [ 4345.164299] libceph: mon2 (2)127.0.0.1:40902 socket error on write
Jan 15 16:39:06 ceph-testing-0001 kernel: [ 4345.164321] libceph: mon1 (2)127.0.0.1:40900 socket error on write
Jan 15 16:39:06 ceph-testing-0001 kernel: [ 4345.668314] libceph: mon1 (2)127.0.0.1:40900 socket error on write
Jan 15 16:39:06 ceph-testing-0001 kernel: [ 4345.668337] libceph: mon2 (2)127.0.0.1:40902 socket error on write
Jan 15 16:39:07 ceph-testing-0001 kernel: [ 4346.660371] libceph: mon2 (2)127.0.0.1:40902 socket error on write

<skipped>

Jan 15 17:16:30 ceph-testing-0001 kernel: [ 6589.691303] libceph: mon2 (2)127.0.0.1:40902 socket error on write
Jan 15 17:16:31 ceph-testing-0001 kernel: [ 6590.907396] libceph: osd1 (2)127.0.0.1:6810 socket error on write
Jan 15 17:16:34 ceph-testing-0001 kernel: [ 6593.659370] libceph: mon2 (2)127.0.0.1:40902 socket error on write
Jan 15 17:16:37 ceph-testing-0001 kernel: [ 6597.051461] libceph: mon2 (2)127.0.0.1:40902 socket error on write

<continues to spam the system log until the system is restarted>

> > pr_notice(), pr_info(), pr_warn(), and pr_err()
> > methods by means of introducing libceph_notice(),
> > libceph_info(), libceph_warn(), and libceph_err()
> > methods.
>
> Some of libceph messages are already ratelimited and standard
> pr_*_ratelimited macros are used for that. They are few apart, so
> if there is a particular message that is too spammy, switching it to
> a ratelimited version shouldn't be a problem, but we won't take
> a blanket conversion like this.
>

Yes, I agree that even the ratelimited version of the messages cannot
solve the problem of spamming the system log with info, warning, or
error messages. As far as I can see, there is an infinite loop in the
libceph core library that generates this never-ending sequence of
messages. I believe this is not user-friendly behavior and we need to
rework it somehow. I still don't quite follow why the libceph core
logic keeps retrying the same action and reporting the error after it
has already failed. Could we rework it somehow? I believe there is
some wrong logic in the current implementation.

I am not going to insist on this patch. But, for example,
pr_err_ratelimited() is a rather long function name, and libceph_err()
is shorter. Also, implementing the libceph_<message> family of methods
in one place makes it easy to modify them later, for example, to add
more useful output. Finally, I believe (but I could be wrong) that
ratelimited messages could somewhat reduce system log spam in cases
that we cannot see right now but that some end user may hit in
production.

Thanks,
Slava.
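
P.S. To make the "one place" argument a bit more concrete, here is a
rough userspace sketch of what a ratelimited libceph_err()-style helper
does. This is not the code from the patch and not the kernel's
ratelimit implementation: struct rl_state, rl_allow(), and
libceph_err_demo() are hypothetical names, and time is passed in
explicitly (in seconds) so the behavior is easy to follow. It only
models the usual window/burst idea: allow up to "burst" messages per
"interval", and count the rest as missed.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace model of a ratelimit window. */
struct rl_state {
	long interval;	/* window length, in seconds */
	int burst;	/* messages allowed per window */
	long begin;	/* start time of the current window */
	int printed;	/* messages emitted in the current window */
	int missed;	/* messages suppressed in the current window */
};

/* Return true if a message may be emitted at time `now`. */
static bool rl_allow(struct rl_state *rs, long now)
{
	assert(rs != NULL);

	/* The window has expired: start a new one and reset counters. */
	if (now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	rs->missed++;
	return false;
}

/*
 * Hypothetical single entry point for error messages. Because every
 * caller goes through this function, the prefix or the limiting policy
 * can be changed in one place. Returns 1 if the message was emitted
 * and 0 if it was suppressed.
 */
static int libceph_err_demo(struct rl_state *rs, long now,
			    const char *fmt, ...)
{
	va_list ap;

	if (!rl_allow(rs, now))
		return 0;

	fputs("libceph: ", stderr);
	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	return 1;
}
```

With interval = 5 and burst = 10, a loop that reports the same socket
error 100 times within one window emits only the first 10 messages.
Of course, this only reduces the volume of the output; it does not fix
the retry loop that produces the messages in the first place.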