Re: [PATCH v2 2/7] btrfs: reduce the log level for btrfs_dev_stat_inc_and_print()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Mar 20, 2024 at 02:24:52PM +1030, Qu Wenruo wrote:
> Currently when we increase the device statistics, it would always lead
> to an error message in the kernel log.
> 
> However this output is mostly duplicated with the existing ones:
> 
> - For scrub operations
>   We always have the following messages:
>   * "fixed up error at logical %llu"
>   * "unable to fixup (regular) error at logical %llu"
> 
>   So no matter if the corruption is repaired or not, it scrub would
>   output an error message to indicate the problem.
> 
> - For non-scrub read operations
>   We also have the following messages:
>   * "csum failed root %lld inode %llu off %llu" for data csum mismatch
>   * "bad (tree block start|fsid|tree block level)" for metadata
>   * "read error corrected: ino %llu off %llu" for repaired data/metadata
> 
> So the error message from btrfs_dev_stat_inc_and_print() is duplicated.
> 
> The real usage for the btrfs device statistics is for some user space
> daemon to check if there is any new errors, acting like some checks on
> SMART, thus we don't really need/want those messages in dmesg.
> 
> This patch would reduce the log level to debug (disabled by default) for
> btrfs_dev_stat_inc_and_print().
> For users really want to utilize btrfs devices statistics, they should
> go check "btrfs device stats" periodically, and we should focus the
> kernel error messages to more important things.

I kind if disagree with each point.

The message is meant to be logged as it will happen in production and
outside of development, so the debug level does not make sense.

The stats message is not duplicated for the individual causes, it
additionally tracks the whole state.

Logging important messages to system log is a common thing and we do that
a lot, this makes debugging and anlyzing things easier. We can't
expect that there would always be a daemon collecting the stats, there's
not standardized or recommended tool for that. A quick look to dmesg can
show that something is wrong.

What we can do: reduce the number messages so the whole stats are
printed once per transaction if there is a change.

We can also tune which events also print the stats, for example flush
errors are more interesting than read/write, comparing the number of
events that can happen in a batch.




[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux