Re: [RFC PATCH v2 0/4] make jbd2 debug switch per device

brookxu <brookxu.cn@xxxxxxxxx> · Thu, 28 Jan 2021 19:39:37 +0800

Theodore Ts'o wrote on 2021/1/28 0:21:
> On Tue, Jan 26, 2021 at 08:50:02AM +0800, brookxu wrote:
>>
>> trace point, eBPF and other hook technologies are better for production
>> environments. But for pure debugging work, adding hook points feels a bit
>> heavy. However, your suggestion is very valuable, thank you very much.
> 
> What feels heavy?  The act of adding a new jbd_debug() statement to
> the sources, versus adding a new tracepoint?  Or how to enable a set
> of tracepoints versus setting a jbd_debug level (either globally, or
> per mount point)?  Or something else?

Sorry, I wasn't clear. By "heavy" I mean the amount of code that has to be
modified and the data analysis that follows. We are mainly confirming that
the process behaves as expected. Doing that with trace points takes a
relatively large amount of code, whereas adding log statements is simple.
Logging also keeps the changes to both the kernel and the analysis scripts
relatively small.

> If it's the latter (which is what I think it is), how often are you
> needing to add a new jbd_debug() statement *and* needing to run in a
> test environment where you have multiple disks?  How often is it
> useful to have multiple disks when doing your debugging?

We don't use JBD2_DEBUG much in our work. In most cases we add hook points
and analyze the data they produce. In this case, though, we are only
confirming the process, so the hook-point approach would need many hook
points and a relatively large amount of work. In addition, these hook
points are not needed in the production environment, so adding them may be
a waste of time.

> I'm trying to understand why this has been useful to you, since that
> generally doesn't match with my development, testing, or debugging
> experience.  In general I try to test with one file system at a time,
> since I'm trying to find something reproducible.  Do you have cases
> where you need multiple file systems in your test environment in order
> to do your testing?  Why is that?  Is it because you're trying to use
> your production server code as your test reproducers?  And if so, I
> would have thought adding the jbd_debug() statements and sending lots
> of console print messages would distort the timing enough to make it
> hard to reproduce a problem found in your production environment.

In our mixed-deployment production environment, we occasionally see
priority inversion between containers: low-priority containers affect the
QoS of high-priority containers. We are trying to make ext4 work better in
container scenarios. After basic testing, we test with the business
programs, because their I/O behavior is more complex. Note that here we
mainly care about the correctness of the process, not particularly about
performance.

> It sounds like you have a very different set of test practices than
> what I'm used to, and I'm trying to understand it better.

:) Perhaps my verification method is not optimal, but I found that jbd2
already has a similar framework, tried to use it, and then found that some
things could be improved.

> Cheers,
> 
> 						- Ted
>