Re: [PATCH v2] xfs: make fatal assert failures conditional in debug mode

Brian Foster <bfoster@xxxxxxxxxx> · Tue, 9 May 2017 13:00:50 -0400

On Tue, May 09, 2017 at 08:37:04AM -0700, Darrick J. Wong wrote:
> On Tue, May 09, 2017 at 09:11:31AM -0400, Brian Foster wrote:
> > On Tue, May 09, 2017 at 09:14:48AM +1000, Dave Chinner wrote:
> > > On Mon, May 08, 2017 at 08:55:32AM -0400, Brian Foster wrote:
> > > > On Sat, May 06, 2017 at 09:09:43AM +1000, Dave Chinner wrote:
> > > > > On Fri, May 05, 2017 at 09:31:26AM -0400, Brian Foster wrote:
> > > > > > XFS currently supports two debug modes: XFS_WARN enables assert
> > > > > > failure warnings and XFS_DEBUG converts assert failures to fatal
> > > > > > errors (via BUG()) and enables additional runtime debug code.
> > > > > > 
> > > > > > While the behavior to BUG the kernel on assert failure is useful in
> > > > > > certain test scenarios, it is also useful for development/debug to
> > > > > > enable debug mode code without having to crash the kernel on an
> > > > > > assert failure.
> > > > > > 
> > > > > > To provide this additional flexibility, update XFS debug mode to not
> > > > > > BUG() the kernel by default and create a new XFS kernel
> > > > > > configuration option to enable fatal assert failures when debug mode
> > > > > > is enabled. To provide backwards compatibility with current
> > > > > > behavior, enable the fatal asserts option by default when debug mode
> > > > > > is enabled.
> > > > > > 
> > > > > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > > > 
> > > > > Just a suggestion, but why make this a compile time option? Why not
> > > > > a sysfs variable under /sys/fs/xfs/debug? That would be far more
> > > > > useful to me - a single kernel that can be configure to just warn or
> > > > > bug() dynamically. That will save us from having to rebuild a kernel
> > > > > just to enable this functionality, then rebuild again to turn it
> > > > > off..
> > > > > 
> > > > 
> > > > I hadn't really considered that approach. The obvious drawback for me is
> > > > that whatever option is not default has to be reset on every boot or
> > > > module reload,
> > > 
> > > That's what sysctl is for.
> > > 
> > 
> > I suppose that helps, but only partially. It's still an extra step on a
> > new system (re: think disposable test system environments). I also don't
> > think it helps across module reloads (?), which means in certain cases
> > I'm most likely just back to holding a local patch to comment out the
> > BUG() (which is fine too).
> 
> How about a runtime config variable (sysctl, module param,
> {sys,config,debug}fs knob...) and a Kconfig option to set the default
> value of the knob?  That enables dangerous group xfstests to switch off
> the crash-on-assert behavior and allows developers to preconfigure
> whatever's most convenient for them.
> 

Yeah, that's what I was alluding to wrt the sidetrack discussion below.
I'm still curious what the use case is for runtime switching of assert
behavior alone (I don't think the xfstests case would be helped by
it..?), but I don't object to including both if a knob is useful for
somebody. My purpose/motivation for this was more based on the idea that
I rarely, if ever, have a need for assert failures to bug the kernel.
I'd like to be able to just turn it off in my default configs (without
disrupting the cases where it is useful).

> > > > the latter of which tends to be a common action in my use
> > > > cases (e.g., adding debug code to do specialized debugging or for some
> > > > new development work and reloading xfs as a kernel module). It's kind of
> > > > the opposite problem for general regression testing if we were to change
> > > > the default from BUG() to warn, for example. The tester would have to
> > > > remember to (or know) to twiddle the knob if one is expecting assert
> > > > failures to generate a BUG() and crash report.
> > > 
> > > Default would be the same as today - BUG on assert - so this
> > > wouldn't matter.
> > > 
> > 
> > Yes, the above is an analogy.
> > 
> > > > So for me, the ability to live switch between BUG() or warn in debug
> > > > mode doesn't add value. In fact, it is less ideal than just being able
> > > > to (re)compile a kernel module and load it with expected behavior. That
> > > 
> > > Who uses kernel modules for testing? I just use monolithic kernels
> > > because I can boot a new kernel in less time than it takes copy in
> > > and reload a new module to the test machine.
> 
> I do, because usually my code isn't so bad that it trashes the kernel
> state to the point of needing a reboot.  Lucky me. :P
> 

Such arrogance! :D

> (Now granted the test machine boots so slowly I go for a walk and the
> VMs boot so quickly I barely lose any time, so I don't really care one
> way or another... :))
> 
> > I use kernel modules quite a bit. For one, I tend to use test machines
> > that can take a minute or two to boot. Even when using a local vm,
> > copying a kernel module and reloading works quite well for me for
> > development, because I may have various contexts (i.e., screen/tmux
> > sessions) that I prefer not to lose and I tend to work iteratively.
> > 
> > But this is anecdotal... I don't think it matters much what workflow is
> > the one true best for testing or development. Rather, the question is
> > whether a workflow is common and useful to enough people to justify a
> > configuration option.
> > 
> > > > said, that's just my admittedly selfish use case. The ability to switch
> > > > off BUG() at all is still an improvement over the current situation, so
> > > > I'm open to a runtime knob if that is the more broadly useful solution.
> > > > Care to elaborate on how that is more useful to you?
> > > 
> > > e.g. think of the dangerous tests in xfstests that don't get run
> > > because they fire an assert and kill the test machine. have the test
> > > harness set "warn only" for the dangerous tests and now those tests
> > > are no longer dangerous and can be run as part of the auto group...
> > > 
> > 
> > It looks like the majority of the dangerous tests are tagged as such
> > because they historically (and legitimately) panic, crash or hang a
> > system. I don't think they are any more or less dangerous based on
> > assert behavior. Indeed, a quick check with the latest kernel doesn't
> > result in any dangerous test failures in either XFS_DEBUG or XFS_WARN
> > mode (though one or two were skipped).
> > 
> > > > A bit of a sidetrack...
> > > > 
> > > > To me, runtime live switching seems a bit more appropriate for something
> > > > at a higher level of enabling/disabling debug mode entirely as opposed
> > > > to solely assert behavior (for which it seems like overkill). A couple
> > > > problems with that are bloating the kernel and efficiency associated
> > > > with losing the ability to compile out asserts, both of which may make
> > > > something like that not realistic.
> > > > 
> > > > I do wonder, however, whether we could condense the current kernel
> > > > configuration into effectively two logical modes: production and debug.
> > > > The latter debug mode simply compiles in all of the debug code, but
> > > > supports various sub-modes: disabled, warn, debug (as today)[1]. The
> > > 
> > > We've talked about this in the past and enabling debug like
> > > having the allocator run random selection paths rather than optimal
> > > paths can lead to premature freespace fragmetnation and aging of the
> > > filesystem - exactly what you don't want for a production system
> > > you're trying to diagnose a problem on.
> > > 
> > 
> > Good point. I agree that we wouldn't want to enable things like the
> > random allocation algorithms, randomly doing sparse inode allocations
> > and such things that generally affect the structure on disk.
> > 
> > As mentioned previously, we can be more granular than the current binary
> > toggle for debug mode. E.g., we could separate diagnostic mechanisms
> > from test coverage mechanisms and enable the latter at a higher debug
> > level or with a separate option entirely, if desired. IOW, I don't think
> > that's a difficult problem to solve.
> 
> Agreed.
> 
> > > Also, there's debug code that doesn't scale well (such as extent
> > > list checking) and that sort of thing can badly affect performance -
> > > these are gotchas in debug builds that production diagnositic
> > > kernels should not trip over....
> > > 
> > 
> > Yeah, but I think there's a host of other things one can already enable
> > on a debug kernel package (I'm thinking about all of the memory related
> > bits, etc.) that can kill performance just as much, if not worse. It's
> > still a separate debug package after all. However, that is a good
> > argument for debug modes to be disabled by default (or effectively set
> > to XFS_WARN as it is defined today) in such configs, because there is no
> > guarantee the debug kernel is installed to debug an XFS problem.
> > 
> > > > default debug sub-mode can be selected at kernel compile time and
> > > > toggled at runtime. So effectively, a debug enabled kernel has the
> > > > ability to support arbitrary modes and a production kernel still has all
> > > > of that crap compiled out. This would allow, for example, a distro debug
> > > > kernel package to ship and enable actual debug code at runtime rather
> > > > than be limited to XFS_WARN, which is at least what we (rh) do today.
> > > > Thoughts? Useful, overkill?
> > > 
> > > XFS_WARN was the tradeoff for getting useful assert information out
> > > of production machines without impacting performance, allocation,
> > > etc by enabling the full debug code. I don't think anything has
> > > changed that alters that tradeoff since we added XFS_WARN...
> > > 
> > 
> > I'm not following the logic here... I don't think there's anything
> > inherently wrong with XFS_WARN as it is deployed today. I was just
> > suggesting that if we're going to think about providing dynamic debug
> > options, perhaps there's more benefit to step back and expose some of
> > the other diagnostic mechanisms that we have available rather than
> > simply to toggle assert behavior. IOWs, let the kernel config control
> > what is compiled in or not rather than behavior and let runtime
> > configuration control behavior.
> 
> Agreed here too, though I think it's acceptable to have a Kconfig option
> to set the default values of the knob(s).
> 

Yep. The more I thought about the above, I'd expect it would actually
require the configuration side of things to basically have two options.
First, a selector to compile debug capabilities into the kernel or not.
Second, and dependent on the first, a menu option or something that
selects the default mode on module load (e.g., Disabled, Warn only,
Diagnostic, XXXdangerXXXdontusethisunlessyouhackXFS), where the mode is
also selectable at runtime. We could effectively do something similar
just for assert behavior in debug mode.

Brian

> --D
> 
> > Brian
> > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@xxxxxxxxxxxxx
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html