Re: [PATCH v2] xfs: make fatal assert failures conditional in debug mode

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Tue, 9 May 2017 08:37:04 -0700

On Tue, May 09, 2017 at 09:11:31AM -0400, Brian Foster wrote:
> On Tue, May 09, 2017 at 09:14:48AM +1000, Dave Chinner wrote:
> > On Mon, May 08, 2017 at 08:55:32AM -0400, Brian Foster wrote:
> > > On Sat, May 06, 2017 at 09:09:43AM +1000, Dave Chinner wrote:
> > > > On Fri, May 05, 2017 at 09:31:26AM -0400, Brian Foster wrote:
> > > > > XFS currently supports two debug modes: XFS_WARN enables assert
> > > > > failure warnings and XFS_DEBUG converts assert failures to fatal
> > > > > errors (via BUG()) and enables additional runtime debug code.
> > > > > 
> > > > > While the behavior to BUG the kernel on assert failure is useful in
> > > > > certain test scenarios, it is also useful for development/debug to
> > > > > enable debug mode code without having to crash the kernel on an
> > > > > assert failure.
> > > > > 
> > > > > To provide this additional flexibility, update XFS debug mode to not
> > > > > BUG() the kernel by default and create a new XFS kernel
> > > > > configuration option to enable fatal assert failures when debug mode
> > > > > is enabled. To provide backwards compatibility with current
> > > > > behavior, enable the fatal asserts option by default when debug mode
> > > > > is enabled.
> > > > > 
> > > > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > > 
> > > > Just a suggestion, but why make this a compile time option? Why not
> > > > a sysfs variable under /sys/fs/xfs/debug? That would be far more
> > > > useful to me - a single kernel that can be configure to just warn or
> > > > bug() dynamically. That will save us from having to rebuild a kernel
> > > > just to enable this functionality, then rebuild again to turn it
> > > > off..
> > > > 
> > > 
> > > I hadn't really considered that approach. The obvious drawback for me is
> > > that whatever option is not default has to be reset on every boot or
> > > module reload,
> > 
> > That's what sysctl is for.
> > 
> 
> I suppose that helps, but only partially. It's still an extra step on a
> new system (re: think disposable test system environments). I also don't
> think it helps across module reloads (?), which means in certain cases
> I'm most likely just back to holding a local patch to comment out the
> BUG() (which is fine too).

How about a runtime config variable (sysctl, module param,
{sys,config,debug}fs knob...) and a Kconfig option to set the default
value of the knob?  That enables dangerous group xfstests to switch off
the crash-on-assert behavior and allows developers to preconfigure
whatever's most convenient for them.

> > > the latter of which tends to be a common action in my use
> > > cases (e.g., adding debug code to do specialized debugging or for some
> > > new development work and reloading xfs as a kernel module). It's kind of
> > > the opposite problem for general regression testing if we were to change
> > > the default from BUG() to warn, for example. The tester would have to
> > > remember to (or know) to twiddle the knob if one is expecting assert
> > > failures to generate a BUG() and crash report.
> > 
> > Default would be the same as today - BUG on assert - so this
> > wouldn't matter.
> > 
> 
> Yes, the above is an analogy.
> 
> > > So for me, the ability to live switch between BUG() or warn in debug
> > > mode doesn't add value. In fact, it is less ideal than just being able
> > > to (re)compile a kernel module and load it with expected behavior. That
> > 
> > Who uses kernel modules for testing? I just use monolithic kernels
> > because I can boot a new kernel in less time than it takes copy in
> > and reload a new module to the test machine.

I do, because usually my code isn't so bad that it trashes the kernel
state to the point of needing a reboot.  Lucky me. :P

(Now granted the test machine boots so slowly I go for a walk and the
VMs boot so quickly I barely lose any time, so I don't really care one
way or another... :))

> I use kernel modules quite a bit. For one, I tend to use test machines
> that can take a minute or two to boot. Even when using a local vm,
> copying a kernel module and reloading works quite well for me for
> development, because I may have various contexts (i.e., screen/tmux
> sessions) that I prefer not to lose and I tend to work iteratively.
> 
> But this is anecdotal... I don't think it matters much what workflow is
> the one true best for testing or development. Rather, the question is
> whether a workflow is common and useful to enough people to justify a
> configuration option.
> 
> > > said, that's just my admittedly selfish use case. The ability to switch
> > > off BUG() at all is still an improvement over the current situation, so
> > > I'm open to a runtime knob if that is the more broadly useful solution.
> > > Care to elaborate on how that is more useful to you?
> > 
> > e.g. think of the dangerous tests in xfstests that don't get run
> > because they fire an assert and kill the test machine. have the test
> > harness set "warn only" for the dangerous tests and now those tests
> > are no longer dangerous and can be run as part of the auto group...
> > 
> 
> It looks like the majority of the dangerous tests are tagged as such
> because they historically (and legitimately) panic, crash or hang a
> system. I don't think they are any more or less dangerous based on
> assert behavior. Indeed, a quick check with the latest kernel doesn't
> result in any dangerous test failures in either XFS_DEBUG or XFS_WARN
> mode (though one or two were skipped).
> 
> > > A bit of a sidetrack...
> > > 
> > > To me, runtime live switching seems a bit more appropriate for something
> > > at a higher level of enabling/disabling debug mode entirely as opposed
> > > to solely assert behavior (for which it seems like overkill). A couple
> > > problems with that are bloating the kernel and efficiency associated
> > > with losing the ability to compile out asserts, both of which may make
> > > something like that not realistic.
> > > 
> > > I do wonder, however, whether we could condense the current kernel
> > > configuration into effectively two logical modes: production and debug.
> > > The latter debug mode simply compiles in all of the debug code, but
> > > supports various sub-modes: disabled, warn, debug (as today)[1]. The
> > 
> > We've talked about this in the past and enabling debug like
> > having the allocator run random selection paths rather than optimal
> > paths can lead to premature freespace fragmetnation and aging of the
> > filesystem - exactly what you don't want for a production system
> > you're trying to diagnose a problem on.
> > 
> 
> Good point. I agree that we wouldn't want to enable things like the
> random allocation algorithms, randomly doing sparse inode allocations
> and such things that generally affect the structure on disk.
> 
> As mentioned previously, we can be more granular than the current binary
> toggle for debug mode. E.g., we could separate diagnostic mechanisms
> from test coverage mechanisms and enable the latter at a higher debug
> level or with a separate option entirely, if desired. IOW, I don't think
> that's a difficult problem to solve.

Agreed.

> > Also, there's debug code that doesn't scale well (such as extent
> > list checking) and that sort of thing can badly affect performance -
> > these are gotchas in debug builds that production diagnositic
> > kernels should not trip over....
> > 
> 
> Yeah, but I think there's a host of other things one can already enable
> on a debug kernel package (I'm thinking about all of the memory related
> bits, etc.) that can kill performance just as much, if not worse. It's
> still a separate debug package after all. However, that is a good
> argument for debug modes to be disabled by default (or effectively set
> to XFS_WARN as it is defined today) in such configs, because there is no
> guarantee the debug kernel is installed to debug an XFS problem.
> 
> > > default debug sub-mode can be selected at kernel compile time and
> > > toggled at runtime. So effectively, a debug enabled kernel has the
> > > ability to support arbitrary modes and a production kernel still has all
> > > of that crap compiled out. This would allow, for example, a distro debug
> > > kernel package to ship and enable actual debug code at runtime rather
> > > than be limited to XFS_WARN, which is at least what we (rh) do today.
> > > Thoughts? Useful, overkill?
> > 
> > XFS_WARN was the tradeoff for getting useful assert information out
> > of production machines without impacting performance, allocation,
> > etc by enabling the full debug code. I don't think anything has
> > changed that alters that tradeoff since we added XFS_WARN...
> > 
> 
> I'm not following the logic here... I don't think there's anything
> inherently wrong with XFS_WARN as it is deployed today. I was just
> suggesting that if we're going to think about providing dynamic debug
> options, perhaps there's more benefit to step back and expose some of
> the other diagnostic mechanisms that we have available rather than
> simply to toggle assert behavior. IOWs, let the kernel config control
> what is compiled in or not rather than behavior and let runtime
> configuration control behavior.

Agreed here too, though I think it's acceptable to have a Kconfig option
to set the default values of the knob(s).

--D

> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@xxxxxxxxxxxxx
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html