Re: [PATCH v2] xfs: make fatal assert failures conditional in debug mode

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, May 09, 2017 at 01:47:51PM -0400, Brian Foster wrote:
> On Tue, May 09, 2017 at 10:20:11AM -0700, Darrick J. Wong wrote:
> > On Tue, May 09, 2017 at 01:00:50PM -0400, Brian Foster wrote:
> > > On Tue, May 09, 2017 at 08:37:04AM -0700, Darrick J. Wong wrote:
> > > > On Tue, May 09, 2017 at 09:11:31AM -0400, Brian Foster wrote:
> > > > > On Tue, May 09, 2017 at 09:14:48AM +1000, Dave Chinner wrote:
> > > > > > On Mon, May 08, 2017 at 08:55:32AM -0400, Brian Foster wrote:
> > > > > > > On Sat, May 06, 2017 at 09:09:43AM +1000, Dave Chinner wrote:
> > > > > > > > On Fri, May 05, 2017 at 09:31:26AM -0400, Brian Foster wrote:
> > > > > > > > > XFS currently supports two debug modes: XFS_WARN enables assert
> > > > > > > > > failure warnings and XFS_DEBUG converts assert failures to fatal
> > > > > > > > > errors (via BUG()) and enables additional runtime debug code.
> > > > > > > > > 
> > > > > > > > > While the behavior to BUG the kernel on assert failure is useful in
> > > > > > > > > certain test scenarios, it is also useful for development/debug to
> > > > > > > > > enable debug mode code without having to crash the kernel on an
> > > > > > > > > assert failure.
> > > > > > > > > 
> > > > > > > > > To provide this additional flexibility, update XFS debug mode to not
> > > > > > > > > BUG() the kernel by default and create a new XFS kernel
> > > > > > > > > configuration option to enable fatal assert failures when debug mode
> > > > > > > > > is enabled. To provide backwards compatibility with current
> > > > > > > > > behavior, enable the fatal asserts option by default when debug mode
> > > > > > > > > is enabled.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > > > > > > 
> > > > > > > > Just a suggestion, but why make this a compile time option? Why not
> > > > > > > > a sysfs variable under /sys/fs/xfs/debug? That would be far more
> > > > > > > > useful to me - a single kernel that can be configure to just warn or
> > > > > > > > bug() dynamically. That will save us from having to rebuild a kernel
> > > > > > > > just to enable this functionality, then rebuild again to turn it
> > > > > > > > off..
> > > > > > > > 
> > > > > > > 
> > > > > > > I hadn't really considered that approach. The obvious drawback for me is
> > > > > > > that whatever option is not default has to be reset on every boot or
> > > > > > > module reload,
> > > > > > 
> > > > > > That's what sysctl is for.
> > > > > > 
> > > > > 
> > > > > I suppose that helps, but only partially. It's still an extra step on a
> > > > > new system (re: think disposable test system environments). I also don't
> > > > > think it helps across module reloads (?), which means in certain cases
> > > > > I'm most likely just back to holding a local patch to comment out the
> > > > > BUG() (which is fine too).
> > > > 
> > > > How about a runtime config variable (sysctl, module param,
> > > > {sys,config,debug}fs knob...) and a Kconfig option to set the default
> > > > value of the knob?  That enables dangerous group xfstests to switch off
> > > > the crash-on-assert behavior and allows developers to preconfigure
> > > > whatever's most convenient for them.
> > > > 
> > > 
> > > Yeah, that's what I was alluding to wrt the sidetrack discussion below.
> > > I'm still curious what the use case is for runtime switching of assert
> > > behavior alone (I don't think the xfstests case would be helped by
> > > it..?), but I don't object to including both if a knob is useful for
> > 
> > Some XFS code does things like:
> > 
> > if (badness) {
> > 	ASSERT(0);
> > 	return -EFSCORRUPTED;
> > }
> > 
> > Other parts do:
> > 
> > ASSERT(!badness);
> > /* code that basically does nothing if badness */
> > return nothinghappened;
> > 
> > In both of these cases, we can return something to userspace to indicate
> > that the requested operation didn't happen or went bonkers somewhere...
> > however there's the ASSERT there to make the badness fail noisily for
> > debug kernels.
> > 
> 
> Sure.. where today "noisely" is defined by whether the kernel config
> enables XFS_WARN or XFS_DEBUG. Where does runtime configuration of
> whether assert failures call BUG() come into play here?
> 
> > Moreover, some of the fuzzer tests do something like this:
> > 
> > 1. mkfs + populate fs
> > 2. corrupt fs
> > 3. try to use fs to see if we hit an error code
> > 4. unmount + xfs_repair
> > 5. retry step 3
> > 
> > It would be useful for this particular case to be able to keep going
> > after an ASSERT, because (provided we don't blow up internal kernel
> > state) we're effectively testing that xfs_repair actually fixes things
> > that make the kernel unhappy.
> > 
> > The really nasty xfs_scrub/xfs_repair fuzz tests systematically corrupt
> > every field in a metadata object, so they do:
> > 
> > 1. mkfs + populate fs
> > 2. for every field in <someobject>:
> >    a. corrupt field
> >    b. try to use fs to see if we hit an error code
> >    c. unmount + xfs_repair (or xfs_scrub)
> >    d. retry step 3
> > 
> > Since we're doing all fields in a loop, it would be helpful to be able
> > to run temporarily with BUG*() disabled so that we can try to stumble on
> > to the next field and generate a more thorough report instead of
> > crashing at the first sign of badness.
> > 
> 
> Hmm, Ok. So if you have a test that intentionally corrupts the fs and
> proceeds to use it, then assert failures can be expected. The purpose of
> the test is not necessarily to run "clean," but more to validate that
> the kernel doesn't keel over and presumably that xfs_repair can recover.

Correct.

> That seems like a reasonable argument to me, but do these tests actually
> depend on DEBUG code? Or is it simply that it's arguably wrong to panic
> during a test that expects to cause assert failures, and thus we'd like
> to override that particular behavior from the test if DEBUG mode happens
> to be enabled?

They don't specifically go looking to trip up ASSERTs, but sometimes
they hit them anyway.  I'm slowly converting those to "complain, take
the fs offline, and return -EFSCORRUPTED" as I encounter them.

--D

> 
> Brian
> 
> > > somebody. My purpose/motivation for this was more based on the idea that
> > > I rarely, if ever, have a need for assert failures to bug the kernel.
> > > I'd like to be able to just turn it off in my default configs (without
> > > disrupting the cases where it is useful).
> > 
> > Like you, I also have patches that quiet down the various warnings.
> > Would be nice not to have to keep carrying those.
> > 
> > > > > > > the latter of which tends to be a common action in my use
> > > > > > > cases (e.g., adding debug code to do specialized debugging or for some
> > > > > > > new development work and reloading xfs as a kernel module). It's kind of
> > > > > > > the opposite problem for general regression testing if we were to change
> > > > > > > the default from BUG() to warn, for example. The tester would have to
> > > > > > > remember to (or know) to twiddle the knob if one is expecting assert
> > > > > > > failures to generate a BUG() and crash report.
> > > > > > 
> > > > > > Default would be the same as today - BUG on assert - so this
> > > > > > wouldn't matter.
> > > > > > 
> > > > > 
> > > > > Yes, the above is an analogy.
> > > > > 
> > > > > > > So for me, the ability to live switch between BUG() or warn in debug
> > > > > > > mode doesn't add value. In fact, it is less ideal than just being able
> > > > > > > to (re)compile a kernel module and load it with expected behavior. That
> > > > > > 
> > > > > > Who uses kernel modules for testing? I just use monolithic kernels
> > > > > > because I can boot a new kernel in less time than it takes copy in
> > > > > > and reload a new module to the test machine.
> > > > 
> > > > I do, because usually my code isn't so bad that it trashes the kernel
> > > > state to the point of needing a reboot.  Lucky me. :P
> > > > 
> > > 
> > > Such arrogance! :D
> > > 
> > > > (Now granted the test machine boots so slowly I go for a walk and the
> > > > VMs boot so quickly I barely lose any time, so I don't really care one
> > > > way or another... :))
> > > > 
> > > > > I use kernel modules quite a bit. For one, I tend to use test machines
> > > > > that can take a minute or two to boot. Even when using a local vm,
> > > > > copying a kernel module and reloading works quite well for me for
> > > > > development, because I may have various contexts (i.e., screen/tmux
> > > > > sessions) that I prefer not to lose and I tend to work iteratively.
> > > > > 
> > > > > But this is anecdotal... I don't think it matters much what workflow is
> > > > > the one true best for testing or development. Rather, the question is
> > > > > whether a workflow is common and useful to enough people to justify a
> > > > > configuration option.
> > > > > 
> > > > > > > said, that's just my admittedly selfish use case. The ability to switch
> > > > > > > off BUG() at all is still an improvement over the current situation, so
> > > > > > > I'm open to a runtime knob if that is the more broadly useful solution.
> > > > > > > Care to elaborate on how that is more useful to you?
> > > > > > 
> > > > > > e.g. think of the dangerous tests in xfstests that don't get run
> > > > > > because they fire an assert and kill the test machine. have the test
> > > > > > harness set "warn only" for the dangerous tests and now those tests
> > > > > > are no longer dangerous and can be run as part of the auto group...
> > > > > > 
> > > > > 
> > > > > It looks like the majority of the dangerous tests are tagged as such
> > > > > because they historically (and legitimately) panic, crash or hang a
> > > > > system. I don't think they are any more or less dangerous based on
> > > > > assert behavior. Indeed, a quick check with the latest kernel doesn't
> > > > > result in any dangerous test failures in either XFS_DEBUG or XFS_WARN
> > > > > mode (though one or two were skipped).
> > > > > 
> > > > > > > A bit of a sidetrack...
> > > > > > > 
> > > > > > > To me, runtime live switching seems a bit more appropriate for something
> > > > > > > at a higher level of enabling/disabling debug mode entirely as opposed
> > > > > > > to solely assert behavior (for which it seems like overkill). A couple
> > > > > > > problems with that are bloating the kernel and efficiency associated
> > > > > > > with losing the ability to compile out asserts, both of which may make
> > > > > > > something like that not realistic.
> > > > > > > 
> > > > > > > I do wonder, however, whether we could condense the current kernel
> > > > > > > configuration into effectively two logical modes: production and debug.
> > > > > > > The latter debug mode simply compiles in all of the debug code, but
> > > > > > > supports various sub-modes: disabled, warn, debug (as today)[1]. The
> > > > > > 
> > > > > > We've talked about this in the past and enabling debug like
> > > > > > having the allocator run random selection paths rather than optimal
> > > > > > paths can lead to premature freespace fragmetnation and aging of the
> > > > > > filesystem - exactly what you don't want for a production system
> > > > > > you're trying to diagnose a problem on.
> > > > > > 
> > > > > 
> > > > > Good point. I agree that we wouldn't want to enable things like the
> > > > > random allocation algorithms, randomly doing sparse inode allocations
> > > > > and such things that generally affect the structure on disk.
> > > > > 
> > > > > As mentioned previously, we can be more granular than the current binary
> > > > > toggle for debug mode. E.g., we could separate diagnostic mechanisms
> > > > > from test coverage mechanisms and enable the latter at a higher debug
> > > > > level or with a separate option entirely, if desired. IOW, I don't think
> > > > > that's a difficult problem to solve.
> > > > 
> > > > Agreed.
> > > > 
> > > > > > Also, there's debug code that doesn't scale well (such as extent
> > > > > > list checking) and that sort of thing can badly affect performance -
> > > > > > these are gotchas in debug builds that production diagnositic
> > > > > > kernels should not trip over....
> > > > > > 
> > > > > 
> > > > > Yeah, but I think there's a host of other things one can already enable
> > > > > on a debug kernel package (I'm thinking about all of the memory related
> > > > > bits, etc.) that can kill performance just as much, if not worse. It's
> > > > > still a separate debug package after all. However, that is a good
> > > > > argument for debug modes to be disabled by default (or effectively set
> > > > > to XFS_WARN as it is defined today) in such configs, because there is no
> > > > > guarantee the debug kernel is installed to debug an XFS problem.
> > > > > 
> > > > > > > default debug sub-mode can be selected at kernel compile time and
> > > > > > > toggled at runtime. So effectively, a debug enabled kernel has the
> > > > > > > ability to support arbitrary modes and a production kernel still has all
> > > > > > > of that crap compiled out. This would allow, for example, a distro debug
> > > > > > > kernel package to ship and enable actual debug code at runtime rather
> > > > > > > than be limited to XFS_WARN, which is at least what we (rh) do today.
> > > > > > > Thoughts? Useful, overkill?
> > > > > > 
> > > > > > XFS_WARN was the tradeoff for getting useful assert information out
> > > > > > of production machines without impacting performance, allocation,
> > > > > > etc by enabling the full debug code. I don't think anything has
> > > > > > changed that alters that tradeoff since we added XFS_WARN...
> > > > > > 
> > > > > 
> > > > > I'm not following the logic here... I don't think there's anything
> > > > > inherently wrong with XFS_WARN as it is deployed today. I was just
> > > > > suggesting that if we're going to think about providing dynamic debug
> > > > > options, perhaps there's more benefit to step back and expose some of
> > > > > the other diagnostic mechanisms that we have available rather than
> > > > > simply to toggle assert behavior. IOWs, let the kernel config control
> > > > > what is compiled in or not rather than behavior and let runtime
> > > > > configuration control behavior.
> > > > 
> > > > Agreed here too, though I think it's acceptable to have a Kconfig option
> > > > to set the default values of the knob(s).
> > > > 
> > > 
> > > Yep. The more I thought about the above, I'd expect it would actually
> > > require the configuration side of things to basically have two options.
> > > First, a selector to compile debug capabilities into the kernel or not.
> > > Second, and dependent on the first, a menu option or something that
> > > selects the default mode on module load (e.g., Disabled, Warn only,
> > > Diagnostic, XXXdangerXXXdontusethisunlessyouhackXFS), where the mode is
> > > also selectable at runtime. We could effectively do something similar
> > > just for assert behavior in debug mode.
> > 
> > Yes.  I'd start with assert behavior in debug mode since we're already
> > talking about it and carrying patches around.
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > > Cheers,
> > > > > > 
> > > > > > Dave.
> > > > > > -- 
> > > > > > Dave Chinner
> > > > > > david@xxxxxxxxxxxxx
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux