Re: [PATCH 2/2] moderr: add module error injection tool

Lucas De Marchi <lucas.demarchi@xxxxxxxxx> · Fri, 28 Feb 2025 12:48:38 -0600

On Fri, Feb 28, 2025 at 10:27:17AM +0100, Daniel Gomez wrote:
On Mon, Feb 24, 2025 at 08:43:45AM +0100, Lucas De Marchi wrote:
On Sat, Feb 22, 2025 at 10:35:07PM +0100, Daniel Gomez wrote:
> On Fri, Feb 21, 2025 at 12:15:40PM +0100, Luis Chamberlain wrote:
> > On Wed, Feb 19, 2025 at 02:17:48PM -0600, Lucas De Marchi wrote:
> > > On Tue, Jan 28, 2025 at 12:57:05PM -0800, Luis Chamberlain wrote:
> > > > On Wed, Jan 22, 2025 at 09:02:19AM -0800, Alexei Starovoitov wrote:
> > > > > On Wed, Jan 22, 2025 at 5:12 AM Daniel Gomez <da.gomez@xxxxxxxxxxx> wrote:
> > > > > >
> > > > > > Add support for a module error injection tool. The tool
> > > > > > can inject errors in the annotated module kernel functions
> > > > > > such as complete_formation(), do_init_module() and
> > > > > > module_enable_rodata_after_init(). Module name and module function are
> > > > > > required parameters to have control over the error injection.
> > > > > >
> > > > > > Example: Inject error -22 to module_enable_rodata_ro_after_init for
> > > > > > brd module:
> > > > > >
> > > > > > sudo moderr --modname=brd --modfunc=module_enable_rodata_ro_after_init \
> > > > > > --error=-22 --trace
> > > > > > Monitoring module error injection... Hit Ctrl-C to end.
> > > > > > MODULE     ERROR FUNCTION
> > > > > > brd        -22   module_enable_rodata_after_init()
> > > > > >
> > > > > > Kernel messages:
> > > > > > [   89.463690] brd: module loaded
> > > > > > [   89.463855] brd: module_enable_rodata_ro_after_init() returned -22,
> > > > > > ro_after_init data might still be writable
> > > > > >
> > > > > > Signed-off-by: Daniel Gomez <da.gomez@xxxxxxxxxxx>
> > > > > > ---
> > > > > >  tools/bpf/Makefile            |  13 ++-
> > > > > >  tools/bpf/moderr/.gitignore   |   2 +
> > > > > >  tools/bpf/moderr/Makefile     |  95 +++++++++++++++++
> > > > > >  tools/bpf/moderr/moderr.bpf.c | 127 +++++++++++++++++++++++
> > > > > >  tools/bpf/moderr/moderr.c     | 236 ++++++++++++++++++++++++++++++++++++++++++
> > > > > >  tools/bpf/moderr/moderr.h     |  40 +++++++
> > > > > >  6 files changed, 510 insertions(+), 3 deletions(-)
> > > > >
> > > > > The tool looks useful, but we don't add tools to the kernel repo.
> > > > > It has to stay out of tree.
> > > >
> > > > For selftests we do add random tools.
> > > >
> > > > > The value of error injection is not clear to me.
> > > >
> > > > It is of great value, since it deals with corner cases which are
> > > > otherwise hard to reproduce in places which a real error can be
> > > > catostrophic.
> > > >
> > > > > Other places in the kernel use it to test paths in the kernel
> > > > > that are difficult to do otherwise.
> > > >
> > > > Right.
> > > >
> > > > > These 3 functions don't seem to be in this category.
> > > >
> > > > That's the key here we should focus on. The problem is when a maintainer
> > > > *does* agree that adding an error injection entry is useful for testing,
> > > > and we have a developer willing to do the work to help test / validate
> > > > it. In this case, this error case is rare but we do want to strive to
> > > > test this as we ramp up and extend our modules selftests.
> > > >
> > > > Then there is the aspect of how to mitigate how instrusive code changes
> > > > to allow error injection are. In 2021 we evaluated the prospect of error
> > > > injection in-kernel long ago for other areas like the block layer for
> > > > add_disk() failures [0] but the minimal interface to enable this from
> > > > userspace with debugfs was considered just too intrusive.
> > > >
> > > > This effort tried to evaluate what this could look like with eBPF to
> > > > mitigate the required in-kernel code, and I believe the light weight
> > > > nature of it by just requiring a sprinkle with ALLOW_ERROR_INJECTION()
> > > > suffices to my taste.
> > > >
> > > > So, perhaps the tools aspect can just go in:
> > > >
> > > > tools/testing/selftests/module/
> > >
> > > but why would it be module-specific?
> >
> > Gotta start somewhere.
> >
> > > Based on its current implementation
> > > and discussion about inject.py it seems to be generic enough to be
> > > useful to test any function annotated with ALLOW_ERROR_INJECTION().
> > >
> > > As xe driver maintainer, it may be interesting to use such a tool:
> > >
> > > 	$ git grep ALLOW_ERROR_INJECT -- drivers/gpu/drm/xe | wc -l  	23
> > >
> > > How does this approach compare to writing the function name on debugfs
> > > (the current approach in xe's testsuite)?
> > >
> > > 	fail_function @ https://docs.kernel.org/fault-injection/fault-injection.html#fault-injection-capabilities-infrastructure
> > > 	https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/intel/xe_fault_injection.c?ref_type=heads#L108
> > >
> > > If you decide to have the tool to live somewhere else, then kmod repo
> > > could be a candidate.
> >
> > Would we install this upon install target?
> >
> > Danny can decide on this :)
> >
> > > Although I think having it in kernel tree is
> > > simpler maintenance-wise.
> >
> > I think we have at least two users upstream who can make use of it. If
> > we end up going through tools/testing/selftests/module/ first, can't
> > you make use of it later?
>
> What are the features in debugfs required to be useful for xe that we can
> port to an eBPF version? I see from the link provided the use of probability,
> interval, times and space but these are configured to allways trigger the error.
> Is that right?

I don't think we use them... we just set them to "always trigger" and
then create the conditions for that to happen.  But my question still
remains:  what is the benefit of using the bpf approach over writing
these files?

The code to trigger the error injection still needs to exist with both
approaches. My understanding from debugfs and the comment from Luis earlier in
the thread is that with eBPF you can mitigate how intrusive in-kernel code
changes are to allow error injection. Luis added the example here [1] for
debugfs.

[1] https://lore.kernel.org/all/20210512064629.13899-9-mcgrof@xxxxxxxxxx/

To compare patch 8 in [1] with eBPF approach: the patch describes
all the necessary changes required to allow error injection on the
add_disk() path. With eBPF one would simply annotate the function(s) with
ALLOW_ERROR_INJECTION(), e.g. device_add() and replace the return value
in eBPF code with bpf_override_return() as implemented in moderr tool for
module_enable_rdata_after_init() for example.

but that is all that we need with the fail_function in debugfs too:

$ git grep ALLOW_ERROR_INJECTION -- drivers/gpu/drm/xe
drivers/gpu/drm/xe/xe_device.c:ALLOW_ERROR_INJECTION(xe_device_create, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_device.c:ALLOW_ERROR_INJECTION(wait_for_lmem_ready, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_exec_queue.c:ALLOW_ERROR_INJECTION(xe_exec_queue_create_bind, ERRNO);
drivers/gpu/drm/xe/xe_ggtt.c:ALLOW_ERROR_INJECTION(xe_ggtt_init_early, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_guc_ads.c:ALLOW_ERROR_INJECTION(xe_guc_ads_init, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_guc_ct.c:ALLOW_ERROR_INJECTION(xe_guc_ct_init, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_guc_log.c:ALLOW_ERROR_INJECTION(xe_guc_log_init, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_guc_relay.c:ALLOW_ERROR_INJECTION(xe_guc_relay_init, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_pci.c: * ALLOW_ERROR_INJECTION() is used to conditionally skip function execution
drivers/gpu/drm/xe/xe_pci.c: * ALLOW_ERROR_INJECTION() macro but this is acceptable because for those
drivers/gpu/drm/xe/xe_pm.c:ALLOW_ERROR_INJECTION(xe_pm_init_early, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_pt.c:ALLOW_ERROR_INJECTION(xe_pt_create, ERRNO);
drivers/gpu/drm/xe/xe_pt.c:ALLOW_ERROR_INJECTION(xe_pt_update_ops_prepare, ERRNO);
drivers/gpu/drm/xe/xe_pt.c:ALLOW_ERROR_INJECTION(xe_pt_update_ops_run, ERRNO);
drivers/gpu/drm/xe/xe_sriov.c:ALLOW_ERROR_INJECTION(xe_sriov_init, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_sync.c:ALLOW_ERROR_INJECTION(xe_sync_entry_parse, ERRNO);
drivers/gpu/drm/xe/xe_tile.c:ALLOW_ERROR_INJECTION(xe_tile_init_early, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_uc_fw.c:ALLOW_ERROR_INJECTION(xe_uc_fw_init, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(xe_vma_ops_alloc, ERRNO);
drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(xe_vm_create_scratch, ERRNO);
drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(vm_bind_ioctl_ops_create, ERRNO);
drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(vm_bind_ioctl_ops_execute, ERRNO);
drivers/gpu/drm/xe/xe_wa.c:ALLOW_ERROR_INJECTION(xe_wa_init, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_wopcm.c:ALLOW_ERROR_INJECTION(xe_wopcm_init, ERRNO); /* See xe_pci_probe() */

That is different from the patch you are pointing to because that patch
is trying to add arbitrary/named error injection points throughout the
code. However via debugfs it's still possible to add error injection to
the beginning of a function by annotating that function with
ALLOW_ERROR_INJECTION. If a function is annotated with that, then if you
do e.g.

	echo xe_device_create > /sys/kernel/debug/fail_function/inject

it will cause that function to fail. There are some additional files to
control _when_ that function should fail, but I'm failing to see a clear
benefit. See this example in the docs:

	Documentation/fault-injection/fault-injection.rst:- Inject open_ctree error while btrfs mount::

Lucas De Marchi

New error injection users such modules or block/disk would benefit of the eBPF
approach with having a more simpler in-kernel code for the same purpose. Current
users of debugfs/errorinj would have to remove all the upstream intrusive error
injection related code if they want to adopt the eBPF approach.

Daniel

Lucas De Marchi

>
> >
> >   Luis