Re: [PATCH v1 8/8] block: add add_disk() failure injection support

Hannes Reinecke <hare@xxxxxxx> · Wed, 12 May 2021 17:22:48 +0200

On 5/12/21 8:46 AM, Luis Chamberlain wrote:
For a long time we have lived without any error handling
on the add_disk() error path. Now that we have some initial
error handling, add error injection support for its path so
that we can test it and ensure we don't regress this path
moving forward.

This only adds runtime code *iff* the new bool CONFIG_FAIL_ADD_DISK is
enabled in your kernel. If you don't have this enabled this provides
no new functional. When CONFIG_FAIL_ADD_DISK is disabled the new routine
blk_should_fail_add_disk() ends up being transformed to if (false), and
so the compiler should optimize these out as dead code producing no
new effective binary changes.

Failure injection lets us configure at boot how often we want a failure
to take place by specifying the interval, the probability, and when needed
a size constraint. We don't need to test for size constraints for
add_disk() and so ignore that part of error injection. Although testing
early boot failures with add_disk() failures might be useful we don't
to make add_disk() fail every time as otherwise we wouldn't be able to
boot. So enabling add_disk() error injection requires a second post
boot step where you specify where in the add_disk() code path you want
to enable failure injection for. This lets us verify correctness of
the different error handling parts of add_disk(), while also allowing
a respective blktests test to grow dynamically in case the add_disk()
paths grows.

We currently enable 11 code paths on add_disk() which can fail
and we can test for:

	# ls -1 /sys/kernel/debug/block/config_fail_add_disk/
	alloc_devt
	alloc_events
	bdi_register
	device_add
	disk_add_events
	get_queue
	integrity_add
	register_disk
	register_queue
	sysfs_bdi_link
	sysfs_depr_link

If you want to modify the configuration of fail_add_disk dynamically
at boot, you can enable CONFIG_FAULT_INJECTION_DEBUG_FS. If you've
enabled CONFIG_FAIL_ADD_DISK you will see these knobs:

	# ls -1 /sys/kernel/debug/block/fail_add_disk/
	interval
	probability
	space
	task-filter
	times
	verbose
	verbose_ratelimit_burst
	verbose_ratelimit_interval_ms

Suggested-by: Bart Van Assche <bvanassche@xxxxxxx>
Signed-off-by: Luis Chamberlain <mcgrof@xxxxxxxxxx>
---
  .../fault-injection/fault-injection.rst       | 23 ++++++++
  block/Makefile                                |  1 +
  block/blk-core.c                              |  1 +
  block/blk.h                                   | 55 ++++++++++++++++++
  block/failure-injection.c                     | 54 ++++++++++++++++++
  block/genhd.c                                 | 57 +++++++++++++++++++
  lib/Kconfig.debug                             | 13 +++++
  7 files changed, 204 insertions(+)
  create mode 100644 block/failure-injection.c

[ .. ]

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index d1467658361f..4fccc0fad190 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1917,6 +1917,19 @@ config FAULT_INJECTION_USERCOPY
  	  Provides fault-injection capability to inject failures
  	  in usercopy functions (copy_from_user(), get_user(), ...).
  
+config FAIL_ADD_DISK
+	bool "Fault-injection capability for add_disk() callers"
+	depends on FAULT_INJECTION && BLOCK
+	help
+	  Provide fault-injection capability for the add_disk() block layer
+	  call path. This allows the kernel to provide error injection when
+	  the add_disk() call is made. You would use something like blktests
+	  test against this or just load the null_blk driver. This only
+	  enables the error injection functionality. To use it you must
+	  configure which path you want to trigger on error on using debugfs
+	  under /sys/kernel/debug/block/config_fail_add_disk/. By default
+	  all of these are disabled.
+
  config FAIL_MAKE_REQUEST
  	bool "Fault-injection capability for disk IO"
  	depends on FAULT_INJECTION && BLOCK


Hmm. Not a fan of this approach.

Having to have a separate piece of code just to test individual 
functions, _and_ having to place hooks in the code to _simulate_ a 
failure seems rather fragile to me.

I would have vastly preferred if we could to this via generic tools like 
ebpf or livepatching.
Also I'm worried that this approach doesn't really scale; taken to 
extremes we would have to add duplicate calls to each and every function 
for full error injection, essentially double the size of the code just 
on the off-chance that someone wants to do error injection.

So I'd rather delegate the topic of error injection to a more general 
discussion (LSF springs to mind ...), and then agree on a framework 
which is suitable for every function.

Cheers,

Hannes
--
Dr. Hannes Reinecke                Kernel Storage Architect
hare@xxxxxxx                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer