> -----Original Message-----
> From: Nick Neumann [mailto:nick@xxxxxxxxxxxxxxxx]
> Sent: Wednesday, September 7, 2022 11:58 AM
> To: fio@xxxxxxxxxxxxxxx
> Subject: Best practices for handling drive failures during a run?
>
> I was wondering if there were any recommendations/suggestions on
> handling drive failures during a fio run. I hit one yesterday with a
> 60-second mixed-use test on an SSD. 51 seconds in, the drive basically
> stopped responding. (A separate program that periodically calls
> smartctl to get drive state also showed something was up, as data like
> temperature was missing.)
>
> At 107 seconds, a read completed, and fio exited.
>
> It made me wonder what would have happened if the test was not time
> limited - e.g., a full drive write. Would it have just hung, waiting
> forever? Or would the OS eventually have gotten back to fio and told it
> the submitted operations had failed, so that fio would exit?
>
> Any ideas on ways to test the behavior, or areas of the code to look at?

The null_blk device supports error injection via the badblocks configfs
attribute, so you could use it for testing. There is a guide to setting
up null_blk devices via configfs at
https://zonedstorage.io/docs/getting-started/nullblk

Vincent
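For illustration, here is a rough sketch of the setup, assuming root, a
kernel built with null_blk (CONFIG_BLK_DEV_NULL_BLK), and the configfs
interface described in the guide above; the device name "nullb0" and the
exact attribute values are examples, not requirements:

```shell
# Load null_blk with no pre-created devices so configfs controls them.
modprobe null_blk nr_devices=0

# Create and configure a device node in configfs.
mkdir /sys/kernel/config/nullb/nullb0
cd /sys/kernel/config/nullb/nullb0

echo 1    > memory_backed   # back the device with RAM so writes stick
echo 1024 > size            # device size (MB)
echo 4096 > blocksize       # block size (bytes)

# Inject errors: "+START-END" marks that sector range bad, so I/O
# touching it fails with an error; "-START-END" clears a range again.
echo "+0-1023" > badblocks

# Bring the device online as /dev/nullb0.
echo 1 > power
```

After that you could point fio at /dev/nullb0 (e.g. a randread job with
--direct=1) and observe how it reacts when requests hit the bad range.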