Application stops due to ext4 filesystem IO error

Jens,

We are observing application stops while running ext4 filesystem IOs with
target resets issued in parallel.
We suspect this behavior can be attributed to the Linux block layer. See
below for details -

Problem statement - "Application stops due to an IO error from filesystem
buffered IO. (Note - it is always a FS metadata read failure.)"
Issue is reproducible - "Yes, it is consistently reproducible."
Brief about the setup -
Latest 4.11 kernel. The issue hits irrespective of whether SCSI MQ is
enabled or disabled; use_blk_mq=Y and use_blk_mq=N show the same issue.
Four direct-attached SAS/SATA drives connected to a MegaRAID Invader
controller.

Reproduction steps -
-Create an ext4 FS on 4 JBODs (non-RAID volumes) behind the MegaRAID SAS
controller.
-Start a data integrity test on all four ext4-mounted partitions. (The
tool should be configured to send buffered FS IO.)
-Send a Target Reset to each JBOD to simulate an error condition
(sg_reset -d /dev/sdX), with some delay between resets to allow some IO
to reach the device.

End result -
The combination of target resets and FS IOs in parallel causes an
application halt with an ext4 filesystem IO error.
We are able to restart the application without unmounting or cleaning the
filesystem.
Below are the error logs at the time of the application stop -

--------------------------
sd 0:0:53:0: target reset called for
scmd(ffff88003cf25148)
sd 0:0:53:0: attempting target reset!
scmd(ffff88003cf25148) tm_dev_handle 0xb
sd 0:0:53:0: [sde] tag#519 BRCM Debug: request->cmd_flags: 0x80700
bio->bi_flags: 0x2          bio->bi_opf: 0x3000 rq_flags 0x20e3
..
sd 0:0:53:0: [sde] tag#519 CDB: Read(10) 28 00 15 00 11 10 00 00 f8 00
EXT4-fs error (device sde): __ext4_get_inode_loc:4465: inode #11018287:
block 44040738: comm chaos: unable to read itable block
-----------------------

We debugged further to understand what is happening above the LLD. See
below -

During a target reset, there may be IO completed by the target with CHECK
CONDITION and the following sense information -
Sense Key : Aborted Command [current]
Add. Sense: No additional sense information

Such an aborted command should be retried by the SML/block layer, and the
SML does retry it, except for FS metadata reads.
From driver-level debugging, we found that IOs with the REQ_FAILFAST_DEV
bit set in scmd->request->cmd_flags are not retried by the SML, which is
also as expected.

Below is the code in scsi_error.c (function scsi_noretry_cmd) which causes
IOs with REQ_FAILFAST_DEV set to be completed back to the upper layer
instead of being retried -
--------
	/*
	 * assume caller has checked sense and determined
	 * the check condition was retryable.
	 */
	if (scmd->request->cmd_flags & REQ_FAILFAST_DEV ||
	    scmd->request->cmd_type == REQ_TYPE_BLOCK_PC)
		return 1;
	else
		return 0;
--------

1. The IO which causes the application to stop has REQ_FAILFAST_DEV set in
"scmd->request->cmd_flags". We noticed that this bit is set for filesystem
readahead metadata IOs. To confirm this, we mounted with the option
inode_readahead_blks=0 to disable ext4's inode-table readahead algorithm
and did not observe the issue. The issue also does not hit with direct
IOs, only with cached/buffered IOs.

2. From driver-level debug prints, we also noticed that many IO failures
with REQ_FAILFAST_DEV are handled gracefully by the filesystem. The
application-level failure happens only if the IO has RQF_MIXED_MERGE set.
If IO merging is disabled via sysfs for the SCSI device in question
(nomerges set to 2), we do not see the issue.

3. We added a few prints in the driver to dump "scmd->request->cmd_flags"
and "scmd->request->rq_flags" for IOs completed with CHECK CONDITION. The
culprit IOs have all of these bits: REQ_FAILFAST_DEV and REQ_RAHEAD set in
"scmd->request->cmd_flags", and RQF_MIXED_MERGE set in
"scmd->request->rq_flags". Not every IO with these three bits set causes
the issue, but whenever the issue hits, these three bits are set on the
failing IO.


In summary,
The FS mechanism of using readahead for metadata works fine (in case of IO
failure) if there is no mixed merge at the block layer.
The FS mechanism of using readahead for metadata has a corner case which
is not handled properly (in case of IO failure) if there was a mixed merge
at the block layer.
The megaraid_sas driver's behavior seems correct here: the aborted IO goes
to the SML with CHECK CONDITION set, and the SML decides to fail the IO
fast, as requested.

Query - Is this a block layer (page cache) issue? What would be the ideal
fix?

Thanks,
Sumit


