On Jun 23, 2017, at 6:26 AM, Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> On Fri, Jun 23, 2017 at 12:33:02AM -0600, Andreas Dilger wrote:
>> On Jun 22, 2017, at 22:43, Theodore Ts'o <tytso@xxxxxxx> wrote:
>>>
>>>> On Thu, Jun 22, 2017 at 04:23:07PM -0700, Khazhismel Kumykov wrote:
>>>> Previously, a read error would be ignored and we would eventually return
>>>> NULL from ext4_find_entry, which signals "no such file or directory". We
>>>> should be returning EIO.
>>>>
>>>> Signed-off-by: Khazhismel Kumykov <khazhy@xxxxxxxxxx>
>>>
>>> Thanks, applied.
>>
>> I don't necessarily agree that this is an improvement.
>>
>> If the requested entry is not in the bad block, this will return an
>> error even if the file name could be found in another block.  It
>> would be better to save the error until the end and only return -EIO
>> if the entry cannot be found.
>
> The problem is that if we continue, successive reads may all take
> seconds or minutes to fail, thus tying up the process for a long
> time.

Sorry, I don't understand where the seconds or minutes of delay come
from.  Is that because of long SCSI retries in the block layer, or in
the disk itself, or something caused specifically by this code?

This is in the non-htree lookup path, so it is already doing a linear
directory scan to find the entry.  If this is a non-htree directory
because the directory is only a single block in size, then saving the
EIO to return at "the end" is identical to returning it immediately.

If it is using this codepath because it is a large non-htree directory,
then there is a real chance that the entry can be found in another
block, and the patch will cause the application to fail trying to find
the file when it doesn't need to.

Since this is after EXT4_ERROR_INODE(), the filesystem must be mounted
with "errors=continue", so I'd think in that case the filesystem should
try to continue in the face of errors.
> If this process happens to be, say, the node's Kubernetes
> management server it can take down the entire node (since if there is
> a watchdog daemon which assumes that if the management server is down,
> it's time to kill and reset the entire node), and the file system is,
> say, a network file system error which should only kill the individual
> job, and not the entire node, the results could be quite unfortunate.

Sorry, I don't understand your example.  It isn't clear where the error
is coming from.  Is this a media error?  I don't see how this could be
a network filesystem error.  To my reading, the patch removing the
ability to skip a bad directory leaf block would _induce_ an error
rather than avoid it.

> By returning EIO right away, we can "fast fail".

But it seems like you don't necessarily need to fail at all?  Something
like the following would return an error if the entry is not found, but
still search the rest of the leaf blocks (if any) before giving up:

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index b81f7d4..c75b6dc 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1434,6 +1434,8 @@ static struct buffer_head * ext4_find_entry (struct inode *dir,
 			EXT4_ERROR_INODE(dir, "reading directory lblock %lu",
 					 (unsigned long) block);
 			brelse(bh);
+			/* if entry isn't found in a later block, return an error */
+			ret = ERR_PTR(-EIO);
 			goto next;
 		}
 		if (!buffer_verified(bh) &&

Cheers, Andreas