On Fri, Dec 11, 2020 at 10:50:44AM +0800, JeffleXu wrote: > Excellent explanation! Thanks a lot. > > Still some questions below. > > On 12/10/20 1:18 PM, Dave Chinner wrote: > > On Thu, Dec 10, 2020 at 09:55:32AM +0800, JeffleXu wrote: > >> Sorry I'm still a little confused. > >> > >> > >> On 12/10/20 5:23 AM, Dave Chinner wrote: > >>> On Tue, Dec 08, 2020 at 01:46:47PM +0800, JeffleXu wrote: > >>>> > >>>> > >>>> On 12/7/20 10:21 AM, Dave Chinner wrote: > >>>>> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote: > >>>>>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set > >>>>>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when > >>>>>> IOCB_NOWAIT is set. > >>>>>> > >>>>>> Suggested-by: Jeffle Xu <jefflexu@xxxxxxxxxxxxxxxxx> > >>>>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> > >>>>>> Signed-off-by: Hao Xu <haoxu@xxxxxxxxxxxxxxxxx> > >>>>>> --- > >>>>>> > >>>>>> Hi all, > >>>>>> I tested fio io_uring direct read for a file on ext4 filesystem on a > >>>>>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which > >>>>>> means REQ_NOWAIT is not set in bio->bi_opf. > >>>>> > >>>>> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the > >>>>> filesystem behaviour, not the block device. > >>>>> > >>>>> REQ_NOWAIT can result in partial IO failures because the error is > >>>>> only reported to the iomap layer via IO completions. Hence we can > >>>>> split a DIO into multiple bios and have random bios in that IO fail > >>>>> with EAGAIN because REQ_NOWAIT is set. This error will > >>>>> get reported to the submitter via completion, and it will override > >>>>> any of the partial IOs that actually completed. > >>>>> > >>>>> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug > >>>>> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow > >>>>> NOWAIT DIO across extent boundaries") we'll get silent partial > >>>>> writes occurring because the second submitted bio in an IO can > >>>>> trigger EAGAIN errors with partial IO completion having already > >>>>> occurred. > >>>>> > >> > >>>>> Further, we don't allow partial IO completion for DIO on XFS at all. > >>>>> DIO must be completely submitted and completed or return an error > >>>>> without having issued any IO at all. Hence using REQ_NOWAIT for > >>>>> DIO bios is incorrect and not desirable. > >> > >> > >> The current block layer implementation causes that, as long as one split > >> bio fails, then the whole DIO fails, in which case several split bios > >> maybe have succeeded and the content has been written to the disk. This > >> is obviously what you called "partial IO completion". > >> > >> I'm just concerned on how do you achieve that "DIO must return an error > >> without having issued any IO at all". Do you have some method of > >> reverting the content has already been written into the disk when a > >> partial error happened? > > > > I think you've misunderstood what I was saying. I did not say > > "DIO must return an error without having issued any IO at all". > > There are two parts to my statement, and you just smashed part of > > the first statement into part of the second statement and came up > > something I didn't actually say. > > > > The first statement is: > > > > 1. "DIO must be fully submitted and completed ...." > > > > That is, if we need to break an IO up into multiple parts, the > > entire IO must be submitted and completed as a whole. We do not > > allow partial submission or completion of the IO at all because we > > cannot accurately report what parts of a multi-bio DIO that failed > > through the completion interface. IOWs, if any of the IOs after the > > first one fail submission, then we must complete all the IOs that > > have already been submitted before we can report the failure that > > occurred during the IO. > > > > 1. Actually I'm quite not clear on what you called "partial submission > or completion". Even when REQ_NOWAIT is set for all split bios of one > DIO, then all these split bios are **submitted** to block layer through > submit_bio(). Even when one split bio after the first one failed > **inside** submit_bio() because of REQ_NOWAIT, submit_bio() only returns > BLK_QC_T_NONE, and the DIO layer (such as __blkdev_direct_IO()) will > still call submit_bio() for the remaining split bios. And then only when > all split bios complete, will the whole kiocb complete. Yes, so what you have there is complete submission and completion. i.e. what I described as #1. > So if you define "submission" as submitting to hardware disk (such as > nvme device driver), then it is indeed **partial submission** when > REQ_NOWAIT set. But if the definition of "submission" is actually > "submitting to block layer by calling submit_bio()", then all split bios > of one DIO are indeed submitted to block layer, even when one split bio > gets BLK_STS_AGAIN because of REQ_NOWIAT. The latter is the definition we using in iomap, because at this layer we are the ones submitting the bios and defining the bounds of the IO. What happens in the lower layers does not concern us here. Partial submission can occur at the iomap layer if IOCB_NOWAIT is set - that was what 883a790a8440 fixed. i.e. we do loop for all mapped extents { loop for mapped range { submit_bio(subrange) } } So if we have an extent range like so: start end +-----------+---------+------+-----------+ ext 1 ext 2 ext 3 ext 4 We actually have to loop above 4 times to map the 4 extents that span the user IO range. That means we need to submit at least 4 independent bios to run this IO. So if we submit all the bios in a first mapped range (ext 1), then go back to the filesystem to get the map for the next range to submit and it returns -EAGAIN, we break out of the outer loop without having completely submitted bios for the range spanning ext 2, 3 or 4. That's partial IO submission as the *iomap layer* defines it. And the "silent" part of the "partial write" it came from the fact this propagated -EAGAIN to the user despite the fact that some IO was actually successfully committed and completed. IOws, we may have corrupted data on disk by telling the user nothing was done with -EAGAIN rather than failing the partial IO with -EIO. This is the problem commit 883a790a8440 fixed. > 2. One DIO could be split into multiple bios in DIO layer. Similarly one > big bio could be split into multiple bios in block layer. In the > situation when all split bios have already been submitted to block > layer, since the IO completion interface could return only one error > code, the whole DIO could fail as long as one split bio fails, while > several other split bios could have already succeeded and the > corresponding disk sectors have already been overwritten. Yup. That can happen. That's a data corruption for which prevention is the responsibility of the layer doing the splitting and recombining. The iomap layer knows nothing of this - this situation needs to result in the entire bio that was split being marked with an EIO error because we do not know what parts of the bio made it to disk or not. IOWs, the result of the IO is undefined, and so we need to return EIO to the user. This "data is inconsistent until all sub-devices complete their IO successfully" situation is also known as the "RAID 5 write hole". Crash while these IOs are in flight, and some might hit the disk and some might not. When the system comes back up, you've got no idea which part of that IO was completed or not, so the data in the entire stripe is corrupt. > I'm not sure > if this is what you called "silent partial writes", and if this is the > core reason for commit 883a790a8440 ("xfs: don't allow NOWAIT DIO across > extent boundaries") you mentioned in previous mail. No, it's not the same thing. This "split IO failed" should never result in a silent partial write - it should always result in EIO. > If this is the case, then it seems that the IO completion interface > should be blamed for this issue. Besides REQ_NOWIAT, there may be more > situations that will cause "silent partial writes". Yes, there are many potential causes of this behaviour. They are all considered to be a data corruption, and so in all cases they should be returning EIO as the completion status to the submitter of the bio. > As long as one split bios could fail (besides EAGAIN, maybe EOPNOTSUPP, > if possible), while other split bios may still succeeds, then the error > of one split bio will still override the completion status of the whole > DIO. If you get EOPNOTSUPP from on device in bio split across many devices, then you have a bug in the layer that is splitting the IO. All the sub-devices need to support the operation, or the top level device cannot safely support that operation. And if one device didn't do the opreation, then we have inconsistent data on disk now and so EIO is the result that should be returned to the user. > In this case "silent partial writes" is still possible? In my > understanding, passing REQ_NOWAIT flags from IOCB_NOWAIT in DIO layer > only makes the problem more likely to happen, but it's not the only > source of this problem? Passing REQ_NOWAIT from the iomap dio layer means that we will see data corruptions in stable storage simply due to high IO load, rather than the much, much rarer situations of software bugs or actual hardware errors. IOWs, REQ_NOWAIT essentially guarantees data corruption in stable storage of a perfectly working storage stack will occur and there is nothing the application can do to prevent that occurring. The worst part of this is that REQ_NOWAIT also requires the application to detect and correct the corruption caused by the kernel using REQ_NOWAIT inappropriately. And if the application crashes before it can correct the data corruption, it's suddenly a permanent, uncorrectable data corruption. REQ_NOWAIT is dangerous and can easily cause harm to user data by aborting parts of submitted IOs. IOWs, it's more like a random IO error injection mechanism than a useful method of doing safe non-blocking IO. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx