On 08/10/2017 05:49 AM, Goldwyn Rodrigues wrote:
>
>
> On 08/09/2017 09:17 PM, Jens Axboe wrote:
>> On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote:
>>>>>>>>> No, from a multi-device point of view, this is inconsistent. I
>>>>>>>>> have tried the request bio returns -EAGAIN before the split, but
>>>>>>>>> I shall check again. Where do you see this happening?
>>>>>>>>
>>>>>>>> No, this isn't multi-device specific, any driver can do it.
>>>>>>>> Please see blk_queue_split.
>>>>>>>>
>>>>>>>
>>>>>>> In that case, the bio end_io function is chained and the bio of
>>>>>>> the split will replicate the error to the parent (if not already
>>>>>>> set).
>>>>>>
>>>>>> this doesn't answer my question. So if a bio returns -EAGAIN, part
>>>>>> of the bio probably already dispatched to disk (if the bio is
>>>>>> splitted to 2 bios, one returns -EAGAIN, the other one doesn't
>>>>>> block and dispatch to disk), what will application be going to do?
>>>>>> I think this is different to other IO errors. FOr other IO errors,
>>>>>> application will handle the error, while we ask app to retry the
>>>>>> whole bio here and app doesn't know part of bio is already written
>>>>>> to disk.
>>>>>
>>>>> It is the same as for other I/O errors as well, such as EIO. You do
>>>>> not know which bio of all submitted bio's returned the error EIO.
>>>>> The application would and should consider the whole I/O as failed.
>>>>>
>>>>> The user application does not know of bios, or how it is going to be
>>>>> split in the underlying layers. It knows at the system call level.
>>>>> In this case, the EAGAIN will be returned to the user for the whole
>>>>> I/O not as a part of the I/O. It is up to application to try the I/O
>>>>> again with or without RWF_NOWAIT set. In direct I/O, it is bubbled
>>>>> out using dio->io_error. You can read about it at the patch header
>>>>> for the initial patchset at [1].
>>>>>
>>>>> Use case: It is for applications having two threads, a compute
>>>>> thread and an I/O thread. It would try to push AIO as much as
>>>>> possible in the compute thread using RWF_NOWAIT, and if it fails,
>>>>> would pass it on to I/O thread which would perform without
>>>>> RWF_NOWAIT. End result if done right is you save on context switches
>>>>> and all the synchronization/messaging machinery to perform I/O.
>>>>>
>>>>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2
>>>>
>>>> Yes, I knew the concept, but I didn't see previous patches mentioned
>>>> the -EAGAIN actually should be taken as a real IO error. This means a
>>>> lot to applications and make the API hard to use. I'm wondering if we
>>>> should disable bio split for NOWAIT bio, which will make the -EAGAIN
>>>> only mean 'try again'.
>>>
>>> Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say
>>> the API is hard to use? Do you have a case to back it up?
>>
>> Because it is hard to use, and potentially suboptimal. Let's say you're
>> doing a 1MB write, we hit EWOULDBLOCK for the last split. Do we return a
>> short write, or do we return EWOULDBLOCK? If the latter, then that
>> really sucks from an API point of view.
>>
>>> No, not splitting the bio does not make sense here. I do not see any
>>> advantage in it, unless you can present a case otherwise.
>>
>> It ties back into the "hard to use" that I do agree with IFF we don't
>> return the short write. It's hard for an application to use that
>> efficiently, if we write 1MB-128K but get EWOULDBLOCK, the re-write the
>> full 1MB from a different context.
>>
>
> It returns the error code only and not short reads/writes. But isn't
> that true for all system calls in case of error?

It's not a hard error. If you wrote 896K in the example above, I'd
really expect the return value to be 896*1024. The API is hard to use
efficiently, if that's not the case.
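
To make the efficiency argument concrete, here is a minimal Python
sketch of what the application side could look like if a short count is
returned. It is an illustration only, not code from the patchset:
plan_fallback() is a hypothetical helper, and nowait_result stands in
for whatever the RWF_NOWAIT attempt returned (a byte count, or a
negative errno in the kernel's convention).

```python
import errno


def plan_fallback(offset, length, nowait_result):
    """Return the (offset, length) range the blocking I/O thread still has
    to write after a non-blocking attempt, or None if nothing is left.

    nowait_result is an assumption of this sketch: a byte count (possibly
    short) or a negative errno such as -errno.EAGAIN.
    """
    if nowait_result == -errno.EAGAIN:
        # No progress was reported, so the whole range must be resubmitted.
        return (offset, length)
    if nowait_result < 0:
        # Any other negative value is treated as a hard error.
        raise OSError(-nowait_result, "write failed")
    if nowait_result >= length:
        # Everything was written without blocking.
        return None
    # Short write: only the tail after the written prefix is left.
    return (offset + nowait_result, length - nowait_result)
```

With a short count of 896*1024, the I/O thread only has to redo the
last 128K. With a bare -EAGAIN, it has to redo the full 1MB.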

> For aio, there are two result fields in io_event out of which one could
> be used for error while the other be used for amount of writes/reads
> performed. However, only one is used. This will not work with
> pread()/pwrite() calls though because of the limitation of return values.

Don't invent something new for this; the mechanism already exists for
returning a short read or write. That's how all of them have worked for
decades.

> Finally, what if the EWOULDBLOCK is returned for an earlier bio (say
> offset 128k) for a 1MB pwrite(), while the rest of the 7 128K are
> successful. What short return value should the system call return?

It should return 128*1024, since that's how much was successfully done
from the start offset. But yes, this is exactly the point that I brought
up, and why Shaohua's suggestion to perhaps treat splits differently
should not be discarded so quickly.

--
Jens Axboe
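
The return-value rule for a split 1MB write (count only what completed
contiguously from the start offset) can be sketched as follows. This is
a toy model in Python, not the kernel's implementation:
short_write_result() and its arguments are hypothetical names used for
illustration, and each entry in split_ok stands for one 128K split of
the request, in offset order.

```python
import errno


def short_write_result(split_ok, split_size):
    """Model the syscall return value for a write broken into equal splits.

    split_ok lists, in offset order, whether each split completed (True)
    or would have blocked (False). Only the contiguous prefix counts,
    because a short write says "this many bytes from the start offset".
    """
    done = 0
    for ok in split_ok:
        if not ok:
            # Later splits may have reached the disk, but they cannot be
            # reported past a hole in the range.
            break
        done += split_size
    # With no progress at all, report the would-block condition instead.
    return done if done > 0 else -errno.EAGAIN
```

For eight 128K splits, a failure in the second split gives 128*1024, a
failure in the last split gives 896*1024, and a failure in the first
split gives -EAGAIN.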