On 08/10/2017 08:25 AM, Jan Kara wrote: > On Thu 10-08-17 06:49:53, Goldwyn Rodrigues wrote: >> On 08/09/2017 09:17 PM, Jens Axboe wrote: >>> On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote: >>>>>>>>>> No, from a multi-device point of view, this is inconsistent. I >>>>>>>>>> have tried the request bio returns -EAGAIN before the split, but >>>>>>>>>> I shall check again. Where do you see this happening? >>>>>>>>> >>>>>>>>> No, this isn't multi-device specific, any driver can do it. >>>>>>>>> Please see blk_queue_split. >>>>>>>>> >>>>>>>> >>>>>>>> In that case, the bio end_io function is chained and the bio of >>>>>>>> the split will replicate the error to the parent (if not already >>>>>>>> set). >>>>>>> >>>>>>> this doesn't answer my question. So if a bio returns -EAGAIN, part >>>>>>> of the bio probably already dispatched to disk (if the bio is >>>>>>> splitted to 2 bios, one returns -EAGAIN, the other one doesn't >>>>>>> block and dispatch to disk), what will application be going to do? >>>>>>> I think this is different to other IO errors. FOr other IO errors, >>>>>>> application will handle the error, while we ask app to retry the >>>>>>> whole bio here and app doesn't know part of bio is already written >>>>>>> to disk. >>>>>> >>>>>> It is the same as for other I/O errors as well, such as EIO. You do >>>>>> not know which bio of all submitted bio's returned the error EIO. >>>>>> The application would and should consider the whole I/O as failed. >>>>>> >>>>>> The user application does not know of bios, or how it is going to be >>>>>> split in the underlying layers. It knows at the system call level. >>>>>> In this case, the EAGAIN will be returned to the user for the whole >>>>>> I/O not as a part of the I/O. It is up to application to try the I/O >>>>>> again with or without RWF_NOWAIT set. In direct I/O, it is bubbled >>>>>> out using dio->io_error. You can read about it at the patch header >>>>>> for the initial patchset at [1]. >>>>>> >>>>>> Use case: It is for applications having two threads, a compute >>>>>> thread and an I/O thread. It would try to push AIO as much as >>>>>> possible in the compute thread using RWF_NOWAIT, and if it fails, >>>>>> would pass it on to I/O thread which would perform without >>>>>> RWF_NOWAIT. End result if done right is you save on context switches >>>>>> and all the synchronization/messaging machinery to perform I/O. >>>>>> >>>>>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2 >>>>> >>>>> Yes, I knew the concept, but I didn't see previous patches mentioned >>>>> the -EAGAIN actually should be taken as a real IO error. This means a >>>>> lot to applications and make the API hard to use. I'm wondering if we >>>>> should disable bio split for NOWAIT bio, which will make the -EAGAIN >>>>> only mean 'try again'. >>>> >>>> Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say >>>> the API is hard to use? Do you have a case to back it up? >>> >>> Because it is hard to use, and potentially suboptimal. Let's say you're >>> doing a 1MB write, we hit EWOULDBLOCK for the last split. Do we return a >>> short write, or do we return EWOULDBLOCK? If the latter, then that >>> really sucks from an API point of view. >>> >>>> No, not splitting the bio does not make sense here. I do not see any >>>> advantage in it, unless you can present a case otherwise. >>> >>> It ties back into the "hard to use" that I do agree with IFF we don't >>> return the short write. It's hard for an application to use that >>> efficiently, if we write 1MB-128K but get EWOULDBLOCK, the re-write the >>> full 1MB from a different context. >>> >> >> It returns the error code only and not short reads/writes. But isn't >> that true for all system calls in case of error? >> >> For aio, there are two result fields in io_event out of which one could >> be used for error while the other be used for amount of writes/reads >> performed. However, only one is used. This will not work with >> pread()/pwrite() calls though because of the limitation of return values. >> >> Finally, what if the EWOULDBLOCK is returned for an earlier bio (say >> offset 128k) for a 1MB pwrite(), while the rest of the 7 128K are >> successful. What short return value should the system call return? > > This is indeed tricky. If an application submits 1MB write, I don't think > we can afford to just write arbitrary subset of it. That just IMHO too much > violates how writes traditionally behaved. Even short writes trigger bugs > in various applications but I'm willing to require that applications using > NOWAIT IO can handle these. However writing arbitrary subset looks like a > nasty catch. IMHO we should not submit further bios until we are sure > current one does not return EWOULDBLOCK when splitting a larger one... Exactly, that's the point that both Shaohua and I was getting at. Short writes should be fine, especially if NOWAIT is set. Discontig writes should also be OK, but it's horrible and inefficient. If we do that, then using this feature is a net-loss, not a win by any stretch. -- Jens Axboe