On 08/10/2017 09:28 AM, Jens Axboe wrote:
> On 08/10/2017 08:25 AM, Jan Kara wrote:
>> On Thu 10-08-17 06:49:53, Goldwyn Rodrigues wrote:
>>> On 08/09/2017 09:17 PM, Jens Axboe wrote:
>>>> On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote:
>>>>>>>>>>> No, from a multi-device point of view, this is inconsistent. I
>>>>>>>>>>> have tried the request bio returns -EAGAIN before the split, but
>>>>>>>>>>> I shall check again. Where do you see this happening?
>>>>>>>>>>
>>>>>>>>>> No, this isn't multi-device specific, any driver can do it.
>>>>>>>>>> Please see blk_queue_split.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In that case, the bio end_io function is chained and the bio of
>>>>>>>>> the split will replicate the error to the parent (if not already
>>>>>>>>> set).
>>>>>>>>
>>>>>>>> This doesn't answer my question. So if a bio returns -EAGAIN, part
>>>>>>>> of the bio was probably already dispatched to disk (if the bio is
>>>>>>>> split into 2 bios, one returns -EAGAIN, the other one doesn't
>>>>>>>> block and is dispatched to disk). What is the application going to
>>>>>>>> do? I think this is different from other IO errors. For other IO
>>>>>>>> errors, the application will handle the error, while here we ask
>>>>>>>> the app to retry the whole bio, and the app doesn't know part of
>>>>>>>> the bio is already written to disk.
>>>>>>>
>>>>>>> It is the same as for other I/O errors as well, such as EIO. You do
>>>>>>> not know which bio of all submitted bios returned the error EIO.
>>>>>>> The application would and should consider the whole I/O as failed.
>>>>>>>
>>>>>>> The user application does not know of bios, or how it is going to be
>>>>>>> split in the underlying layers. It knows at the system call level.
>>>>>>> In this case, the EAGAIN will be returned to the user for the whole
>>>>>>> I/O, not as a part of the I/O. It is up to the application to try
>>>>>>> the I/O again with or without RWF_NOWAIT set. In direct I/O, it is
>>>>>>> bubbled out using dio->io_error.
>>>>>>> You can read about it at the patch header
>>>>>>> for the initial patchset at [1].
>>>>>>>
>>>>>>> Use case: It is for applications having two threads, a compute
>>>>>>> thread and an I/O thread. It would try to push AIO as much as
>>>>>>> possible in the compute thread using RWF_NOWAIT, and if it fails,
>>>>>>> would pass it on to the I/O thread, which would perform it without
>>>>>>> RWF_NOWAIT. End result if done right is you save on context switches
>>>>>>> and all the synchronization/messaging machinery to perform I/O.
>>>>>>>
>>>>>>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2
>>>>>>
>>>>>> Yes, I knew the concept, but I didn't see previous patches mention
>>>>>> that -EAGAIN should actually be taken as a real IO error. This means
>>>>>> a lot to applications and makes the API hard to use. I'm wondering if
>>>>>> we should disable bio split for NOWAIT bio, which will make -EAGAIN
>>>>>> only mean 'try again'.
>>>>>
>>>>> Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say
>>>>> the API is hard to use? Do you have a case to back it up?
>>>>
>>>> Because it is hard to use, and potentially suboptimal. Let's say you're
>>>> doing a 1MB write, we hit EWOULDBLOCK for the last split. Do we return a
>>>> short write, or do we return EWOULDBLOCK? If the latter, then that
>>>> really sucks from an API point of view.
>>>>
>>>>> No, not splitting the bio does not make sense here. I do not see any
>>>>> advantage in it, unless you can present a case otherwise.
>>>>
>>>> It ties back into the "hard to use" that I do agree with IFF we don't
>>>> return the short write. It's hard for an application to use that
>>>> efficiently, if we write 1MB-128K but get EWOULDBLOCK, then re-write
>>>> the full 1MB from a different context.
>>>>
>>>
>>> It returns the error code only and not short reads/writes. But isn't
>>> that true for all system calls in case of error?
>>>
>>> For aio, there are two result fields in io_event, out of which one could
>>> be used for the error while the other is used for the amount of
>>> writes/reads performed. However, only one is used. This will not work
>>> with pread()/pwrite() calls though because of the limitation of return
>>> values.
>>>
>>> Finally, what if the EWOULDBLOCK is returned for an earlier bio (say
>>> offset 128k) for a 1MB pwrite(), while the rest of the 7 128K are
>>> successful? What short return value should the system call return?
>>
>> This is indeed tricky. If an application submits a 1MB write, I don't
>> think we can afford to just write an arbitrary subset of it. That IMHO
>> violates too much how writes traditionally behaved. Even short writes
>> trigger bugs in various applications but I'm willing to require that
>> applications using NOWAIT IO can handle these. However writing an
>> arbitrary subset looks like a nasty catch. IMHO we should not submit
>> further bios until we are sure the current one does not return
>> EWOULDBLOCK when splitting a larger one...
>
> Exactly, that's the point that both Shaohua and I were getting at. Short
> writes should be fine, especially if NOWAIT is set. Discontig writes
> should also be OK, but it's horrible and inefficient. If we do that,
> then using this feature is a net-loss, not a win by any stretch.
>

To make sure I understand this: we disable bio splits for NOWAIT bios, so
we return EWOULDBLOCK for the entire I/O?

--
Goldwyn
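
For illustration only, here is a minimal user-space sketch in C of the
"do not submit further bios once one would block" behaviour Jan describes
above, which yields either a contiguous short write or EWOULDBLOCK. It is
a model under stated assumptions, not kernel code: nowait_submit_model(),
would_block_fn and block_at_offset() are hypothetical names invented for
this example, and no real bio is involved.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical predicate: would the chunk [offset, offset + len) block? */
typedef bool (*would_block_fn)(size_t offset, size_t len, void *ctx);

/*
 * Model of splitting a NOWAIT write into chunks and stopping at the
 * first chunk that would block. Returns the number of contiguous bytes
 * submitted from the start (a short write), or -EWOULDBLOCK if even the
 * first chunk would block. Later chunks are never submitted, so no
 * discontiguous subset is written.
 */
ssize_t nowait_submit_model(size_t total, size_t split_size,
                            would_block_fn would_block, void *ctx)
{
    assert(split_size > 0);

    size_t done = 0;
    while (done < total) {
        size_t left = total - done;
        size_t len = left < split_size ? left : split_size;

        if (would_block(done, len, ctx))
            break;          /* stop here, do not touch later chunks */
        done += len;
    }

    if (done == 0 && total > 0)
        return -EWOULDBLOCK;
    return (ssize_t)done;
}

/* Example predicate: ctx points to one byte offset that would block. */
bool block_at_offset(size_t offset, size_t len, void *ctx)
{
    size_t blocked = *(const size_t *)ctx;
    return blocked >= offset && blocked < offset + len;
}
```

With a 1MB write split into 128K chunks, a block on the last split gives
a short write of 1MB-128K in this model, a block at offset 128k gives a
short write of 128K, and a block on the first split gives -EWOULDBLOCK
for the whole I/O. Disabling the split for NOWAIT bios corresponds to
calling it with split_size equal to total, so the result is all or
nothing.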