Re: [RFC] writev() semantics with invalid iovec in the middle

PAGE_SIZE isn't accurate on architectures that support multiple page
sizes, such as 8k, 64k, 512k, 4M, 32M and 256M on SPARC64; the same
applies to PPC64/Power.

Ced

On 16 September 2016 at 00:29, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Thu, Sep 15, 2016 at 06:23:24AM -0400, Mike Marshall wrote:
>> If you squeeze out every byte won't you still have a short
>> write? And the written data wouldn't be cut at the bad
>> place, but it would have a weird hole or discontinuity there.
>
> ???
>
> What I mean is that if we have an invalid address in the middle of a buffer
> (unmapped, for example), we do not attempt to write every byte prior to that
> invalid address.  Of course what we write is going to be contiguous.
>
> Suppose we have a buffer spanning 10 pages (amd64, so these are 4K ones) -
> 7 valid, 3 invalid:
>         VVVVIIIVVV
> and it starts 100 bytes into the first page.  And write goes into a regular
> file on e.g. tmpfs, starting at offset 31.  We _can't_ write more than
> 4*4096-100 bytes, no matter what.  It will be a short write.  As a matter
> of fact, it will be even shorter than that - it will be 3*4096-31 bytes,
> up to the last pagecache boundary we can cover completely.  That obviously
> depends upon the filesystem - not everything uses pagecache, for starters.
> However, the caller is *not* guaranteed that write() with an invalid page
> in the middle of a buffer would write everything up to the very beginning
> of the invalid page.  A short write will happen, but the amount written
> might be up to page size less than the actual length of valid part in the
> beginning of the buffer.
>
> Now, for writev() we could have invalid pages in any iovec; again, we
> obviously can't write anything past the first invalid page - we'll get
> either a short write or -EFAULT (if nothing got written).  That's fine;
> the question is what the caller can count upon wrt shortening.
>
> Again, we are *not* guaranteed writing up to exact boundary.  However, the
> current implementation will end up shortening no more than to the iovec
> boundary.  I.e. if the first iovec contains only valid pages and there's
> an invalid one in the second iovec, the current implementation will write
> at least everything in the first iovec.  That's _not_ promised by POSIX
> or our manpages; moreover, I'm not sure if it's even true for each filesystem.
> And keeping that property is actually inconvenient - if we could discard it,
> we could make partial-copy ->write_end() calls a lot more infrequent.
>
> Unfortunately, some of the LTP writev tests end up checking that writev()
> does behave that way - they feed it a three-element iovec with
> shorter-than-page segments, the second of which is all invalid.  And they
> check that the entire first segment has been written.
>
> I would really like to drop that property, making it "if some addresses
> in the buffer(s) we are asked to write are invalid, the write will be
> shortened by up to a PAGE_SIZE from the first such invalid address", making
> writev() rules exactly the same as write() ones.  Does anybody have objections
> to it?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Cedric Blancher <cedric.blancher@xxxxxxxxx>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur