On Thu, Aug 3, 2017 at 10:40 PM, Avi Kivity <avi@xxxxxxxxxxxx> wrote:
> On 08/04/2017 01:09 AM, Dave Chinner wrote:
>>
>> On Thu, Aug 03, 2017 at 05:52:45PM +0300, Avi Kivity wrote:
>>>
>>> Hello,
>>>
>> Hi Avi,
>>
>>> I have an application that uses AIO+DIO to write data to a file on
>>> XFS. The writes use 128k buffers. Very rarely, I see aligned 4k
>>> blocks within the file that are zeroed. The blocks are not aligned
>>> to a 128k boundary, just 4k. The buffers are allocated in anonymous
>>> memory, which is usually using transparent hugepages. The files are
>>> fully allocated, not sparse (checked post-mortem).
>>
>> Did you check that the extents are written? i.e. there aren't
>> sporadic 4k unwritten extents in the file? (xfs_bmap -vvp output)
>
>
> Raphael did that, and the result was that the file was NOT sparse.
>
> btw, we also run with the extent size hint set to 32MB.
>
>> If you turn off transparent huge pages, does the problem go
>> away?
>
>
> We did not check yet.
>
>> What kernel version is this seen on? We've changed the XFS DIO
>> IO path implementation substantially in recent times....
>
>
> CentOS 7.2's kernel. Glauber, do you know the precise version string?

Yes I do, sir!
3.10.0-327.el7.x86_64

(Hey, Dave!)

>
>>> The writes are concurrent and adjacent. To avoid serialization, we
>>> ftruncate() the file to a larger size, then ftruncate() it back when
>>> we know its final size.
>>
>> So it's not extending the file on the writes, so it shouldn't be
>> triggering EOF block zeroing. The only thing I can think of is
>> either the data contains zeros or there's an occasional unwritten
>> extent in the file.
>
>
> The data is compressed, so it can't contain zeros originally. Of course it's
> possible the application zeroed that page after preparing the buffer and
> before the write hit the disk, but that's fairly unlikely. Zeroing pages is
> a kernel thing; even if the application allocated 4k of memory (not very
> common, but it does happen), it wouldn't zero it; and that buffer of course
> is held during the write.
>
> We're adding code to check the buffer before and after the write, and also
> read back from disk.
>
>>
>>> Does this trigger anything in anyone's mind?
>>
>> Nope - do you have a reproducer you can share?
>>
>
> Run a certain NoSQL database for months on a cluster with lots of activity,
> and you _may_ see it a few times. It's very rare, but it's there.
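
For reference, the check described above (scan the buffer before and after
the write, then read the range back from disk) could look roughly like the
sketch below. This is only an illustration under stated assumptions: the
names zeroed_blocks and checked_write are hypothetical and not taken from the
application, and it uses plain pwrite()/pread() instead of the AIO+DIO path,
so it shows the comparison logic rather than reproducing the I/O pattern.

```python
import os

BLOCK = 4096  # the zeroed regions reported in this thread are 4k-aligned


def zeroed_blocks(buf, block=BLOCK):
    """Return offsets (relative to buf) of block-aligned, all-zero blocks."""
    zero = bytes(block)
    view = memoryview(buf)  # slice without copying the whole buffer
    return [off
            for off in range(0, len(view) - block + 1, block)
            if view[off:off + block] == zero]


def checked_write(fd, buf, offset):
    """Write buf at offset and report zeroed blocks at each stage.

    Hypothetical helper: uses synchronous pwrite/pread (an assumption),
    and does not retry short writes or short reads.
    """
    before = zeroed_blocks(buf)           # buffer as prepared
    written = os.pwrite(fd, buf, offset)
    after = zeroed_blocks(buf)            # buffer after the write returned
    on_disk = zeroed_blocks(os.pread(fd, len(buf), offset))
    return {
        "written": written,
        "before": before,
        "after": after,
        "on_disk": on_disk,
    }
```

If "before" and "after" are empty but "on_disk" is not, the buffer was intact
in memory and the zeroes appeared somewhere below the application. If "after"
differs from "before", the buffer changed while the write was in flight.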