On Fri, Aug 4, 2017 at 12:36 AM, Avi Kivity <avi@xxxxxxxxxxxx> wrote:
> On 08/04/2017 06:14 AM, Dave Chinner wrote:
>>
>> On Fri, Aug 04, 2017 at 05:40:07AM +0300, Avi Kivity wrote:
>>>
>>> On 08/04/2017 01:09 AM, Dave Chinner wrote:
>>>>
>>>> On Thu, Aug 03, 2017 at 05:52:45PM +0300, Avi Kivity wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>> Hi Avi,
>>>>
>>>>> I have an application that uses AIO+DIO to write data to a file on
>>>>> XFS. The writes use 128k buffers. Very rarely, I see aligned 4k
>>>>> blocks within the file that are zeroed. The blocks are not aligned
>>>>> to a 128k boundary, just 4k. The buffers are allocated in anonymous
>>>>> memory, which is usually using transparent hugepages. The files are
>>>>> fully allocated, not sparse (checked post-mortem).
>>>>
>>>> Did you check that the extents are written? i.e. there aren't
>>>> sporadic 4k unwritten extents in the file? (xfs_bmap -vvp output)
>>>
>>> Raphael did that, and the result was that the file was NOT sparse.
>>
>> Sure, but a file with unwritten extents is not sparse. It's just got
>> extents that will always read as zeros. The extra "-vvp" output
>> tells you the unwritten flag state and does not merge contiguous
>> extents that differ only in state.
>
> Ah, thanks for the explanation. Raphael, can you check this?

Hi, everyone. All extents have the flag 01111, which, if I understand
correctly, means they are everything but unwritten.

I was curious whether there might still be an unknown bug somewhat
related to this one:
http://oss.sgi.com/archives/xfs/2015-04/msg00159.html. We no longer
submit size-changing ops in parallel, though; they are now serialized.
I checked that the kernel of the system which reproduced this issue
contains the aforementioned fix.
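For illustration, the post-mortem check described above (scanning a fully allocated file for 4k-aligned blocks that read back as all zeros) can be sketched roughly like this. This is a minimal sketch, not the tool actually used; `find_zeroed_blocks` is a hypothetical helper name:

```python
import os

BLOCK = 4096  # the zeroed regions observed were 4k-aligned 4k blocks

def find_zeroed_blocks(path: str) -> list[int]:
    """Return byte offsets of 4k-aligned blocks that read back as all zeros."""
    zero = bytes(BLOCK)
    offsets = []
    with open(path, "rb") as f:
        off = 0
        while True:
            chunk = f.read(BLOCK)
            if not chunk:
                break
            # A short tail chunk is compared against a zero buffer of its own size.
            if chunk == zero[:len(chunk)]:
                offsets.append(off)
            off += len(chunk)
    return offsets
```

Note this cannot distinguish a written block that happens to contain zeros from an unwritten extent returning zeros; only `xfs_bmap -vvp` can tell those apart.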
>
>
>> i.e:
>>
>> $ sudo xfs_io -fd -c "falloc 0 1M" -c "pwrite 900k 200k" /mnt/scratch/foo
>> wrote 204800/204800 bytes at offset 921600
>> 200 KiB, 50 ops; 0.0000 sec (13.838 MiB/sec and 3542.5818 ops/sec)
>> $ sudo xfs_bmap /mnt/scratch/foo
>> /mnt/scratch/foo:
>>         0: [0..2199]: 160..2359
>>
>> Looks fully allocated. However:
>>
>> $ sudo xfs_bmap -vvp /mnt/scratch/foo
>> /mnt/scratch/foo:
>>  EXT: FILE-OFFSET    BLOCK-RANGE    AG AG-OFFSET      TOTAL FLAGS
>>    0: [0..1799]:     160..1959       0 (160..1959)     1800 010000
>>    1: [1800..2199]:  1960..2359      0 (1960..2359)     400 000000
>>  FLAG Values:
>>     0100000 Shared extent
>>     0010000 Unwritten preallocated extent
>>     0001000 Doesn't begin on stripe unit
>>     0000100 Doesn't end on stripe unit
>>     0000010 Doesn't begin on stripe width
>>     0000001 Doesn't end on stripe width
>> $
>>
>> The first 900k of the file is an unwritten extent, which returns
>> zeros...
>>
>>> btw, we also run with the extent size hint set to 32MB.
>>
>> Which means that space is definitely being allocated as unwritten
>> extents, then overwritten and converted on IO completion. Hence if
>> the overwrite is not complete, or there's a bug in the unwritten
>> extent conversion, it may leave unwritten extents where it
>> shouldn't....
>>
>>>> What kernel version is this seen on? We've changed the XFS DIO
>>>> IO path implementation substantially in recent times....
>>>
>>> CentOS 7.2's kernel. Glauber, do you know the precise version string?
>>
>> Can you reproduce on an upstream kernel? Problems with highly
>> patched distro kernels really need to be directed to the distro...
>
> This is a production cluster, and we've only seen the problem in this
> one cluster, and _very_ rarely there.
>
>>>>> Does this trigger anything in anyone's mind?
>>>>
>>>> Nope - do you have a reproducer you can share?
>>>>
>>> Run a certain NoSQL database for months on a cluster with lots of
>>> activity, and you _may_ see it a few times. It's very rare, but it's
>>> there.
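To make the flag values quoted above concrete, here is a toy decoder for the FLAGS column of `xfs_bmap -vvp` output, built directly from Dave's legend. This is an illustrative sketch, not part of xfsprogs; `decode_flags` is a hypothetical helper name:

```python
# Positions follow the legend quoted in the thread: each FLAGS value is a
# digit string, right-aligned to 7 places, with a 1 marking the set flag.
FLAG_LEGEND = {
    1: "Shared extent",
    2: "Unwritten preallocated extent",
    3: "Doesn't begin on stripe unit",
    4: "Doesn't end on stripe unit",
    5: "Doesn't begin on stripe width",
    6: "Doesn't end on stripe width",
}

def decode_flags(flags: str) -> list[str]:
    """Pad a FLAGS string to 7 digits and list which flags are set."""
    padded = flags.zfill(7)
    return [name for pos, name in FLAG_LEGEND.items() if padded[pos] == "1"]

# Extent 0 in Dave's example (010000) decodes as unwritten, so it reads
# back as zeros; Raphael's 01111 decodes as only the four stripe-alignment
# flags, i.e. NOT unwritten.
print(decode_flags("010000"))
print(decode_flags("01111"))
```

This matches Raphael's reading: 01111 sets everything except the shared and unwritten bits, so his extents were all written.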
>>
>> Needle in a haystack, then - the problem could be anywhere in the
>> storage stack, including hardware.
>
> Yes, unfortunately.
>
>> You're going to need to
>> isolate the problem to the filesystem for us, which means a
>> reproducer script of some kind...
>
> It's very unlikely we'll find a simple reproducer; this email was more
> to see if the list has seen this problem before than to serve as a
> detailed bug report.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html