Re: [LSF/MM TOPIC] end-to-end data and metadata corruption detection

On Jan 31, 2012, at 2:16 PM, Bernd Schubert wrote:

> On 01/27/2012 12:21 AM, James Bottomley wrote:
>> On Thu, 2012-01-26 at 17:27 +0100, Bernd Schubert wrote:
>>> On 01/26/2012 03:53 PM, Martin K. Petersen wrote:
>>>>>>>>> "Bernd" == Bernd Schubert<bernd.schubert@xxxxxxxxxxxxxxxxxx>   writes:
>>>> 
>>>> Bernd>   We from the Fraunhofer FhGFS team would like to also see the T10
>>>> Bernd>   DIF/DIX API exposed to user space, so that we could make use of
>>>> Bernd>   it for our FhGFS file system.  And I think this feature is not
>>>> Bernd>   only useful for file systems, but in general, scientific
>>>> Bernd>   applications, databases, etc. would also benefit from assurance
>>>> Bernd>   of data integrity.
>>>> 
>>>> I'm attending a SNIA meeting today to discuss a (cross-OS) data
>>>> integrity aware API. We'll see what comes out of that.
>>>> 
>>>> With the Linux hat on I'm still mainly interested in pursuing the
>>>> sys_dio interface Joel and I proposed last year. We have good experience
>>>> with that I/O model and it suits applications that want to interact with
>>>> the protection information well. libaio is also on my list.
>>>> 
>>>> But obviously any help and input is appreciated...
>>>> 
>>> 
>>> I guess you are referring to the interface described here
>>> 
>>> http://www.spinics.net/lists/linux-mm/msg14512.html
>>> 
>>> Hmm, direct I/O would mean we could not use the page cache. As we are
>>> using it, that would not really suit us. libaio might then be another
>>> option.
>> 
>> Are you really sure you want protection information and the page cache?
>> The reason for using DIO is that no-one could really think of a valid
>> page cache based use case.  What most applications using protection
>> information want is to say: This is my data and this is the integrity
>> verification, send it down and assure me you wrote it correctly.  If you
>> go via the page cache, we have all sorts of problems, like our
>> granularity is a page (not a block) so you'd have to guarantee to write
>> a page at a time (a mechanism for combining subpage units of protection
>> information sounds like a nightmare).  The write becomes "mark the page
>> dirty and wait for the system to flush it", and the page can be updated
>> in the meantime.  How do we update the page and its protection
>> information atomically?  What happens if the page gets updated but no
>> protection information is supplied?  And so on ...  The can of worms
>> just gets squirmier.  Doing DIO only avoids all of this.
> 
> Well, entirely direct I/O will not work anyway: FhGFS is a parallel network file system, so data are sent from clients to servers, and the I/O path is no longer truly direct.
> 
> The problem with direct I/O on the server-side storage is that it is too slow for several workloads. I guess the write performance could mostly be solved somehow, but the read cache would still be entirely missing. From Lustre history I know that server-side read caching improved application performance at several sites, so I really wouldn't like to disable it for FhGFS.
> 
> If we couldn't use the page cache, we probably wouldn't attempt to use the DIF/DIX interface, and would instead calculate our own checksums once we start work on the data integrity feature on our side.

This is interesting.  I imagine the Linux kernel NFS server will have the same issue: it depends on the page cache for good performance, and does not, itself, use direct I/O.

Thus it wouldn't be able to use a direct I/O-only DIF/DIX implementation, and we couldn't use DIF/DIX for end-to-end corruption detection in a Linux client to Linux server configuration.

If high-performance applications such as databases demand corruption detection, it will need to work without introducing significant performance overhead.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

