On Jan 31, 2012, at 2:16 PM, Bernd Schubert wrote:

> On 01/27/2012 12:21 AM, James Bottomley wrote:
>> On Thu, 2012-01-26 at 17:27 +0100, Bernd Schubert wrote:
>>> On 01/26/2012 03:53 PM, Martin K. Petersen wrote:
>>>>>>>>> "Bernd" == Bernd Schubert <bernd.schubert@xxxxxxxxxxxxxxxxxx> writes:
>>>>
>>>> Bernd> We from the Fraunhofer FhGFS team would also like to see the
>>>> Bernd> T10 DIF/DIX API exposed to user space, so that we could make
>>>> Bernd> use of it for our FhGFS file system. And I think this feature
>>>> Bernd> is not only useful for file systems; in general, scientific
>>>> Bernd> applications, databases, etc. would also benefit from
>>>> Bernd> assurance of data integrity.
>>>>
>>>> I'm attending a SNIA meeting today to discuss a (cross-OS) data
>>>> integrity aware API. We'll see what comes out of that.
>>>>
>>>> With the Linux hat on, I'm still mainly interested in pursuing the
>>>> sys_dio interface Joel and I proposed last year. We have good
>>>> experience with that I/O model, and it suits applications that want
>>>> to interact with the protection information. libaio is also on my
>>>> list.
>>>>
>>>> But obviously any help and input is appreciated...
>>>
>>> I guess you are referring to the interface described here:
>>>
>>> http://www.spinics.net/lists/linux-mm/msg14512.html
>>>
>>> Hmm, direct I/O would mean we could not use the page cache. As we are
>>> using it, that would not really suit us. libaio might be another
>>> option then.
>>
>> Are you really sure you want protection information and the page
>> cache? The reason for using DIO is that no one could really think of a
>> valid page-cache-based use case. What most applications using
>> protection information want is to say: this is my data and this is the
>> integrity verification; send it down and assure me you wrote it
>> correctly. If you go via the page cache, we have all sorts of
>> problems: our granularity is a page (not a block), so you'd have to
>> guarantee to write a page at a time (a mechanism for combining
>> sub-page units of protection information sounds like a nightmare). The
>> write becomes "mark the page dirty and wait for the system to flush
>> it", and we can update the page in the meantime. How do we update the
>> page and its protection information atomically? What happens if the
>> page gets updated but no protection information is supplied? And so
>> on... The can of worms just gets more squirmy. Doing DIO only avoids
>> all of this.
>
> Well, entirely direct I/O will not work anyway, as FhGFS is a parallel
> network file system: data are sent from clients to servers, so the I/O
> path is not entirely direct anymore.
>
> The problem with server-side storage direct I/O is that it is too slow
> for several workloads. I guess the write performance could mostly be
> solved somehow, but even then the read cache would be entirely missing.
> From Lustre history I know that server-side read cache improved
> application performance at several sites, so I really wouldn't like to
> disable it for FhGFS...
>
> I guess if we couldn't use the page cache, we probably wouldn't attempt
> to use the DIF/DIX interface, but would calculate our own checksums
> once we start working on the data integrity feature on our side.
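For concreteness, here is a minimal sketch of the kind of
application-level per-block checksumming Bernd describes. CRC32C and
the 4 KiB block size are illustrative assumptions for the example, not
anything FhGFS has actually committed to:

/*
 * Illustrative sketch only: one checksum per fixed-size block, the way
 * an application (or file system server) might protect cached data
 * when the kernel's DIF/DIX path is unavailable. CRC32C and the 4 KiB
 * block size are assumptions for this example.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096	/* assumed checksum granularity */

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(uint32_t crc, const unsigned char *buf, size_t len)
{
	crc = ~crc;
	while (len--) {
		crc ^= *buf++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78 & (0U - (crc & 1)));
	}
	return ~crc;
}

/*
 * Compute one checksum per BLOCK_SIZE block of the buffer; a final
 * short block is checksummed over its actual length. Returns the
 * number of checksums written to out[].
 */
static size_t checksum_blocks(const unsigned char *data, size_t len,
			      uint32_t *out)
{
	size_t nblocks = 0;

	for (size_t off = 0; off < len; off += BLOCK_SIZE) {
		size_t n = len - off < BLOCK_SIZE ? len - off : BLOCK_SIZE;
		out[nblocks++] = crc32c(0, data + off, n);
	}
	return nblocks;
}

int main(void)
{
	unsigned char data[2 * BLOCK_SIZE];
	uint32_t sums[2];

	memset(data, 0xab, sizeof(data));

	size_t n = checksum_blocks(data, sizeof(data), sums);
	for (size_t i = 0; i < n; i++)
		printf("block %zu: crc32c 0x%08x\n", i, (unsigned)sums[i]);
	return 0;
}

On read-back, the server would recompute each block's CRC and compare
it against the stored value before handing the data to a client; a
mismatch indicates corruption somewhere between the original write and
the read. That gives application-level end-to-end coverage, though
unlike DIF/DIX it cannot pinpoint whether the corruption happened in
the HBA, on the wire, or on the disk.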
This is interesting. I imagine the Linux kernel NFS server will have
the same issue: it depends on the page cache for good performance, and
does not, itself, use direct I/O. Thus it wouldn't be able to use a
direct I/O-only DIF/DIX implementation, and we couldn't use DIF/DIX for
end-to-end corruption detection in a Linux client/Linux server
configuration. If high-performance applications such as databases
demand corruption detection, it will need to work without introducing
significant performance overhead.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com