---------- Forwarded message ----------
From: Anand Babu Periasamy <ab at gluster.com>
Date: 2009/3/9
Subject: Re: How caches are working on AFR?
To: Stas Oskin <stas.oskin at gmail.com>
Cc: Gluster General Discussion List <gluster-users at gluster.org>

Stas Oskin wrote:
> Hi.
>
> 2009/3/8 Anand Babu Periasamy <ab at gluster.com>
>
> > Replicate in 2.0 performs atomic writes by default. This means
> > writes will return control back to the application only after both
> > the volumes (or more) are successfully written.
>
> Ok, so without write-behind cache, the app would continue only when
> the data is physically written to all AFR disks?
>

Yes. Precisely speaking, the write returns when the data has been
handed over to the underlying disk file system, not when it has been
physically written to disk. At that point it may be written or only
journaled.

Every parallel write operation is a transaction. It has to complete
atomically on all volumes. If a volume is down, incomplete files are
marked pending, and the write does not block in that case.

> > To mask the performance penalty of atomic writes, you should load
> > write-behind on top of it. Write-behind returns control as soon as
> > it receives the write call from the application, but it continues
> > to write in the background. Write-behind also performs
> > block-aggregation. Smaller writes are aggregated into fewer large
> > writes.
> >
> > POSIX says the application should verify the return status of the
> > close system call to ensure all writes were successfully written.
> > If there are any pending writes, the close call will block to
> > ensure all the data is completely written. There is an option in
> > write-behind to even close in the background. It is unsafe and
> > turned off by default.
>
> So I need to call close() for each file (which should be done
> nevertheless for correct operation), in order to ensure all was
> written to disk?
>
> And if the close() fails, this means some of the data was lost?
>

Yes, correct. This behavior is expected even for regular disk file
systems.
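
As a minimal sketch of that check in C (the helper name
write_all_and_close is hypothetical and is not part of GlusterFS or
glibc; it only uses the standard POSIX open/write/fsync/close calls),
an application could treat a failed close() exactly like a failed
write():

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes from buf to path, optionally
 * fsync, and report any error, including one that is only reported
 * at close(). Returns 0 on success, -1 on failure with errno set. */
int write_all_and_close(const char *path, const void *buf, size_t len,
                        int do_fsync)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, retry the write */
            int saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        p += n;                     /* handle short writes */
        len -= (size_t)n;
    }

    /* Optional: ask for the data to reach permanent storage. */
    if (do_fsync && fsync(fd) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    /* With write-behind (or NFS), errors from earlier writes may
     * only be reported here, so the return value must be checked. */
    if (close(fd) < 0)
        return -1;

    return 0;
}
```

A caller would use it as write_all_and_close("out.dat", data, size, 0)
and treat a return value of -1 as "the data may not have been
written", whether the failure came from write(), fsync() or close().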

If you want every write to be physically written to disk, you should
either open with O_DIRECT, or flush, or use the appropriate file system
APIs for synchronous writes. GlusterFS respects all of these flags and
APIs and turns off write-behind or any such optimizations
appropriately.

> > Applications that expect every write to succeed issue synchronous
> > writes.
>
> By this you mean that no write-behind should be used, only the
> default atomic writes behavior?
>

No, write-behind is good. Even NFS and regular disk file systems behave
exactly like this. See the excerpt from the GNU Glibc reference manual
below.

In GlusterFS, all of the functionality, including basic performance
features, is implemented as modules. You will get awful performance
without these modules loaded. Without them, you can only expect
GlusterFS to be functionally correct.

--------[ FROM GLIBC DOC ]--------------------------------
for write (..)
Once `write' returns, the data is enqueued to be written and can be
read back right away, but it is not necessarily written out to
permanent storage immediately. You can use `fsync' when you need to be
sure your data has been permanently stored before continuing. (It is
more efficient for the system to batch up consecutive writes and do
them all at once when convenient. Normally they will always be written
to disk within a minute or less.) Modern systems provide another
function `fdatasync' which guarantees integrity only for the file data
and is therefore faster. You can use the `O_FSYNC' open mode to make
`write' always store the data to disk before returning;

for close (..)
`ENOSPC'
`EIO'
`EDQUOT'
     When the file is accessed by NFS, these errors from `write' can
     sometimes not be detected until `close'. *Note I/O Primitives::,
     for details on their meaning.
----------------------------------------------------------

--
Anand Babu Periasamy
GPG Key ID: 0x62E15A31
Blog [http://ab.multics.org]
GlusterFS [http://www.gluster.org]
The GNU Operating System [http://www.gnu.org]