Fwd: How caches are working on AFR?

stas.oskin at gmail.com (Stas Oskin) · Mon, 9 Mar 2009 16:19:34 +0200

---------- Forwarded message ----------
From: Anand Babu Periasamy <ab at gluster.com>
Date: 2009/3/9
Subject: Re: How caches are working on AFR?
To: Stas Oskin <stas.oskin at gmail.com>
Cc: Gluster General Discussion List <gluster-users at gluster.org>

Stas Oskin wrote:

> Hi.
>
> 2009/3/8 Anand Babu Periasamy <ab at gluster.com>
>
>    Replicate in 2.0 performs atomic writes by default. This means,
>    writes will return control
>    back to application only after both the volumes (or more) are
>    successfully written.
>
>
> Ok, so without write-behind cache, the app would continue only when data
> is physically written to all AFR disks?
>

Yes. Precisely speaking, control returns when the data has been handed over
to the underlying disk file system, not when it has been physically written
to disk. At that point it may be written or only journaled.

Every parallel write operation is a transaction. It has to complete
atomically on all volumes. If a volume is down, the incomplete files
are marked pending and the write does not block.

>    To mask the performance penalty of atomic writes, you should load
>    write-behind on top of
>    it. Write-behind returns control as soon as it receives the write
>    call from the
>    application, but it continues to write in background. Write-behind
>    also performs
>    block-aggregation. Smaller writes are aggregated into fewer large
>    writes.
>
>    POSIX says application should verify the return status of close
>    system call to ensure all
>    writes were successfully written. If there are any pending writes,
>    close call will block to
>    ensure all the data is completely written. There is an option in
>    write-behind to even
>    close in background. It is unsafe and turned off by default.
>
>
> So I need to call close() for each file (which should be done nevertheless
> for correct operation), in order to ensure all was written to disk?
>
> And if the close() fails - this means some of the data was lost?
>

Yes, correct. This behavior is expected even for regular disk file systems.
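
To illustrate the close() check discussed above, here is a minimal C sketch.
The function name write_and_close is hypothetical and not part of GlusterFS;
it uses only the POSIX open/write/close calls. It treats a failed close() as
a failed write, because with write-behind loaded a deferred error may only be
reported at that point.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes of buf to path, then close.
 * Returns 0 only if every write() and the final close() succeeded,
 * otherwise -1 with errno set by the failing call. */
int write_and_close(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && (buf != NULL || len == 0));

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted, retry the same chunk */
            int saved = errno;
            close(fd);             /* best effort, keep the write error */
            errno = saved;
            return -1;
        }
        p += n;                    /* short write: advance and continue */
        len -= (size_t)n;
    }

    /* Pending background writes are flushed here; an error such as
     * ENOSPC or EIO from an earlier write may only show up now. */
    if (close(fd) < 0)
        return -1;
    return 0;
}
```

A caller would treat a return value of -1 as "some data may be lost" and
retry or report the error, exactly as it would on a local disk file system.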

If you want every write to be physically written to disk, you should
either open with O_DIRECT or flush or use appropriate file system APIs
for synchronous writes. GlusterFS respects all the flags/APIs and turns off
write-behind or any such optimizations appropriately.
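
As a sketch of the synchronous-write approach mentioned above, the following
C function calls fsync() before close(). The name append_durably is
hypothetical; fsync() is the POSIX call, and opening with O_SYNC would be an
alternative that makes every write() synchronous. How a given file system or
translator stack honors these calls is not shown here.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: append len bytes to path and return 0 only after
 * fsync() and close() have both reported success; -1 otherwise. */
int append_durably(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && (buf != NULL || len == 0));

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted, retry */
            int saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Ask for the data to be pushed to permanent storage before
     * continuing; write() alone only enqueues it. */
    if (fsync(fd) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    return close(fd) < 0 ? -1 : 0;
}
```

The trade-off is latency: each call waits for the storage below it, so this
is best kept for data that must survive a crash, such as journal records.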

>    Applications that expect every write to succeed issue synchronous
>    writes.
>
>
> By this you mean that no write-behind should be used, only the default
> atomic writes behavior?
>

No, write-behind is good. Even NFS and regular disk file systems behave
exactly like this. See the excerpt from the GNU Glibc reference manual below.

In GlusterFS, all functionality, including basic performance features, is
implemented as modules. Without these modules loaded you will get awful
performance; you can only expect GlusterFS to be functionally correct.

--------[ FROM GLIBC DOC ]--------------------------------
for write (..)
    Once `write' returns, the data is enqueued to be written and can be
    read back right away, but it is not necessarily written out to
    permanent storage immediately.  You can use `fsync' when you need
    to be sure your data has been permanently stored before
    continuing.  (It is more efficient for the system to batch up
    consecutive writes and do them all at once when convenient.
    Normally they will always be written to disk within a minute or
    less.)  Modern systems provide another function `fdatasync' which
    guarantees integrity only for the file data and is therefore
    faster.  You can use the `O_FSYNC' open mode to make `write' always
    store the data to disk before returning;

for close (..)
`ENOSPC'
`EIO'
`EDQUOT'
    When the file is accessed by NFS, these errors from `write'
    can sometimes not be detected until `close'.  *Note I/O
    Primitives::, for details on their meaning.
----------------------------------------------------------

-- 
Anand Babu Periasamy
GPG Key ID: 0x62E15A31
Blog [http://ab.multics.org]
GlusterFS [http://www.gluster.org]
The GNU Operating System [http://www.gnu.org]