On 02/20/2013 07:03 AM, Anand Avati
wrote:
On Tue, Feb 19, 2013 at 5:12 PM, Anand
Avati <anand.avati@xxxxxxxxx>
wrote:
On Tue, Feb 19, 2013 at 3:59 AM, Pranith
Kumar K <pkarampu@xxxxxxxxxx>
wrote:
On 02/19/2013 11:26 AM, Anand Avati wrote:
Thinking over this, looks like there is a
problem!
Write-behind guarantees: That a second
write request arriving after the
acknowledgement of a first overlapping
request (whether written-behind or
otherwise) will be guaranteed to be
fulfilled in the backend in the same order
(i.e., the second overlapping request will be
"serialized" behind the first one in the
fulfillment process).
Eager-lock requirement: That write-behind
will send no two write requests on an
overlapping region at the same time.
The requirement-set and guarantee-set have
a big overlap, but the requirement-set is
not a subset.
This is because of O_SYNC writes.
write-behind performs write-serialization at
fulfillment only for written behind requests
(which get covered under the conflict
detection code during liability
fulfillment). However, if two threads (or
apps) issue overlapping O_SYNC writes to the
same region at approximately the same time, then
write-behind will let both of them go by
without any kind of serialization, into
eager lock, violating the assumptions!
I'm wondering if it is a safer idea to
implement overlap checks within eager-lock
code itself rather than depend on
write-behind :|
Avati
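To make the idea of an overlap check inside eager-lock concrete, here is a
minimal sketch in Python. It is not GlusterFS code: InFlightWrites, submit()
and complete() are hypothetical names used only for illustration. The sketch
assumes that a write is wound only when it overlaps nothing that is in flight
or already queued, so overlapping writes reach the bricks one at a time and in
arrival order.

```python
from collections import deque


def overlaps(a_off, a_len, b_off, b_len):
    """True when byte ranges [off, off + len) share at least one byte."""
    return a_off < b_off + b_len and b_off < a_off + a_len


class InFlightWrites:
    """Hypothetical tracker: admits a write only if nothing overlapping is pending."""

    def __init__(self):
        self.in_flight = []     # (offset, length) pairs currently wound to bricks
        self.waiting = deque()  # writes held back, kept in arrival order

    def submit(self, off, length):
        """Return True if the write may be wound now, False if it was queued."""
        pending = self.in_flight + list(self.waiting)
        if any(overlaps(off, length, o, l) for o, l in pending):
            self.waiting.append((off, length))
            return False
        self.in_flight.append((off, length))
        return True

    def complete(self, off, length):
        """Retire a finished write and return the waiters that can now proceed."""
        self.in_flight.remove((off, length))
        released = []
        still_waiting = deque()
        for w in self.waiting:
            # A waiter stays queued if it overlaps an in-flight write or an
            # earlier waiter, which preserves arrival order among overlaps.
            pending = self.in_flight + list(still_waiting)
            if any(overlaps(*w, *p) for p in pending):
                still_waiting.append(w)
            else:
                self.in_flight.append(w)
                released.append(w)
        self.waiting = still_waiting
        return released
```

With this model, two O_SYNC writes to the same region are serialized no matter
whether write-behind is loaded: the second submit() returns False and the
write is released only by the complete() of the first.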
On Mon, Feb 11, 2013
at 10:07 PM, Anand Avati <anand.avati@xxxxxxxxx>
wrote:
On Mon, Feb 11, 2013 at 9:32 PM,
Pranith Kumar K <pkarampu@xxxxxxxxxx>
wrote:
hi,
Please note that this is a case in
theory and I did not run into such
situation, but I feel it is
important to address this.
Configuration with "eager-lock on"
and "write-behind off" should not
be allowed as it leads to lock
synchronization problems which
lead to data inconsistency among
replicas in NFS.
Let's say bricks b1 and b2 are in
replication.
The Gluster NFS server uses one
anonymous fd to perform all
write-fops. If eager-lock is
enabled in afr, the fd's address is
used as the lock-owner, which will be
the same for all write-fops, so there
will never be any inodelk
contention. If write-behind is
disabled, there can be writes that
overlap. (Does NFS make sure that
the ranges don't overlap?)
Now imagine the following
scenario:
Let's say w1 and w2 are two write fops
on the same offset and length, w1 with
all '0's and w2 with all '1's. If
these two write fops are executed in
two different threads, the order of
arrival of write fops on b1 can be
w1, w2 whereas on b2 it is w2, w1,
leading to data inconsistency
between the two replicas. The lock
contention will not happen as both
the lk-owner and the transport are
the same for these two fops.
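The scenario above can be reproduced with a small Python model. This is only a
sketch, not GlusterFS code: apply_writes() is a hypothetical helper, each brick
is modelled as a byte buffer, and the shared lock-owner is modelled by taking
the owner from one anonymous fd object.

```python
def apply_writes(size, writes):
    """Apply (offset, data) writes in order to a zero-filled brick image."""
    brick = bytearray(size)
    for off, data in writes:
        brick[off:off + len(data)] = data
    return bytes(brick)


# Both writes come through the same anonymous fd, so in this model they
# carry the same lock-owner and never contend with each other.
anon_fd = object()
owner_w1 = id(anon_fd)
owner_w2 = id(anon_fd)
same_owner = owner_w1 == owner_w2

# w1 and w2 cover the same offset and length with different data.
w1 = (0, b"0" * 8)
w2 = (0, b"1" * 8)

# Without serialization, each brick may see a different arrival order.
b1 = apply_writes(8, [w1, w2])  # last writer on b1 is w2
b2 = apply_writes(8, [w2, w1])  # last writer on b2 is w1

print(same_owner, b1, b2, b1 == b2)
# prints: True b'11111111' b'00000000' False
```

Since the lock-owner is the same for both writes, nothing forces them into one
order, and the two replicas end up with different contents.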
Write-behind has two functions - a)
performing operations in the
background and b) serializing
overlapping operations.
While the problem does exist, the
specifics are different from what you
describe. Since all writes coming in
from NFS will always use the same
anonymous FD, two
near-in-time/overlapping writes will
never contend with inodelk() but
instead the second write will inherit
the lock and changelog from the first.
In either case, it is a problem.
We can add a check
in glusterd for volume set to
disallow such configuration, BUT
by default write-behind is off in
nfs graph and by default
eager-lock is on. So we should
either turn on write-behind for
nfs or turn off eager-lock by
default.
Could you please suggest how to
proceed with this if you agree
that I did not miss any important
detail that makes this theory
invalid.
Loading the write-behind xlator
in the NFS graph looks like the
simpler solution. Eager-locking is
crucial for replicated NFS write
performance.
Avati
Shall we disable eager-lock for files opened with
O_SYNC, for now?
Bad news: the problem is slightly worse than just this.
Even with non-O_SYNC writes, write-behind has a window in
which two overlapping write requests can be wound down
together to eager lock. This happens when wb_enqueue() of
the second request runs after wb_enqueue() of the first
write, but before any unwind() following the first
wb_enqueue() (i.e., wb_inode->gen has not been bumped yet).
But this has a simple fix - http://review.gluster.org/4550.
Disabling eager-locking for O_SYNC files is a bad idea. We
absolutely want eager-locking for O_SYNC files. Thinking
more..
Avati
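The window described above can be illustrated with a toy model in Python. It
is a sketch under stated assumptions, not the write-behind implementation: the
WbInode class and its methods are hypothetical, and the conflict rule (a
request waits only behind overlapping requests stamped with a lower
generation) is a simplification inferred from the description in this thread.

```python
def ranges_overlap(a, b):
    """True when the byte ranges of two requests share at least one byte."""
    return a["off"] < b["off"] + b["len"] and b["off"] < a["off"] + a["len"]


class WbInode:
    """Toy per-inode state; only the generation counter and queue are modelled."""

    def __init__(self):
        self.gen = 0
        self.queue = []

    def enqueue(self, off, length):
        """Stamp the request with the current generation and queue it."""
        req = {"off": off, "len": length, "gen": self.gen}
        self.queue.append(req)
        return req

    def unwind(self):
        """Assumption: acknowledging a request to the caller bumps gen."""
        self.gen += 1

    def must_wait_behind(self, req):
        """Assumed rule: only overlapping requests of a lower gen conflict."""
        return [r for r in self.queue
                if r["gen"] < req["gen"] and ranges_overlap(r, req)]


# Case 1: the second write arrives after the first was acknowledged.
wb = WbInode()
first = wb.enqueue(0, 4096)
wb.unwind()
second = wb.enqueue(0, 4096)
serialized = wb.must_wait_behind(second)  # contains the first request

# Case 2: the second write is enqueued before any unwind().
wb = WbInode()
first_racy = wb.enqueue(0, 4096)
second_racy = wb.enqueue(0, 4096)
racy = wb.must_wait_behind(second_racy)   # empty: both share the same gen
```

In the second case no conflict is detected because both requests carry the
same generation, so in this model nothing stops them from being wound
together.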
Why is disabling eager-lock for O_SYNC files a bad idea? It is
acceptable to sacrifice a bit of performance for O_SYNC, isn't it?
Pranith.