Fwd: Question about EC locking

jayakrishnan mm <jayakrishnan.mm@xxxxxxxxx> · Thu, 2 Feb 2017 15:12:32 +0800

On Fri, Jan 13, 2017 at 8:03 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
Hi,

On 13/01/17 10:58, jayakrishnan mm wrote:

Hi Xavier,

I went through the source code. Some questions remain.

1. If two clients try to write to the same file, it should succeed, even if they overlap. (Locks should ensure that the writes happen in sequence on the bricks.)

From the source code:

    lock->flock.l_type = F_WRLCK;
    lock->flock.l_whence = SEEK_SET;

    fop->flock.l_len += ec_adjust_offset(fop->xl->private,
                                         &fop->flock.l_start, 1);
    fop->flock.l_len = ec_adjust_size(fop->xl->private,
                                      fop->flock.l_len, 1);

If flock.l_len is 0, the entire file is locked for writing.
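For reference, this is the POSIX struct flock convention: with l_whence = SEEK_SET and l_start = 0, an l_len of 0 means "from offset 0 to end of file", including any data appended later. The sketch below shows that convention with plain fcntl() record locks on a local file. It is only an analogy and assumes nothing about EC internals: EC sends its locks to the bricks as inodelk requests rather than calling fcntl(), and the helper names lock_whole_file() and unlock_whole_file() are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: take a blocking write lock on the whole file.
 * With l_whence = SEEK_SET, l_start = 0 and l_len = 0, POSIX defines the
 * locked range as offset 0 through end of file, including data appended
 * later. Returns 0 on success, -1 on error (errno set by fcntl). */
static int lock_whole_file(int fd)
{
    struct flock fl;

    assert(fd >= 0);
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;    /* exclusive (write) lock */
    fl.l_whence = SEEK_SET; /* l_start is relative to the start of file */
    fl.l_start = 0;
    fl.l_len = 0;           /* 0 = up to EOF, i.e. the entire file */
    return fcntl(fd, F_SETLKW, &fl);  /* wait if another process holds it */
}

/* Hypothetical helper: release the same whole-file range. */
static int unlock_whole_file(int fd)
{
    struct flock fl;

    assert(fd >= 0);
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    return fcntl(fd, F_SETLK, &fl);
}
```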

In my test case with 2 clients, I always get flock.l_len as 0. But I am still able to write to the same file from both clients at the same time.

How are you sure you are really writing at the same time? Do you get partial writes from some of the clients?

I am not sure if they are happening simultaneously. I am using fio to do that.

If it acquires the lock chunk by chunk, why am I always getting l_len = 0?

EC doesn't acquire partial locks. The entire file is locked when a modification is needed. This makes it possible to reuse locks for future operations (eager locking).
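A toy model of the eager-locking idea described above, under stated assumptions: the struct and function names are hypothetical, not EC's API, and the lock is simply released on contention (how EC actually detects that another client is waiting, and any release timeout, is not modeled). It only counts how many lock and unlock requests would reach the bricks.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of eager locking; names are hypothetical, not EC's API.
 * Once acquired, the full-file lock is kept after a fop finishes so the
 * next fop on the same file can reuse it. It is released only when
 * another client needs it. */
struct eager_lock {
    bool held;           /* this client currently owns the lock */
    unsigned acquires;   /* lock requests that would be sent to bricks */
    unsigned releases;   /* unlock requests that would be sent to bricks */
};

static void fop_begin(struct eager_lock *l)
{
    assert(l);
    if (!l->held) {      /* no lock yet (or it was released): acquire */
        l->held = true;
        l->acquires++;
    }                    /* otherwise reuse the lock already held */
}

static void fop_end(struct eager_lock *l)
{
    /* Eager locking: the lock is deliberately NOT released here. */
    assert(l && l->held);
}

/* Another client wants a conflicting lock: give ours up. */
static void on_contention(struct eager_lock *l)
{
    assert(l);
    if (l->held) {
        l->held = false;
        l->releases++;
    }
}
```

With ten back-to-back writes from one client, only one lock request is sent; a second acquisition happens only after another client forces a release.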

Why am I not getting the actual write size and offset (for flock.l_len and flock.l_start, respectively) for each write FOP?

(In AFR, they are set to transaction.len and transaction.start respectively, which in turn are the write length and offset in the normal write case.)

Because an erasure code splits the data into smaller fragments, one for each brick, offsets and lengths need to be adjusted.
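To make the adjustment concrete, here is a sketch of how a file byte range can map to the range stored on each brick. The layout is an assumption for illustration (4 data fragments, 512 bytes per fragment per stripe), and ec_sketch_to_brick_range() is a hypothetical helper, not the ec_adjust_offset()/ec_adjust_size() code quoted earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, for illustration only: 4 data fragments per stripe,
 * each storing 512 bytes on its own brick. */
#define EC_SKETCH_FRAGMENTS   4u
#define EC_SKETCH_CHUNK_SIZE  512u
#define EC_SKETCH_STRIPE_SIZE (EC_SKETCH_FRAGMENTS * EC_SKETCH_CHUNK_SIZE)

/* Hypothetical helper: convert the file byte range [*start, *start + *len)
 * into the range each brick stores. The start is aligned down to a stripe
 * boundary, the length is extended to cover whole stripes, and both are
 * divided by the number of fragments. A length of 0 ("to end of file",
 * as in struct flock) stays 0. */
static void ec_sketch_to_brick_range(uint64_t *start, uint64_t *len)
{
    uint64_t head;

    assert(start && len);
    head = *start % EC_SKETCH_STRIPE_SIZE;   /* bytes past the stripe start */
    *start = (*start - head) / EC_SKETCH_FRAGMENTS;
    if (*len != 0) {
        uint64_t span = *len + head;         /* now starts on a boundary */

        span = (span + EC_SKETCH_STRIPE_SIZE - 1) / EC_SKETCH_STRIPE_SIZE
               * EC_SKETCH_STRIPE_SIZE;      /* round up to whole stripes */
        *len = span / EC_SKETCH_FRAGMENTS;
    }
}
```

For a 100-byte write at offset 3000, the write falls inside the stripe that starts at byte 2048, so each brick sees offset 512 and length 512. Leaving l_len = 0 unchanged preserves the whole-file meaning of the lock.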

2. According to the source code, a full-file lock is also taken by the shd:

    ec_heal_inodelk(heal, F_WRLCK, 1, 0, 0);

which means offset = 0 and size = 0 in the ec_heal_lock() function in ec-heal.c:

    flock.l_start = offset;
    flock.l_len = size;

Does this mean that, for a single file, a write cannot happen simultaneously with healing?

Correct. The heal procedure is like an additional client. If a client and the heal process try to write at the same time, they must be serialized, like any other regular write. However, heal only takes the full lock for some critical operations. Regular self-heal of file contents is done by locking chunk by chunk.
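The chunk-by-chunk approach can be sketched with POSIX fcntl() range locks on a local file, as a stand-in for the inodelk requests EC sends to the bricks. Only the chunk being repaired is locked at any time, so a client writing elsewhere in the file is not blocked for the whole heal. heal_by_chunks(), set_range_lock() and the heal_range_stub() placeholder are hypothetical names, and the chunk size is a parameter, not EC's actual value.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical per-chunk repair callback: rebuilds [offset, offset + len). */
typedef int (*heal_fn)(int fd, off_t offset, off_t len);

/* Placeholder callback for the sketch: does no real repair. */
static int heal_range_stub(int fd, off_t offset, off_t len)
{
    (void)fd;
    (void)offset;
    (void)len;
    return 0;
}

/* Lock (F_WRLCK, waiting if needed) or unlock (F_UNLCK) one byte range. */
static int set_range_lock(int fd, short type, off_t start, off_t len)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, type == F_UNLCK ? F_SETLK : F_SETLKW, &fl);
}

/* Heal a file of 'size' bytes one chunk at a time, holding a lock only on
 * the chunk being repaired. Returns the number of chunks healed, or -1. */
static int heal_by_chunks(int fd, off_t size, off_t chunk, heal_fn heal_range)
{
    int healed = 0;

    assert(chunk > 0 && heal_range != NULL);
    for (off_t off = 0; off < size; off += chunk) {
        off_t len = (size - off < chunk) ? size - off : chunk;
        int ret;

        if (set_range_lock(fd, F_WRLCK, off, len) != 0)
            return -1;
        ret = heal_range(fd, off, len);           /* repair only this chunk */
        (void)set_range_lock(fd, F_UNLCK, off, len);
        if (ret != 0)
            return -1;
        healed++;
    }
    return healed;
}
```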

I have a question about index heal/full heal.
According to the code, the index healer thread (ec_shd_index_healer) is created when there is a child_up event OR when there is a TRANSLATOR_OP/GF_SHD_OP_HEAL_INDEX. When does the second case arise?

The full heal thread (ec_shd_full_healer) is created only when TRANSLATOR_OP/GF_SHD_OP_HEAL_FULL arises. Does this happen only in the replace-brick case?

Thanks & regards
JK

Xavi

Correct me if I am wrong.

Best Regards

JK

On Wed, Dec 14, 2016 at 12:07 PM, jayakrishnan mm <jayakrishnan.mm@xxxxxxxxx> wrote:

    Thanks, Xavier, for making it clear.

    Regards

    JK

    On Dec 13, 2016 3:52 PM, "Xavier Hernandez" <xhernandez@xxxxxxxxxx> wrote:

        Hi JK,

        On 12/13/2016 08:34 AM, jayakrishnan mm wrote:

            Dear Xavi,

            How do I test the locks, for example the locks for the write fop? I have two independent clients, both trying to write to the same file.

            1. According to my understanding, both can write successfully if the offsets don't overlap. I mean, the WRITE FOP takes a chunk lock on the file. As long as the clients don't try to write to the same chunk, it should be OK. If no locks are present, it can lead to inconsistency.

        With locks, all writes will be fine as defined by POSIX (i.e. the final result will be equivalent to the sequential execution of both operations, though in an undefined order), even if they overlap. Without locks, there is a chance that some bricks execute the operations in one order and the remaining bricks execute the same operations in the reverse order, causing data corruption.
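A toy simulation of the failure described above, with hypothetical names and a deliberately simplified model (each brick stores one byte; this is not EC code): two clients write 'A' and 'B' to the same offset, and without a lock the bricks may receive them in different orders. With replication that leaves replicas disagreeing; with erasure coding, fragments from different writes may no longer decode to either client's data.

```c
#include <assert.h>
#include <stdbool.h>

#define NBRICKS 3

/* Toy model, not EC code: every brick stores one byte at the same offset,
 * and two clients write 'A' and 'B' there. a_first[i] says whether brick i
 * received 'A' before 'B'. The write applied last is what the brick keeps. */
static void apply_two_writes(char bricks[NBRICKS], const bool a_first[NBRICKS])
{
    assert(bricks && a_first);
    for (int i = 0; i < NBRICKS; i++)
        bricks[i] = a_first[i] ? 'B' : 'A';
}

/* True if all bricks ended up with the same content. */
static bool bricks_consistent(const char bricks[NBRICKS])
{
    for (int i = 1; i < NBRICKS; i++)
        if (bricks[i] != bricks[0])
            return false;
    return true;
}
```

A lock enforces one global order, so every brick ends with the same value, although which write wins is still unspecified, matching the POSIX semantics described above.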

            2. Different FOPs can always run simultaneously (for example, WRITE and READ FOPs, or two READ FOPs).

        All fops can be executed concurrently. If there's any chance that two operations could interfere, locks are taken in the appropriate places. For example, reads cannot be merged with overlapping writes; otherwise they could return inconsistent data.
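Whether a read "overlaps" a write comes down to an interval-intersection test. A minimal sketch, assuming half-open ranges [start, start + len) and the struct flock convention that len = 0 means "to end of file"; struct range and ranges_overlap() are hypothetical names, not EC's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical byte range [start, start + len); len == 0 means "to EOF". */
struct range {
    uint64_t start;
    uint64_t len;
};

static uint64_t range_end(struct range r)
{
    return r.len == 0 ? UINT64_MAX : r.start + r.len;
}

/* Two ranges conflict iff each one starts before the other one ends.
 * Non-overlapping reads and writes could proceed in parallel. */
static bool ranges_overlap(struct range a, struct range b)
{
    assert(a.len == 0 || a.start + a.len > a.start);  /* no wrap-around */
    assert(b.len == 0 || b.start + b.len > b.start);
    return a.start < range_end(b) && b.start < range_end(a);
}
```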

            3. WRITE and some metadata FOP (like setattr) together: with locks, they cannot happen at the same time, even though the chances are very low.

        As in 2, if there's any possible interference, the appropriate locks will be taken.

        You can look at the code to see which locks are taken for each fop. See the corresponding ec_manager_<fop>() function, in the EC_STATE_LOCK switch case. There you will see calls to ec_lock_prepare_xxx() for each lock taken.

        Xavi

            Please clarify.

            Best regards

            JK

            On Wed, Nov 30, 2016 at 5:49 PM, jayakrishnan mm <jayakrishnan.mm@xxxxxxxxx> wrote:

                Hi Xavier,

                Thank you very much for your explanation. This helped me understand more about locking in EC.

                Best Regards

                JK

                On Mon, Nov 28, 2016 at 4:17 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:

                    Hi,

                    On 11/28/2016 02:59 AM, jayakrishnan mm wrote:

                        Hi Xavier,

                        I notice that the EC xlator uses blocking locks. Is there any specific reason for this?

                    In a distributed filesystem like Gluster, a synchronization mechanism is a must to avoid data corruption.

                        Do you think this will affect performance?

                    Of course the need for locks has a performance impact, and we cannot avoid them if we want to guarantee data integrity. However, some optimizations have been applied, especially eager locking, which allows a lock to be reused without unlocking and locking again.

                        (In comparison, AFR first tries non-blocking locks and, if not successful, then tries blocking locks.)

                    EC also tries a non-blocking lock first.
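The "non-blocking first, then blocking" strategy can be sketched with POSIX fcntl(): F_SETLK fails immediately with EAGAIN or EACCES if the range is already held, while F_SETLKW waits. This is only an analogy; EC issues inodelk requests to several bricks, and how it falls back to blocking mode across bricks is not modeled here. lock_try_then_wait() is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: take a write lock on [start, start + len), trying
 * the cheap non-blocking request first and falling back to a blocking
 * wait only if the range is already held. Returns 0 on success, -1 on
 * error; *waited is set to 1 if the blocking path was used. */
static int lock_try_then_wait(int fd, off_t start, off_t len, int *waited)
{
    struct flock fl;

    assert(waited != NULL);
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;

    *waited = 0;
    if (fcntl(fd, F_SETLK, &fl) == 0)
        return 0;                     /* uncontended: no waiting at all */
    if (errno != EAGAIN && errno != EACCES)
        return -1;                    /* a real error, not contention */

    *waited = 1;
    return fcntl(fd, F_SETLKW, &fl);  /* contended: block until released */
}
```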

                        Also, why are two locks needed per FOP? One for normal I/O and another for self-healing?

                    The only fop that currently needs two locks is 'rename', and only when the source and destination directories are different. All other fops take at most one lock.

                    Best regards,

                    Xavi

                        Best regards

                        JK

                        _______________________________________________
                        Gluster-devel mailing list
                        Gluster-devel@xxxxxxxxxxx
                        http://www.gluster.org/mailman/listinfo/gluster-devel
