On 11/05/2014 10:53 PM, Andreas Hollaus wrote:
Hi,
Maybe my GlusterFS source code is a bit old, but these scripts
seem to be referred to as filters there. I do not know what the
references below point to either.
### glusterd-volgen.c ###
.
.
.
static void
volgen_apply_filters (char *orig_volfile)
{
        DIR           *filterdir  = NULL;
        struct dirent  entry      = {0,};
        struct dirent *next       = NULL;
        char          *filterpath = NULL;
        struct stat    statbuf    = {0,};

        filterdir = opendir (FILTERDIR);
.
.
.
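If I read the Makefile correctly, FILTERDIR is defined at build time as
$(libdir)/glusterfs/$(PACKAGE_VERSION)/filter, so on my build I would guess
at something like the following (the exact libdir and version here are
assumptions on my part, not verified):

# mkdir -p /usr/lib64/glusterfs/3.6.1/filter
# cp fav-child-filter.sh /usr/lib64/glusterfs/3.6.1/filter/
# chmod +x /usr/lib64/glusterfs/3.6.1/filter/fav-child-filter.sh

As far as I can tell from volgen_apply_filters(), glusterd runs every
regular, executable file in that directory, passing the path of the freshly
generated volfile as the argument.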
Anyway, I was previously told to use a sed script, as the volume
files will be overwritten whenever an option is set using a CLI
command. I have created such a script, but I wonder where I
should store it (according to the Makefile, the path depends on
the installation dir and the release version, and I'm only sure
about the latter)?
Maybe if I, as you say, edit the file before the volume is
started, this favorite-child setting will be read and then become
part of the volume settings that are stored whenever a CLI command
is executed. No, I just tried this, and the favorite-child option
was removed when I later set a new ping-timeout value. It seems
like a script is required after all to make sure that the
setting is persistent. I hope, though, that this setting is read
from the volume file and handled, even though it is not
rewritten in case I set other options(?).
Right, edits to the volfile will be lost when a new volume set
operation is made, as the files get rewritten. So you would have
to use hook scripts[1]. Since you want to retain the fav-child
option after every volume set operation, you must place your sed
script in /var/lib/glusterd/hooks/1/set/post/.
Here is a nice example of how to use them:
http://aravindavk.in/blog/effective-glusterfs-monitoring-using-hooks/
[1]
http://www.gluster.org/community/documentation/index.php/Features/Hooks
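A minimal post-set hook could look something like this sketch. The volume
name, client subvolume and volfile name are assumptions you would adapt,
and as I recall glusterd only executes hook scripts that are executable and
whose file names start with 'S':

# cat /var/lib/glusterd/hooks/1/set/post/S99-fav-child.sh
<snip>
#!/bin/bash
# Re-insert the favorite-child option after every 'volume set',
# since glusterd regenerates the volfiles on each set operation.
VOL=testvol                                          # assumed volume name
VOLFILE=/var/lib/glusterd/vols/$VOL/trusted-$VOL-fuse.vol
# Add the option below the replicate declaration unless already present.
grep -q "option favorite-child" "$VOLFILE" ||
    sed -i '/type cluster\/replicate/a option favorite-child testvol-client-1' "$VOLFILE"
</snip>

This is an untested sketch; in a real script I believe you would also want
to check the --volname argument that glusterd passes to set hooks, so that
it only fires for the right volume.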
By the way, you told me to edit the 'trusted-<volname>.vol' file, but I
also have a '<volname>.vol' file with similar contents. What's the
difference between these, and is it only the 'trusted-<volname>.vol' that
is supposed to be edited?
Sorry, I missed out on explaining the difference. If the client (mount) is
on the same node as the server(s), you need to edit the trusted-*.vol file.
If your client is a separate node, you need to edit the other one.
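On my 3.x test setup the two generated client volfiles sit next to each
other on the servers; for a volume named testvol (name assumed) the listing
looks roughly like:

# ls /var/lib/glusterd/vols/testvol/ | grep fuse
testvol-fuse.vol
trusted-testvol-fuse.vol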
HTH,
Ravi
Regards
Andreas
On 11/05/14 16:43, Ravishankar N wrote:
On 11/05/2014 06:54 PM, Andreas Hollaus wrote:
On 11/05/14 12:23, Ravishankar N wrote:
On 11/05/2014 03:18 PM, Andreas Hollaus wrote:
Hi,
I'm curious about this 5-phase transaction scheme that is described in the document
(lock, pre-op, op, post-op, unlock).
Are these stage switches all triggered from the client or can the server do it
without notifying the client, for instance switching from 'op' to 'post-op'?
All stages are performed by the AFR translator in the client graph, where it is
loaded, in the sequence you listed.
So the counters are stored on the servers (as extended attributes on the bricks), but
increased and decreased by the client after fetching them from the servers? If so, I
guess that the messages between those are just synchronous file system operations
like read extended attributes, write file etc.
You got it right. Lock the file on the bricks, set xattrs on
bricks, write to bricks, clear xattrs on bricks (success case),
unlock file on bricks.
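You can actually see the changelog on the bricks with getfattr. For
example, while a write to one brick is pending, something along these lines
shows up (volume and brick names assumed):

# getfattr -d -m . -e hex /bricks/b1/file.txt
# file: bricks/b1/file.txt
trusted.afr.testvol-client-0=0x000000000000000000000000
trusted.afr.testvol-client-1=0x000000010000000000000000

The 12-byte value is three 4-byte counters (data/metadata/entry); the
non-zero data counter in trusted.afr.testvol-client-1 here means brick b1
is 'accusing' the other brick of having pending writes.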
Is the client created whenever a GlusterFS volume is mounted?
Correct. You give the hostname + volume name to the mount process,
which uses them to fetch the volfile graph from the server,
reads it and loads the appropriate xlators.
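For example (host and volume names assumed):

# mount -t glusterfs server1:/testvol /mnt/glusterfs

The mount helper asks glusterd on server1 for the client volfile of testvol
and then spawns a glusterfs client process that loads the graph described
in that file.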
As I'm running both server and client on the same board, it's a bit hard to
distinguish them from each other. Decreasing the counter for the local
pending operations could be done without talking to the client, even though
I realize a message has to be sent to the other server(s), possibly through
the client.
The reason I ask is that I'm trying to estimate the risk of ending up in a
split-brain situation, or at least understand whether our servers will
'accuse' each other temporarily during this 5-phase transaction under
normal circumstances. If I understand who sends messages to whom, and in
what order, I'll have a better chance to see if we require any solution to
split-brain situations. As I've experienced problems setting up the
'favorite-child' option, I want to know if it's required or not. In our use
case, quorum is not a solution, but losing some data is acceptable as long
as the bricks are in sync.
If a file is split-brained, AFR does not allow clients to modify it
until the split-brain is resolved. The AFR xattrs and heal mechanisms
ensure that the bricks are in sync, so no worries on that front.
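You can list files that need healing, and those in split-brain, with
(volume name assumed):

# gluster volume heal testvol info
# gluster volume heal testvol info split-brain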
I know about the input/output error in case of a split-brain, and that is
something we must avoid at any cost. That's the reason why 'favorite-child'
seems like a good idea for us, but my filter script is not executed even
though I tried a couple of probable locations to store it in. It's a bit
hard to be absolutely sure what that filter path macro contained at the
time the GlusterFS package was built. It would have been easier if the path
existed, even if empty, when no filters are used. According to the source
code, there are some return statements due to errors that could also be the
reason for the filter script not running. Is there any way to set a verbose
level to get some more clues about what's going on?
Not sure I follow you on what a filter script is (hook scripts?), but yes,
you can use the favorite-child option to pick the source for split-brained
files. I don't think it's a supported/tested feature, though. It can't be
set using the gluster CLI; you will have to edit the volfile manually and
add the option before starting the volume, like so:
# cat /var/lib/glusterd/vols/testvol/trusted-testvol-fuse.vol
<snip>
volume testvol-replicate-0
    type cluster/replicate
    option favorite-child testvol-client-1
    subvolumes testvol-client-0 testvol-client-1
end-volume
</snip>
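So the rough sequence would be (names assumed):

# gluster volume stop testvol
# vi /var/lib/glusterd/vols/testvol/trusted-testvol-fuse.vol   (add the option as above)
# gluster volume start testvol
# mount -t glusterfs server1:/testvol /mnt/glusterfs

so that the client picks up the edited graph when it fetches the volfile at
mount time.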
-Ravi
Regards
Andreas
Thanks,
Ravi
Regards
Andreas
On 10/31/14 15:37, Ravishankar N wrote:
On 10/30/2014 07:23 PM, Andreas Hollaus wrote:
Hi,
Thanks! Seems like an interesting document. Although I've read blogs about
how extended attributes are used as a change log, this seems like a more
comprehensive document.
I won't write directly to any brick. That's the reason I first have to create a
volume which consists of only one brick, until the other server is available, and
then add that second brick. I don't want to delay the file system clients until the
second server is available, hence the reason for add-brick.
I guess that this procedure is only needed the first time the volume is
configured, right? If any of these bricks fails later on, the change log
would keep track of all changes to the file system even though only one of
the bricks is available(?).
Yes, if one brick of a replica pair goes down, the other one keeps track of
file modifications by the client and syncs them back to the first one when
it comes back up.
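The self-heal daemon normally does this on its own, but you can also
trigger the sync by hand (volume name assumed):

# gluster volume heal testvol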
After a restart, volume settings stored in the configuration file would be accepted
even though not all servers were up and running yet at that time, wouldn't they?
glusterd running on all nodes ensures that the volume configurations stored on each
node are in sync.
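An easy way to verify that is to run, on each server (volume name assumed):

# gluster peer status
# gluster volume info testvol

and check that the reported options and bricks match everywhere.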
Speaking of configuration files: when are these copied to each server?
If I create a volume which consists of two bricks, I guess that those
servers will create the configuration files, independently of each other,
from the information sent from the client (gluster volume create...).
All volume config/management commands must be run from any of the servers
that make up the volume, and not the client (unless both happen to be on
the same machine). As mentioned above, when any of the volume commands is
run on any one server, glusterd orchestrates the necessary action on all
servers and keeps them in sync.
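For example, a typical two-node replica setup would be created from one of
the servers like this (hostnames and brick paths assumed):

# gluster peer probe server2
# gluster volume create testvol replica 2 server1:/bricks/b1 server2:/bricks/b1
# gluster volume start testvol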
In case I later add a brick, I guess that the settings have to be copied to
the new brick after they have been modified on the first one, right (or
will they be recreated on all servers from the information specified by the
client, like in the previous case)?

Will configuration files be copied in other situations as well, for
instance in case one of the servers which is part of the volume would for
some reason be missing those files? In my case, the root file system is
recreated from an image at each reboot, so everything created in /etc will
be lost. Will GlusterFS settings be restored from the other server
automatically
No, it is expected that servers have persistent file-systems. There are ways to
restore such bricks; see
http://gluster.org/community/documentation/index.php/Gluster_3.4:_Brick_Restoration_-_Replace_Crashed_Server
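The gist of that page, roughly (names and paths assumed; see the link for
the full steps, including restoring the peer UUID): recreate the brick
directory, copy the volume-id xattr over from a healthy brick, restart
glusterd and trigger a full heal:

# getfattr -n trusted.glusterfs.volume-id -e hex /bricks/b1     (on the healthy server)
# mkdir -p /bricks/b1                                           (on the rebuilt server)
# setfattr -n trusted.glusterfs.volume-id -v 0x<value-from-above> /bricks/b1
# service glusterd restart
# gluster volume heal testvol full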
-Ravi
or do I need to back up and restore those myself? Even though the brick
doesn't know that it is part of a volume in case it loses the configuration
files, both the other server(s) and the client(s) will probably recognize
it as being part of the volume. I therefore believe that such self-healing
would actually be possible, even though it may not be implemented.
Regards
Andreas
On 10/30/14 05:21, Ravishankar N wrote:
On 10/28/2014 03:58 PM, Andreas Hollaus wrote:
Hi,
I'm curious about how GlusterFS manages to sync the bricks in the initial
phase, when the volume is created or extended.

I first create a volume consisting of only one brick, which clients will
start to read and write. After a while I add a second brick to the volume
to create a replicated volume. If this new brick is empty, I guess that
files will be copied from the first brick to get the bricks in sync, right?
However, if the second brick is not empty but rather contains a subset of
the files on the first brick, I don't see how GlusterFS will solve the
problem of syncing the bricks.
I guess that all files which lack extended attributes could be removed in
this scenario, because they were created when the disk was not part of a
GlusterFS volume. However, in case the brick was used in the volume
previously, for instance before that server restarted, there will be
extended attributes for the files on the second brick which weren't updated
during the downtime (when the volume consisted of only one brick). There
could be multiple changes to the files during this time. In this case I
don't understand how the extended attributes could be used to determine
which of the bricks contains the most recent file.

Can anyone explain how this works? Is it only allowed to add empty bricks
to a volume?
It is allowed to add only empty bricks to the volume. Writing directly to
bricks is not supported; one needs to access the volume only from a mount
point or using libgfapi.

After adding a brick to increase the distribute count, you need to run the
volume rebalance command so that some of the existing files are hashed
(moved) to the newly added brick.
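For example (volume and brick names assumed):

# gluster volume add-brick testvol server3:/bricks/b1
# gluster volume rebalance testvol start
# gluster volume rebalance testvol status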
After adding a brick to increase the replica count, you need to run the
volume heal full command to sync the files from the other replica into the
newly added brick.
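For example, going from a plain one-brick volume to replica 2 (names
assumed):

# gluster volume add-brick testvol replica 2 server2:/bricks/b1
# gluster volume heal testvol full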
https://github.com/gluster/glusterfs/blob/master/doc/features/afr-v1.md will give
you an idea of how the replicate translator uses xattrs to keep files in sync.
HTH,
Ravi