Hi,

I'm curious about the 5-phase transaction scheme described in the document (lock, pre-op, op, post-op, unlock). Are these stage switches all triggered from the client, or can the server perform one without notifying the client, for instance switching from 'op' to 'post-op'? Decreasing the counter for the local pending operations could be done without talking to the client, even though I realize a message has to be sent to the other server(s), possibly through the client.

The reason I ask is that I'm trying to estimate the risk of ending up in a split-brain situation, or at least understand whether our servers will 'accuse' each other temporarily during this 5-phase transaction under normal circumstances. If I understand who sends messages to whom, and in what order, I'll have a better chance of judging whether we need any solution for split-brain situations. As I've had problems setting up the 'favorite-child' option, I want to know whether it's required or not. In our use case quorum is not a solution, but losing some data is acceptable as long as the bricks stay in sync.
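To make the question concrete, this is how I picture the changelog accounting during those phases after reading the AFR document (the volume name 'gv0', the brick path and the client indices below are made-up examples, not from our setup):

    # Inspect the AFR changelog xattrs for a file directly on a brick
    # (read-only; run as root on the server holding the brick).
    getfattr -d -m trusted.afr -e hex /data/brick1/gv0/somefile
    #
    # Idle, after unlock:
    #   trusted.afr.gv0-client-0=0x000000000000000000000000
    #   trusted.afr.gv0-client-1=0x000000000000000000000000
    # Between pre-op and post-op, the data counter (first 4 bytes) is
    # raised for each brick the write is still outstanding on:
    #   trusted.afr.gv0-client-1=0x000000010000000000000000

If that picture is right, the bricks do 'accuse' each other briefly during every healthy transaction, until post-op decrements the counters again. Is that correct?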
Regards
Andreas

On 10/31/14 15:37, Ravishankar N wrote:
> On 10/30/2014 07:23 PM, Andreas Hollaus wrote:
>> Hi,
>>
>> Thanks! Seems like an interesting document. Although I've read blogs about how extended attributes are used as a change log, this seems like a more comprehensive document.
>>
>> I won't write directly to any brick. That's the reason I first have to create a volume which consists of only one brick, until the other server is available, and then add that second brick. I don't want to delay the file system clients until the second server is available, hence the reason for add-brick.
>>
>> I guess that this procedure is only needed the first time the volume is configured, right? If any of these bricks failed later on, the change log would keep track of all changes to the file system even though only one of the bricks is available(?).
>
> Yes, if one brick of a replica pair goes down, the other one keeps track of file modifications by the client and syncs the changes back to the first brick when it comes back up.
>
>> After a restart, volume settings stored in the configuration file would be accepted even though not all servers were up and running yet at that time, wouldn't they?
>
> glusterd running on all nodes ensures that the volume configurations stored on each node are in sync.
>
>> Speaking about configuration files: when are these copied to each server? If I create a volume which consists of two bricks, I guess that those servers will create the configuration files, independently of each other, from the information sent from the client (gluster volume create...).
>
> All volume config/management commands must be run from one of the servers that make up the volume, not from the client (unless both happen to be on the same machine). As mentioned above, when any of the volume commands is run on any one server, glusterd orchestrates the necessary action on all servers and keeps them in sync.
>
>> In case I later on add a brick, I guess that the settings have to be copied to the new brick after they have been modified on the first one, right (or will they be recreated on all servers from the information specified by the client, like in the previous case)?
>>
>> Will configuration files be copied in other situations as well, for instance in case one of the servers which is part of the volume would for some reason be missing those files? In my case, the root file system is recreated from an image at each reboot, so everything created in /etc will be lost. Will GlusterFS settings be restored from the other server automatically
>
> No, it is expected that servers have persistent file systems. There are ways to restore such bricks; see
> http://gluster.org/community/documentation/index.php/Gluster_3.4:_Brick_Restoration_-_Replace_Crashed_Server
>
> -Ravi
>
>> or do I need to back up and restore those myself? Even though the brick doesn't know that it is part of a volume in case it loses the configuration files, both the other server(s) and the client(s) will probably recognize it as being part of the volume. I therefore believe that such self-healing would actually be possible, even though it may not be implemented.
>>
>> Regards
>> Andreas
>>
>> On 10/30/14 05:21, Ravishankar N wrote:
>>> On 10/28/2014 03:58 PM, Andreas Hollaus wrote:
>>>> Hi,
>>>>
>>>> I'm curious about how GlusterFS manages to sync the bricks in the initial phase, when the volume is created or extended.
>>>>
>>>> I first create a volume consisting of only one brick, which clients will start to read and write. After a while I add a second brick to the volume to create a replicated volume.
>>>>
>>>> If this new brick is empty, I guess that files will be copied from the first brick to get the bricks in sync, right?
>>>>
>>>> However, if the second brick is not empty but rather contains a subset of the files on the first brick, I don't see how GlusterFS will solve the problem of syncing the bricks.
>>>>
>>>> I guess that all files which lack extended attributes could be removed in this scenario, because they were created when the disk was not part of a GlusterFS volume. However, in case the brick was used in the volume previously, for instance before that server restarted, there will be extended attributes for the files on the second brick which weren't updated during the downtime (when the volume consisted of only one brick). There could be multiple changes to the files during this time. In this case I don't understand how the extended attributes could be used to determine which of the bricks contains the most recent file.
>>>>
>>>> Can anyone explain how this works? Is it only allowed to add empty bricks to a volume?
>>>>
>>> Only empty bricks may be added to a volume. Writing directly to bricks is not supported; one needs to access the volume from a mount point or via libgfapi.
>>> After adding a brick to increase the distribute count, you need to run the volume rebalance command so that some of the existing files are hashed (moved) to the newly added brick.
>>> After adding a brick to increase the replica count, you need to run the volume heal full command to sync the files from the other replica onto the newly added brick.
>>> https://github.com/gluster/glusterfs/blob/master/doc/features/afr-v1.md will give you an idea of how the replicate translator uses xattrs to keep files in sync.
>>>
>>> HTH,
>>> Ravi
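To double-check my understanding of the add-brick flow described above, this is roughly the command sequence I have in mind (server names, volume name and brick paths are placeholders, not from a real setup):

    # Day 1: only server1 is up; create and start a single-brick volume.
    gluster volume create gv0 server1:/data/brick1/gv0
    gluster volume start gv0

    # Later, when server2 becomes available: extend to replica 2.
    gluster peer probe server2
    gluster volume add-brick gv0 replica 2 server2:/data/brick1/gv0

    # Trigger a full sync onto the new (empty) brick and watch progress.
    gluster volume heal gv0 full
    gluster volume heal gv0 info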