Re: Rant... WAS: [List-hacking] [bug #25207] an rm of a file should not cause that file to be replicated with afr self-heal.

Anand Babu Periasamy <ab@xxxxxxxxxxxxx> · Mon, 05 Jan 2009 05:23:03 -0800

Swank iest wrote:
> It's a shame zresearch does not care to include the community in design.
> Am I mistakenly under the impression that gluster is an Open Source
> project?

Maybe I miscommunicated. We discuss only the intricate implementation
details in person or over the phone. We do include the community and strongly
value its feedback. If you go through the mailing list and IRC archives you
will see a number of architectural discussions. Even the public roadmap
page has a placeholder for suggestions. The project itself is hosted on
Savannah, and the source is under the GPLv3 license. We are trying our best
with the limited resources we have.

> For instance, you may find there is a large portion of the community who
> will feel that removing the file system's ability to heal itself is a
> bad thing.  I, for example, would find having to manually monitor the
> state of my clustered file system a rather expensive task.
>
> I do, however, appreciate that it is a hard problem to solve.

We are not removing any ability; we are replacing it with a better one.
The self-healing code will be re-implemented in a synchronous model through
an external tool. Currently it is the most complicated and least stable code
inside the file system. Stability is the #1 priority for everyone.

You can launch this tool through a cron job. We are also planning to add
"daemon mode" support to receive notifications for real-time handling of
events (active healing).
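
To make the "synchronous, external tool" idea concrete, here is a minimal
sketch of one healing pass between two replica directories. It is not the
planned glusterfs-heal tool: the function name heal_pass and the idea of
treating two plain directories as replicas are assumptions made only for
illustration. The pass walks one replica and copies any file that is missing
on the other, one file at a time, so it could be run periodically (for
example from a cron job).

```python
import os
import shutil
import tempfile


def heal_pass(source_root, target_root):
    """Copy files present under source_root but missing under target_root.

    Hypothetical helper: runs synchronously, one file at a time, and
    returns the relative paths that were healed.
    """
    healed = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        rel_dir = os.path.relpath(dirpath, source_root)
        target_dir = os.path.normpath(os.path.join(target_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in sorted(filenames):
            target_path = os.path.join(target_dir, name)
            if not os.path.exists(target_path):
                # copy2 also preserves timestamps and permission bits.
                shutil.copy2(os.path.join(dirpath, name), target_path)
                healed.append(os.path.normpath(os.path.join(rel_dir, name)))
    return healed


def demo():
    """Heal an empty replica twice; the second pass should find nothing."""
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        with open(os.path.join(a, "data.txt"), "w") as handle:
            handle.write("hello")
        first = heal_pass(a, b)
        second = heal_pass(a, b)
        return first, second
```

Note that a one-directional pass like this cannot tell a file that was never
replicated from one that was deliberately deleted on the target, which is the
situation described in bug #25207. A real tool would need extra state to
decide which replica is authoritative.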

> I also believe that
>
> 1) Being told that FreeBSD is only supported with version 7.0 and only
> with glusterfs 1.4 (which isn't released) is a bad thing.  Where is the
> stable code base?  Has development stopped on 1.3?  I feel pressure to
> be running 1.4, but it's not released yet.

Yes, only critical bug fixes happen on 1.3.

Release 2.0 (formerly 1.4) should happen this month. It is relatively
more stable than 1.3.

> 2) Being told that 1.4 release candidates are not a good "framework" to
> be solving problems in is scary.  If 1.4 isn't the correct place, where
> is?  Is there a 1.5 that hasn't been made public yet?  Is the AFR
> self-heal code going to be ripped out of 1.4?  When will it be ripped
> out?  I thought there was going to be a 1.4 release soon.  If 1.3 isn't
> stable, and 1.4 isn't a good framework, what should someone use in
> production?  Can only code that has been contracted from zresearch be
> used in production?  How much does this cost?

The self-heal code will not be removed until it is replaced with a better
alternative. The next 2.0 release will still have self-heal turned on by
default. Once we feel that glusterfs-heal is ready, we will turn self-heal
off by default. We will not remove features without discussing it with the
community.

The GlusterFS code is the same for both commercial and gratis users. We do
not hold any code as proprietary. Commercial users pay for the subscription
package, which is support and service for GlusterFS. We deploy, hand-hold
and maintain (similar to Red Hat, except that we don't restrict
redistribution of binaries).

> 3) Talking about features in a public forum may lead to a better end
> result.  For instance it may lead to feedback such as:

We always do that. The healing tool is already on the roadmap. It was
not supposed to be introduced in this release, but we are planning to
make it available as part of a 2.0.x minor release instead of waiting
for 2.1.

This discussion came up because you requested an optimization that
requires a hacky implementation. I won't complicate the current self-heal
design any further. The optimization is easily achievable with the new
design.

> AFR is broken in a number of ways right now:
>
> 1) AFR blocks on self-heal.  ls -lR will not return until the heal is
> complete.  On large directories, this will make many applications break
> in wonderfully weird ways.  I'm imagining users of web applications that
> have files backed on gluster clicking refresh for 30 minutes.
>
> 2) AFR self-heal is incredibly slow.  I have tracked this down to the
> use of 4 KB "chunks" being sent at a time.  The explanation for this is
> to allow "sparse file replication".  However, the additional TCP overhead
> that using such small chunks causes means that self-heal will run at
> speeds less than 1 MB/s in my environment (I'm attempting to run gluster
> over a VPN between data centers.)  I believe that the TCP chunk should
> be tied to the TCP window size.  I have set the 4 KB size to 131072 in my
> environment to get things to work a bit better (however, without
> aggregation of small files, there is still an unnecessary amount of TCP
> overhead which causes small files to be replicated really slowly.)

This was one of the reasons to implement a healing tool. It gives more
control to the user. Currently it is hard for the user to track when and
how healing happens.

The 4 KB chunk healing is fixable. I will look into it.
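
As a rough illustration of why the chunk size matters over a high-latency
link, the sketch below computes an upper bound on throughput under one
assumption that is not confirmed anywhere in this thread: that the heal waits
for each chunk to complete a full round trip before sending the next one. The
10 ms round-trip time is likewise an assumed figure, not a measurement from
the reported environment. The two chunk sizes are the 4 KB default and the
131072-byte value mentioned above.

```python
def stop_and_wait_throughput(chunk_bytes, rtt_ms):
    """Upper bound in bytes/second when one chunk is sent per round trip.

    Assumes strictly one outstanding chunk at a time (hypothetical model).
    """
    return chunk_bytes * 1000 / rtt_ms


ASSUMED_RTT_MS = 10  # assumed VPN round-trip time, for illustration only

for chunk in (4096, 131072):
    rate = stop_and_wait_throughput(chunk, ASSUMED_RTT_MS)
    print(f"{chunk:>6} byte chunks: at most {rate / 1e6:.2f} MB/s")
```

Under these assumptions the 4 KB case is capped at roughly 0.41 MB/s and the
131072-byte case at roughly 13.1 MB/s. The real limit depends on how the
transfer is actually pipelined, so treat the numbers as a way to reason about
the trend rather than as a prediction.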

I really appreciate your feedback and in-depth details. Your bug reports
are also very useful. The more you contribute, the more attention you will
gain :).

> 3) AFR only lists files that exist on the first brick listed in the AFR
> configuration.  This can lead to really awkward situations where a file
> doesn't exist on the first brick but does on subsequent bricks.  Now,
> it has been explained to me that this occurs because AFR does not require
> a metadata server.  In fact, this was one of the draws of gluster to me
> (not having to find some way to make the metadata server highly
> available.)  I did not understand (from any of the documentation
> available) that it's not that gluster doesn't *require* a metadata
> server, it's that it doesn't solve the namespace problem at all.

AFR uses a two-phase commit for atomic write operations. For read operations
it load-balances across the volumes. What you are asking for is an atomic
read/readdir from all the volumes, with a check that the results are the
same. We can implement that, but it will impact performance.

GlusterFS does not have a metadata server, even for file-level (distributed
hash) or block-level (stripe) distribution.
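
The trade-off can be sketched with plain sets. The function below is a
hypothetical illustration, not AFR code: it takes one directory listing per
brick (in configuration order), reports which names would be invisible if
only the first brick were consulted, and says whether all bricks agree.
Producing those listings is where the cost lies, since it needs one readdir
per brick instead of one in total.

```python
def compare_listings(listings):
    """Compare per-brick directory listings.

    listings: a sequence of iterables of file names, one per brick, in
    configuration order (hypothetical input format).
    Returns (names missing from the first brick, whether all bricks agree).
    """
    first = set(listings[0])
    union = set().union(*listings)
    hidden = sorted(union - first)
    consistent = all(set(listing) == union for listing in listings)
    return hidden, consistent


# Example: "b.txt" exists only on the second brick.
print(compare_listings([{"a.txt"}, {"a.txt", "b.txt"}]))
```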

> 4) AFR does not work reliably above unify or DHT.  It crashes a lot.
> Now, I can understand that gluster was not designed to operate in this
> fashion, however, I cannot think of any other way to put live data into
> a gluster file system.  (read this as, it would not be my final config,
> but without having real-time replication of data into my "proper"
> config... I would need to turn off live servers for days if not weeks to
> move the data around by hand.  If I were to move data around by hand,
> why would I need a replicated file system?)  If it were the case that
> gluster is not designed to solve these problems, perhaps that should be
> listed in the documentation somewhere rather than instructions on how to
> do it (perhaps this is already the case with the 1.4 documentation?).
> Preferably, we could just fix the problems that cause it to not be
> possible.

AFR is very much intended to work with DHT or Unify. We will look into
your bug reports.

> Now it's really naive of me to even attempt a design of a working
> system, but if I were to try...
>
> I would break AFR into three code paths
>
> 1) WRITE
>
> on write, files are written to all available bricks.  Bricks that are
> not available are queued until they become available again.
>
> 2) READ
>
> on read, lookups happen on all bricks.  If a file doesn't exist on a
> particular brick, it is added to the queue for replication.  The file is
> returned from a valid brick.  (this is complicated by a server not being
> available when a delete occurs.  If, after it comes back up following a
> deletion, that file is requested, that file would be replicated
> again.)  This would, of course, not scale linearly with the addition of
> bricks.
>
> 3) REPLICATION
>
> Process the queue.  I don't know where this queue should exist.  But
> replication ought to occur in its own thread/process independent of
> read/write.  Somewhere into this could be added code to "balance" files
> across bricks (should a certain number of bricks only be required for a
> file.  example: 5 bricks, but only two bricks require the file.)
>
> </rant>
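
The three code paths proposed above can be sketched as a small in-memory
model. Everything here is hypothetical: the Brick and QueuedReplica classes,
the use of dictionaries as storage, and the single in-process queue are
assumptions chosen to keep the example short. It models a single client only,
so it says nothing about coherency between multiple writers.

```python
from collections import deque


class Brick:
    """A stand-in for one storage brick (hypothetical, in-memory)."""

    def __init__(self, name):
        self.name = name
        self.files = {}
        self.available = True


class QueuedReplica:
    """Write to available bricks; queue work for the rest."""

    def __init__(self, bricks):
        self.bricks = bricks
        self.pending = deque()  # (brick, filename) pairs awaiting a copy

    def write(self, filename, data):
        # 1) WRITE: store on every available brick, queue the others.
        for brick in self.bricks:
            if brick.available:
                brick.files[filename] = data
            else:
                self.pending.append((brick, filename))

    def read(self, filename):
        # 2) READ: look up on all available bricks, queue any that lack it.
        holders = [b for b in self.bricks
                   if b.available and filename in b.files]
        if not holders:
            raise FileNotFoundError(filename)
        for brick in self.bricks:
            if brick.available and filename not in brick.files:
                self.pending.append((brick, filename))
        return holders[0].files[filename]

    def replicate(self):
        # 3) REPLICATION: drain the queue, keeping entries for bricks
        # that are still unavailable. Returns the number of copies made.
        still_pending = deque()
        copied = 0
        while self.pending:
            brick, filename = self.pending.popleft()
            if not brick.available:
                still_pending.append((brick, filename))
                continue
            source = next((b for b in self.bricks
                           if b.available and b is not brick
                           and filename in b.files), None)
            if source is not None:
                brick.files[filename] = source.files[filename]
                copied += 1
        self.pending = still_pending
        return copied
```

In this model replicate() would be called from its own thread or process, as
the proposal suggests. The read path also reproduces the complication noted
in the proposal: a brick that missed a delete would still hold the file, and
a later read would queue it for replication again.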

Queuing of writes from multiple clients has a lot of coherency issues. It is
a complicated design. We have thought of implementing a spare volume concept
for this purpose. I will discuss it with you when the time is right.

> Is there an automated build process for arch somewhere?  If not, I would
> be willing to build one for the project so that developers would be
> warned of build errors as were introduced and fixed for FreeBSD
> recently.  It would be a convenient place to add unit tests as well.
>
> Christopher Owen.

An automated build for FreeBSD? We don't even have an in-house FreeBSD
server. It would be a big help for us.

Thanks a lot. Happy Hacking!
--
Anand Babu

 > Date: Mon, 5 Jan 2009 02:30:29 -0800
 > From: ab@xxxxxxxxxxxxx
 > To: swankier@xxxxxxx
 > CC: list-hacking@xxxxxxxxxxxxx; gluster-users@xxxxxxxxxxx;
 >     Gluster-devel@xxxxxxxxxx
 > Subject: Re: [List-hacking] [bug #25207] an rm of a file should not
 >     cause that file to be replicated with afr self-heal.
 >
 > Christopher, the main issue with self-heal is its complexity. Handling
 > self-healing logic in a non-blocking asynchronous code path is difficult.
 > Replicating a missing file sounds simple, but holding off a lookup call,
 > initiating a new series of calls to heal the file and then resuming normal
 > operation is tricky. Many of the bugs we faced in 1.3 are related to
 > self-heal. We have handled most of these cases over a period of time.
 > Self-healing is decent now, but not good enough. We feel that it has only
 > complicated the code base. It is hard to test and maintain this part of
 > the code base.
 >
 > The plan is to drop the self-heal code altogether once the active healing
 > tool is ready. Unlike self-healing, this active healing can be run by the
 > user on a mounted file system (online) at any time. By moving the code out
 > of the file system and into a tool (that is synchronous and linear), we can
 > implement sophisticated healing techniques.
 >
 > The code is not in the repository yet. Hopefully it will be ready for use
 > in a month. You can simply turn off self-heal and run this utility while
 > the file system is mounted.
 >
 > List-hacking is an internal company list, mostly junk :). We don't discuss
 > technical / architectural stuff there. That is mostly done over the phone
 > and in in-person meetings. We do want to actively involve the community
 > right from the design phase. A mailing list is cumbersome and slow for
 > interactively brainstorming designs. We can organize IRC sessions once in
 > a while for this purpose.
 >
 > --
 > Anand Babu
 >
 > Swank iest wrote:
 >> Well,
 >>
 >> I guess this is getting outside of the bug. I suppose you are going to
 >> mark it as not going to fix?
 >>
 >> I'm trying to put gluster into production right now, so may I ask:
 >>
 >> 1) What are the current issues with self-heal that require a full
 >> re-write? Is there a place in the Wiki or elsewhere where it's being
 >> documented?
 >> 2) May I see the new code? I must not be looking in the correct place
 >> in TLA?
 >> 3) If it's not written yet, may I be included in the design discussion?
 >> (As I haven't put gluster into production yet, now would be a good time
 >> to know if it's not going to work in the near future.)
 >> 4) May I be placed on the list-hacking@xxxxxxxxxxxxx mailing list,
 >> please?
 >>
 >> Christopher.
 >>
 >> > Date: Mon, 5 Jan 2009 01:36:14 -0800
 >> > From: ab@xxxxxxxxxxxxx
 >> > To: krishna@xxxxxxxxxxxxx
 >> > CC: swankier@xxxxxxx; list-hacking@xxxxxxxxxxxxx
 >> > Subject: Re: [List-hacking] [bug #25207] an rm of a file should not
 >> >     cause that file to be replicated with afr self-heal.
 >> >
 >> > Krishna, leave it as is. Once self-heal ensures that the volumes are
 >> > intact, rm will remove both copies anyway. It is inefficient, but
 >> > optimizing it in the current framework will be hacky.
 >> >
 >> > Swankier, we are replacing the current self-healing framework with an
 >> > active healing tool. We can take care of it then.
 >> >
 >> >
 >> > Krishna Srinivas wrote:
 >> >> The current self-heal logic is built into the lookup of a file, and a
 >> >> lookup is issued just before any operation on a file. So the lookup
 >> >> call does not know whether an open or an rm is going to be done on
 >> >> the file. We will get back to you if we can do anything about this,
 >> >> i.e. to avoid the redundant copy of the file when it is going to be
 >> >> rm'ed.
 >> >>
 >> >> Krishna
 >> >>
 >> >> On Mon, Jan 5, 2009 at 12:19 PM, swankier <INVALID.NOREPLY@xxxxxxx> wrote:
 >> >>> Follow-up Comment #2, bug #25207 (project gluster):
 >> >>>
 >> >>> I am:
 >> >>>
 >> >>> 1) delete file from posix system beneath afr on one side
 >> >>> 2) run rm on gluster file system
 >> >>>
 >> >>> file is then replicated followed by deletion
 >> >>>
 >> >>> _______________________________________________________
 >> >>>
 >> >>> Reply to this item at:
 >> >>>
 >> >>> <http://savannah.nongnu.org/bugs/?25207>
 >> >
 >> > --
 >> > Anand Babu Periasamy
 >> > GPG Key ID: 0x62E15A31
 >> > Blog [http://ab.freeshell.org]
 >> > GlusterFS [http://www.gluster.org]
 >> > The GNU Operating System [http://www.gnu.org]
 >> >
 >>
 >
 > --
 > Anand Babu Periasamy
 > GPG Key ID: 0x62E15A31
 > Blog [http://ab.freeshell.org]
 > GlusterFS [http://www.gluster.org]
 > The GNU Operating System [http://www.gnu.org]
 >

--
Anand Babu Periasamy
GPG Key ID: 0x62E15A31
Blog [http://ab.freeshell.org]
GlusterFS [http://www.gluster.org]
The GNU Operating System [http://www.gnu.org]