On Wed, 7 May 2008, Krishna Srinivas wrote:
I suspect this isn't a problem that can be solved without having a
proper journal of metadata per directory, so that upon connection,
the whole journal can be replayed.
You could sort of bodge it and use timestamps as the primary version and
the xattr version as secondary, but that is no less dangerous - it
only takes one machine to be out of sync, and we are again looking at
massive scope for data loss.
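
To make the failure mode concrete, here is a minimal sketch of the
timestamp-primary, xattr-version-secondary comparison. Everything in it
(the Copy class, pick_latest, the node names and the numbers) is
hypothetical and only illustrates the ordering, it is not GlusterFS code.
Node B's clock is assumed to run one hour fast.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    node: str
    mtime: int    # seconds since epoch, stamped by the node's own clock
    version: int  # value of the per-file version xattr
    data: str


def pick_latest(copies):
    # Timestamp is the primary key, xattr version the secondary key.
    return max(copies, key=lambda c: (c.mtime, c.version))


BASE = 1_210_150_000  # arbitrary epoch time, for illustration only

# Node B's clock runs one hour fast; it holds an older write (version 3).
stale = Copy("B", mtime=BASE + 3600, version=3, data="old")
# Node A's clock is correct; it holds the newer write (version 5).
fresh = Copy("A", mtime=BASE + 600, version=5, data="new")

winner = pick_latest([stale, fresh])
print(winner.node, winner.data)  # prints "B old"
```

The copy from the node with the fast clock wins even though its xattr
version is lower, so the newer data on A would be overwritten.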
You could bodge the bodge further to work around this by ensuring that
the nodes are heartbeating current times to sync between them and
without the sync no data exchange takes place. But that then
complicates things because what do you do when a node connects and is
out of sync, but in the future? Who wins on time sync? Who has the
latest authoritative copy?
I think the most sane way of addressing this is to have a fully logged
directory metadata journal. But then we are back to the journalling
for fast updates issue with a journal shadow volume, which is
non-trivial to implement.
Unless there is some kind of a major mitigating circumstance, it seems
that between this and the race condition that Martin is talking about
on the other thread, GlusterFS in its current state is just too dangerous
to use in most environments that I can think of. And unlike Gareth a
few days ago, I'm not talking about performance issues - I'm talking
about scope for data loss in very valid and very common use cases.
:'(
Hmm, what about trusted.glusterfs.createtime (epoch time) as a major
version number, and trusted.glusterfs.version as the minor version
number? Couple that with a glusterfs master time node (defaults to
lock node) and you should have a fairly consistent cluster, right?
There are several problems with this:
1) The concept of the "lock node" is limiting. The locking should be
distributed.
2) Using creation/modification time as the major number is problematic due
to time syncing. What happens when the master node goes offline? If the
nodes are not in perfect time sync, you've still got the same problem.
Correct, if the machines running AFR are not time-synced, it can cause problems.
We were thinking of using the parent directory's version as the file's
createtime attribute. We increment the parent dir version first, then
create the file and apply the parent's version as the file's createtime.
Any thought on this?
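
A rough sketch of that create sequence, using an in-memory stand-in
rather than real xattrs. The Directory class and its create method are
hypothetical names, and starting the file version at 1 is an assumption;
only the two attribute names come from this thread.

```python
CREATETIME = "trusted.glusterfs.createtime"
VERSION = "trusted.glusterfs.version"


class Directory:
    """In-memory stand-in for a directory and the xattrs of its entries."""

    def __init__(self):
        self.version = 0   # directory version
        self.entries = {}  # file name -> {xattr name: value}

    def create(self, name):
        # Step 1: increment the parent directory version first.
        self.version += 1
        # Step 2: create the file and stamp it with that directory version.
        # Starting the file version at 1 is an assumption.
        self.entries[name] = {CREATETIME: self.version, VERSION: 1}
        return self.entries[name]


d = Directory()
first = d.create("a.txt")
second = d.create("b.txt")
print(first[CREATETIME], second[CREATETIME], d.version)  # prints "1 2 2"
```

No wall-clock time is involved, so the createtime values are ordered by
the directory's own counter rather than by any node's clock.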
Hmm... File-replacement-clobber issue only arises when a file is
removed and another file with the same name created. It's creation version
that's the problem. So yes, I think bumping up directory version on every
file (a subdirectory also being a file) create/delete operation, and using
the directory version at creation time as the major version number, with
standard file version number as it is at the moment being the minor
version number would work.
Good idea, and considerably simpler than what I was thinking about with
the journalling. :-)
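
For illustration, a small sketch of how the (major, minor) comparison
would handle the delete-and-recreate case. FileVersion and the numbers
are made up for the example, and it assumes that both delete and create
bump the directory version.

```python
from typing import NamedTuple


class FileVersion(NamedTuple):
    major: int  # parent directory version at the time the file was created
    minor: int  # ordinary per-file version, bumped on each write


# Both replicas start with the same file: created at directory version 1,
# then written until its file version reached 7. Node B then drops off.
on_b = FileVersion(major=1, minor=7)

# While B is away, the file is deleted and recreated on node A.
dir_version = 1
dir_version += 1  # delete bumps the directory version
dir_version += 1  # create bumps it again
on_a = FileVersion(major=dir_version, minor=1)
on_a = on_a._replace(minor=on_a.minor + 1)  # one write to the new file

# Comparing the file version alone picks B's stale copy.
minor_only_winner = "A" if on_a.minor > on_b.minor else "B"
# Comparing (major, minor) as a tuple picks A's replacement file.
winner = "A" if on_a > on_b else "B"
print(minor_only_winner, winner)  # prints "B A"
```

With the file version alone, the stale copy on B (7) beats the new file
on A (2). With the directory version as the major number, A's (3, 2)
beats B's (1, 7).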
Gordan