Kevan Benson wrote:
Gordan Bobic wrote:
I suspect this isn't a problem that can be solved without having a
proper journal of metadata per directory, so that upon connection, the
whole journal can be replayed.
You could sort of bodge it and use timestamps as the primary version
and the xattr version as secondary, but that is no less dangerous - it
only takes one machine to be out of sync, and we are again looking at
massive scope for data loss.
You could bodge the bodge further to work around this by having the
nodes heartbeat their current times to each other and refusing any data
exchange until they are in sync. But that then
complicates things because what do you do when a node connects and is
out of sync, but in the future? Who wins on time sync? Who has the
latest authoritative copy?
I think the sanest way of addressing this is to have a full per-directory
metadata journal. But then we are back to the journalling-for-fast-updates
issue with a journal shadow volume, which is non-trivial to implement.
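Roughly what I have in mind, as a toy sketch (the record layout and
names here are made up, not a proposed on-disk format): every metadata
change gets appended to an ordered per-directory log, and a reconnecting
node replays the entries it missed rather than guessing from version
numbers alone.

import json

def append_entry(journal_path, seq, op, name, detail):
    # One line per change, strictly ordered by a per-directory sequence number.
    with open(journal_path, "a") as journal:
        journal.write(json.dumps({"seq": seq, "op": op,
                                  "name": name, "detail": detail}) + "\n")

def replay_since(journal_path, last_seen_seq, apply):
    # On reconnect, re-apply everything after the last entry this node saw.
    with open(journal_path) as journal:
        for line in journal:
            entry = json.loads(line)
            if entry["seq"] > last_seen_seq:
                apply(entry)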
Unless there is some kind of a major mitigating circumstance, it seems
that between this and the race condition that Martin is talking about
on the other thread, GlusterFS in its current form is just too dangerous
to use in most environments that I can think of. And unlike Gareth a
few days ago, I'm not talking about performance issues - I'm talking
about scope for data loss in very valid and very common use cases. :'(
Hmm, what about trusted.glusterfs.createtime (epoch time) as a major
version number, and trusted.glusterfs.version as the minor version
number. Couple that with a glusterfs master time node (defaults to lock
node) and you should have a fairly consistent cluster, right?
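Just to pin down what that comparison amounts to, here is a rough sketch
(Python-ish, not GlusterFS code; the xattr names are the ones you
mention, but the assumption that both are stored as plain decimal
strings is mine):

import os

def version_tuple(path):
    # trusted.glusterfs.createtime as the major number,
    # trusted.glusterfs.version as the minor number.
    major = int(os.getxattr(path, "trusted.glusterfs.createtime").decode())
    minor = int(os.getxattr(path, "trusted.glusterfs.version").decode())
    return (major, minor)

def pick_authoritative(replica_paths):
    # The replica with the newest createtime wins outright; version only
    # breaks ties between copies created at the same second. A node whose
    # clock runs ahead therefore always "wins", even with stale data.
    return max(replica_paths, key=version_tuple)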
There are several problems with this:
1) The concept of the "lock node" is limiting. The locking should be
distributed.
2) Using creation/modification time as the major number is problematic
due to time syncing. What happens when the master node goes offline? If
the nodes are not in perfect time sync, you've still got the same
problem: a node whose clock runs a few seconds fast will always win the
createtime comparison, even if its copy of the data is stale.
3) "fairly consistent" is _really_ not good enough when we are talking
about a file system.
IMO, it would be better to come up with a design that solves the problem
once and for all. The order of priorities really has to be: consistency,
reliability, performance.
If that isn't the case, you might as well be using a distributed hash
table and hoping that you'll get most of the data back most of the time.
Gordan