Hello GlusterFS developers.
We have found that the current DHT translator is suboptimal: the number
of files being moved during re-balancing (caused by small changes in
the set of bricks) can be significantly reduced (see Appendix).
To be precise, we can improve scalability: in the current DHT the amount
of re-balancing work scales as O(M) (M is the total number of files in
the compound volume), whereas after changing the hashing technique it
will scale as O(M/N) (N is the number of bricks).
In the document below we first consider simple tables (section 1) and
estimate the minimal amount of rebalance work for them. Then we extend
them with the techniques of virtual nodes (section 2) and replication
(section 3), and show that this doesn't worsen scalability.
Unfortunately it is impossible to perform these improvements without a
format change, so it would be a new translator, which won't understand
layouts created by the current DHT (and vice versa).
We will be happy to see any feedback on this, and if everything is
OK, to proceed with development in this direction (implementation
details, etc).
Thanks,
Edward.
Legend:
C - namespace;
R - 64-bit ring;
N - number of real bricks that compose the volume;
M - number of files in the compound volume;
S - number of virtual components of any real brick;
R - size of preference set (replication level);
1. Simple DH tables based on consistent hashing
We consider a 64-bit ring R, i.e. a regular 2^64-gon with vertices
0, ..., 2^64 - 1. "Ring" means that for every vertex A there exist
vertices B, C of R such that C < A < B. Here "<" and "<=" denote
relations between the respective angles (for any vertex Z we consider
the angle between O0 and OZ, where O is the center of the polygon).
Then we consider a namespace C and any mapping phi: C -> R, such that
for every sequence {c1, c2, ...} of distinct names, {phi(c1), phi(c2),
...} is a pseudo-random variable with uniform distribution on R
(see 0*).
Suppose we have a volume composed of N bricks with unique names
B_1, B_2, ..., B_N. During system initialization we construct for this
compound volume N respective unique tokens phi(B_1), ..., phi(B_N)
in the ring R, caching them, say, in an rb-tree to preserve ordering.
When creating a directory (mkdir(2)) we create respective directories
on all bricks B_1, B_2, ... B_N.
When creating a regular file (creat(2), etc) with a short name F,
we create a respective file on only one brick B = B(F), which is
determined in the following way: phi(B) is the minimal token in the
ring such that phi(F) <= phi(B). This is where the notion of ring
comes into play: if there is no such token in [phi(F), 2^64 - 1], then
we continue the search from vertex 0.
Lookup operation for any regular file F resolves to lookup(F) on the
respective brick B(F).
Deleting a regular file F resolves to deleting a file from the brick
B(F).
Looking up a brick (i.e. the calculation F -> B(F)) is a well-scalable
operation: log(N) steps are required.
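For illustration, here is a minimal sketch of this lookup in Python
(not translator code; the Ring/brick_of/phi names are ours, and phi is
md5 truncated to 64 bits, as in the Appendix): tokens are kept sorted,
binary search gives the log(N) lookup, and falling off the end of the
token array wraps around to vertex 0.

import bisect, hashlib

def phi(name):
    # map a name to a vertex of the 64-bit ring (md5 truncated to 64 bits)
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:8], 'big')

class Ring:
    def __init__(self, bricks):
        # one token per brick (S == 1), kept sorted for binary search
        self.tokens = sorted((phi(b), b) for b in bricks)

    def brick_of(self, fname):
        # B(F): the brick with the minimal token such that phi(F) <= token
        i = bisect.bisect_left(self.tokens, (phi(fname), ''))
        return self.tokens[i % len(self.tokens)][1]  # wrap around vertex 0

ring = Ring(['host:/root/exp%d' % i for i in range(9)])
print(ring.brick_of('file00'))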
When adding a new brick X = B_(N+1) we set a new unique token phi(X)
on the ring R and move a subset of files from B_j to X, where B_j is
the brick with the smallest phi(B_j) such that phi(X) < phi(B_j).
Namely, every file W of B_j with phi(W) <= phi(X) should be moved to X.
That said, the number of files to be moved during re-balancing is not
larger than the number of files contained in one brick (B_j in our case).
When removing a brick Y = B_k, we first find in R the "closest" brick
B_s, i.e. the one with minimal phi(B_s) such that phi(Y) < phi(B_s),
and move all files from Y to B_s (no scans are required).
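The rule "only the files that hash into the interval owned by the new
token move" can be seen in the following toy example (the integer
tokens and hashes below are made up for the example, instead of md5
values on the full 2^64 ring):

# Toy ring with explicit integer tokens/hashes, to show which files
# move when a new token is added.
import bisect

tokens = [(10, 'B1'), (40, 'B2'), (70, 'B3')]          # sorted (token, brick)
files  = {'a': 5, 'b': 25, 'c': 35, 'd': 55, 'e': 90}  # name -> hash

def brick_of(h):
    i = bisect.bisect_left(tokens, (h, ''))
    return tokens[i % len(tokens)][1]

before = {f: brick_of(h) for f, h in files.items()}
bisect.insort(tokens, (30, 'X'))                       # add brick X, token 30
after = {f: brick_of(h) for f, h in files.items()}

moved = [f for f in files if before[f] != after[f]]
print(moved)   # only 'b' moves: it lived on B2 and hashes into (10, 30]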
Such a hashing technique is called "consistent hashing associated with
the variable phi, which has uniform distribution". This is a
relatively new technique suggested by Karger et al (1*). The main
advantage of this technique is that small changes in the set of bricks
result in a small amount of rebalancing work. Namely, adding/removing 1
brick results in moving only M/N files (2*) (M is the total number of
files in the compound volume). This is M/(M/N) = N times better than
with traditional hashing, where we need to move all M files (median
value). In particular, if we have 100 bricks, then with traditional
hashing we'll need to move 100 times more files than with the
consistent one.
To construct a uniform distribution phi we can use any well-mixing
64-bit hash, say fnv-hash, etc.
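As a sketch, 64-bit FNV-1a could serve as such a phi; the constants
below are the standard FNV-1a offset basis and prime, and any other
well-mixing 64-bit hash would do just as well:

# One possible phi: 64-bit FNV-1a.
FNV64_OFFSET = 0xcbf29ce484222325
FNV64_PRIME  = 0x100000001b3

def phi_fnv(name):
    h = FNV64_OFFSET
    for byte in name.encode():
        h ^= byte
        h = (h * FNV64_PRIME) & 0xffffffffffffffff   # stay on the 2^64 ring
    return h

print(hex(phi_fnv('file00')))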
Comment 1. The technique of consistent hashing is used in Amazon's
Dynamo (4*).
Comment 2. There is a disadvantage: in this approach all files
/foo
/dir1/foo
/dir1/dir2/foo
...
will be accumulated on the same brick. However, it is possible to
"salt" short file names with the gfid (or another id) of the respective
directory before hashing, to avoid this (and possible attacks).
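A possible sketch of such salting (the gfid strings below are made-up
placeholders, used only to show that the same short name lands on
different points of the ring for different parent directories):

import hashlib

def phi_salted(parent_gfid, short_name):
    # salt the short name with the parent directory's id before hashing,
    # so that /foo, /dir1/foo, /dir1/dir2/foo hash to different points
    data = parent_gfid.encode() + b'/' + short_name.encode()
    return int.from_bytes(hashlib.md5(data).digest()[:8], 'big')

# made-up gfid placeholders
print(hex(phi_salted('gfid-of-root', 'foo')))
print(hex(phi_salted('gfid-of-dir1', 'foo')))   # same name, different point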
2. Virtual nodes
The theory above works well for a large number of bricks N. However,
when N is too small (2-3 bricks), the uniform distribution can result
in bad partitioning, so that one brick accumulates many more files than
the other ones, which is not good. A graceful solution of this problem
is the so-called technique of "virtual nodes": for every brick we set S
tokens on the ring, where S is the number of "virtual components" of a
brick. So, every brick is represented by S unique tokens on the ring
(S >= 1, S is a parameter of the cluster translator).
In the case of S > 1 the lookup-a-brick procedure above is not changed:
the difference is that we search in a larger set of tokens (N*S), and
since log(N*S) == log(N) + log(S) == log(N) + const, this search also
scales as log(N), while with a larger number of tokens the
distribution of files becomes more balanced (in terms of the standard
deviation, see (3*) and the Appendix for details). In particular, S=100
provides a deviation of ~10% of the mean.
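A sketch of token generation with virtual components: each brick
contributes S tokens, here obtained by hashing "brickname#i" (the "#i"
suffix is just one possible convention, not a fixed format); the lookup
itself is unchanged, it only searches among N*S tokens.

import bisect, hashlib

def phi(name):
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:8], 'big')

def build_tokens(bricks, S):
    # S tokens per brick: hash "<brickname>#<i>" for i = 0..S-1
    return sorted((phi('%s#%d' % (b, i)), b) for b in bricks for i in range(S))

def brick_of(tokens, fname):
    i = bisect.bisect_left(tokens, (phi(fname), ''))
    return tokens[i % len(tokens)][1]

tokens = build_tokens(['host:/root/exp%d' % i for i in range(9)], S=7)
print(brick_of(tokens, 'file00'))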
Adding a new brick with S > 1 looks like the above, with the difference
that we steal files from S different old virtual bricks. Note that
2 or more of those virtual bricks can represent the same real brick
though. So adding a real brick with S virtual components requires
(M/N)*S scans; however, the median number of files to be moved during
re-balancing is the same (M/N) as in the case of S == 1.
Removing a brick with S > 1 virtual components mostly looks like the
case of S == 1: no scans are required. The difference is that we
distribute the files of the brick being removed among S virtual bricks
(which correspond to <= S real bricks).
3. Replication and Recovery
To achieve high availability and durability we replicate files on
multiple bricks. In our case replication can be implemented as a set
of operations with the same ring R, so we don't create a separate
translator for replication.
We store every file in its so-called preference set of real bricks.
The size R of this set is a volume option. R is also called the
replication level (R = 1 means no replication: every file is stored
on a single brick only).
For every file F its preference set is defined as the set of "closest"
virtual bricks B_(k_1), ..., B_(k_R), which represent pairwise
different real bricks, such that B_(k_1) = B(F) and
phi(B_(k_1)) < phi(B_(k_2)) < ... < phi(B_(k_R)).
We don't create 2 replicas of the same file on the same real brick,
so R shouldn't be larger than N.
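A sketch of computing the preference set: walk the ring clockwise from
B(F) and collect the first R tokens that belong to pairwise different
real bricks (function and variable names below are illustrative only):

import bisect, hashlib

def phi(name):
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:8], 'big')

def preference_set(tokens, fname, R):
    # tokens: sorted list of (token, real_brick) pairs, possibly S per brick
    start = bisect.bisect_left(tokens, (phi(fname), ''))
    bricks = []
    for k in range(len(tokens)):
        b = tokens[(start + k) % len(tokens)][1]
        if b not in bricks:          # skip tokens of already collected bricks
            bricks.append(b)
        if len(bricks) == R:
            break
    return bricks                    # bricks[0] == B(F)

tokens = sorted((phi('%s#%d' % (b, i)), b)
                for b in ['exp0', 'exp1', 'exp2', 'exp3'] for i in range(7))
print(preference_set(tokens, 'file00', R=3))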
If we enable replication (R > 1), regular file operations become more
complicated: every such operation is performed on all respective
files located on all bricks of the preference set.
Operations on the set of bricks also become more complicated, but
scalability doesn't suffer. When adding a new brick X = B_(N+1), we:
0) set a unique token phi(X) on the ring R;
1) find the R closest tokens B_(m_1), ..., B_(m_R), which represent
   different real bricks in the ring, such that B_(m_R) == X
   and phi(B_(m_1)) < ... < phi(B_(m_R));
2) find the R+1 closest tokens B_(p_0), ..., B_(p_R), which represent
   different real bricks in the ring, such that B_(p_0) == X
   and phi(B_(p_0)) < ... < phi(B_(p_R));
3) the new brick X steals a portion of the files of B_(p_1), as
   described in section 1 above;
4) the brick B_(p_R) no longer belongs to the preference set of the
   stolen files, so we need to remove all the respective replicas
   from B_(p_R);
5) X becomes a brick belonging to the preference sets of the files
   stored on the bricks B_(m_1), ..., B_(m_(R-1)), hence we should
   create the respective replicas on X.
So adding a new brick with replication level R > 1 results in
a) moving a portion of the files of one brick (step 3 above);
b) replicating files located on R-1 bricks (step 5 above);
c) deleting replicas of a portion of the files of one brick (step 4).
(a), (b), (c) above can be considered as re-balancing of (R+1)*(M/N) =
const*(M/N) files (when R == 1, (b) and (c) are absent, and we need
to re-balance M/N files, as shown in section 1 above).
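A toy check of this estimate (illustrative only, not the rebalancing
procedure itself): it compares the preference sets of all files before
and after adding one brick and counts how many files are affected; the
brick and file names are made up to mimic the Appendix setup.

import bisect, hashlib

def phi(name):
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:8], 'big')

def build(bricks, S=7):
    return sorted((phi('%s#%d' % (b, i)), b) for b in bricks for i in range(S))

def pref(tokens, fname, R):
    # first R pairwise different real bricks clockwise from B(F)
    start = bisect.bisect_left(tokens, (phi(fname), ''))
    out = []
    for k in range(len(tokens)):
        b = tokens[(start + k) % len(tokens)][1]
        if b not in out:
            out.append(b)
        if len(out) == R:
            break
    return set(out)

files = ['file%02d' % i for i in range(100)]
old = build(['exp%d' % i for i in range(9)])
new = build(['exp%d' % i for i in range(10)])   # add exp9
changed = sum(1 for f in files if pref(old, f, R=3) != pref(new, f, R=3))
print(changed, 'of', len(files), 'preference sets changed')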
Similarly we can show that with replication level R, removing one
brick also leads to re-balancing of const*(M/N) files.
If in our configuration L < R bricks don't respond for some reason,
then all regular file operations are still defined; however, our system
is marked as "unhealthy" (in some papers this state is called "sloppy
quorum"), the non-responding bricks are marked as "missed" in the ring,
and file operations are performed on the other available bricks of the
preference set. In such operations the files on the available "non-
primary" bricks are marked as "not synced with the missed replicas".
In the state of sloppy quorum, operations like adding/removing a node
can already be undefined. For example, when adding a brick we would
need to steal files from a brick which doesn't respond.
So we need to return the system to a "consistent" state, in which all
operations are defined. This can be done in the following ways:
1) Make sure that all missed bricks are available again and perform
   L1-recovery. L1-recovery means syncing all marked files with the
   bricks that are available again, so that the resulting consistent
   system will have the same number N of bricks.
2) Add new empty bricks instead of the missed ones and perform
   L2-recovery. It means filling the new empty bricks with files from
   other bricks, so that the resulting consistent system will have the
   same number N of bricks.
3) Remove the missed bricks from the ring and perform L3-recovery, so
   that the resulting consistent system will have a smaller number of
   nodes (N-L).
Comment 1. For any missed brick we can specify a different type of
recovery.
Comment 2. When R == N, replication "clogs" the distribution: in this
case our system will work like a set of mirrors: every brick will
contain the same set of files.
APPENDIX
----------------------------------------------------------------------
For 3 distributed hash tables with different hashing techniques
. GlusterFS DHT translator (3.2.5)
. 64-bit ring with phi based on md5, R=1 (no replication), S=7
. 64-bit ring with phi based on md5, R=1 (no replication), S=20
we run the same scenario:
1) Create 100 files ("file00", "file01", ..., "file99") in a volume
composed of 9 bricks:
"host:/root/exp0",
"host:/root/exp1",
...
"host:/root/exp8".
2) Add one brick "host:/root/exp9";
3) re-balance;
I. GlusterFS DHT translator (3.2.5)
----------------------------------------------------------------------
before re-balancing:
exp0: file15 file22 file34 file35 file51 file6 file68 file78
file8 file81 file89 file94 file95
exp1: file10 file28 file3 file4 file43 file66 file75
exp2: file40 file46 file47 file48 file50 file58 file86 file9
exp3: file12 file13 file32 file37 file54 file55 file7 file71
file91
exp4: file31 file38 file41 file42 file53 file62 file63 file69
file93 file97
exp5: file11 file16 file17 file24 file25 file27 file29 file44
file56 file73 file74 file80 file87 file90
exp6: file0 file1 file2 file33 file36 file49 file57 file59
file64 file77 file79 file84 file85 file88 file98
exp7: file21 file26 file39 file52 file61 file70 file72 file83
file92 file99
exp8: file14 file20 file23 file30 file45 file5 file60 file65
file67 file76 file82 file96
after re-balancing:
exp0: file11 file16 file17 file24 file25 file31 file44 file53
file62 file69 file73 file80 file87 file93 file97
exp1: file0 file27 file29 file33 file36 file56 file57 file64
file74 file77 file84 file88 file90 file98
exp2: file1 file2 file39 file49 file59 file72 file79 file85
file92
exp3: file21 file26 file30 file45 file52 file60 file61 file65
file70 file83 file99
exp4: file14 file20 file23 file5 file67 file76 file82 file96
exp5: file15 file22 file34 file35 file51 file6 file68 file78
file8 file81 file89 file94 file95
exp6: file10 file28 file4 file43 file66 file75
exp7: file3 file40 file47 file58 file9
exp8: file12 file13 file32 file37 file46 file48 file50 file7
file71 file86
exp9: file38 file41 file42 file54 file55 file63 file91
as a result, 98 files have been rebalanced (total files scanned: 139)
II. 64-bit ring with phi based on md5.
Every brick has number of virtual components S=7
----------------------------------------------------------------------
before re-balancing:
exp0: file02 file18 file22 file42 file48 file58 file62 file70
exp1: file01 file08 file15 file23 file33 file51 file64 file75
file82 file85 file86 file87 file95
exp2: file00 file10 file11 file14 file25 file29 file40 file45
file63 file81 file91 file96
exp3: file09 file16 file19 file21 file24 file28 file32 file35
file36 file44 file47 file50 file52 file57 file67 file73
file88 file98
exp4: file27 file49 file53 file55 file69 file97
exp5: file05 file20 file43
exp6: file34 file68 file72 file74 file79
exp7: file03 file04 file26 file39 file41 file54 file60 file71
file77 file78 file83 file84 file89 file93 file94
exp8: file06 file07 file12 file17 file30 file31 file37 file38
file46 file56 file59 file61 file65 file66 file76 file80
file90 file92 file99
after re-balancing:
only the following files have been moved (to exp9):
exp0: file70
exp1: file82 file85
exp2: file00 file14 file25 file40 file45 file91 file96
exp3: file88
as a result, 11 files have been rebalanced (total files scanned: 51)
III. 64-bit ring with phi based on md5.
Every brick has number of virtual components S=20
-----------------------------------------------------------------------
before re-balancing:
exp0: file00 file04 file22 file30 file40 file42 file70 file96
exp1: file06 file08 file13 file15 file17 file19 file23 file32
file33 file78 file81 file86 file95
exp2: file11 file14 file16 file24 file29 file57 file63 file67
file73 file76
exp3: file02 file10 file12 file18 file21 file28 file35 file51
file56 file59 file80 file87
exp4: file39 file41 file49 file53 file54 file62 file69 file77
file83 file84 file91
exp5: file05 file20 file31 file43 file68 file74
exp6: file34 file37 file38 file46 file48 file58 file66 file71
file75 file79 file88
exp7: file03 file07 file26 file47 file60 file72 file89 file93
file94
exp8: file01 file09 file25 file27 file36 file44 file45 file50
file52 file55 file59 file61 file64 file65 file82 file85
file90 file92 file97 file98 file99
after re-balancing:
only the following files have been moved (to exp9):
exp0: file70
exp6: file88
exp7: file07 file93
exp8: file45 file82 file85
as a result, 7 files have been rebalanced (total files scanned: 48)
----------------------------------------------------------------------
Table with results
Re-balance results and standard deviations (sd)
for the current Gluster DHT translator (3.2.5), and
for 64-bit rings with S=7 and S=20.
----------------------------------------------------------------------
                DHT (Gluster 3.2.5)  64-bit ring, S=7  64-bit ring, S=20
----------------------------------------------------------------------
files moved             98                  11                  7
files scanned          139                  51                 48
sd before                2.8                 5.4                3.7
sd after                 3.2                 5.3                3.2
----------------------------------------------------------------------
----------
(0*) http://en.wikipedia.org/wiki/Uniform_distribution_%28discrete%29
(1*) Karger, D.; Lehman, E.; Leighton, T.; Panigrahy, R.; Levine, M.;
     Lewin, D. (1997). "Consistent Hashing and Random Trees:
     Distributed Caching Protocols for Relieving Hot Spots on the
     World Wide Web".
(2*) M/N is a median value
(3*)
http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html
(4*) http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html