On Thu, 1 May 2008, Martin Fick wrote:
Sounds like a good idea. The next question is where
to keep the log. 1 log per file? 1 log per
directory?
How to store them? Shadow files? Separate
shadow volume? A shadow volume might be a good idea
because it keeps the main source mounted directory
exactly the same as a normal directory.
I would start as simple as possible and adapt as
necessary if you run into a performance problem. The
simplest design would probably be a shadow volume with
one log per file with a sparse mirrored directory
structure.
Indeed, that's exactly what I was thinking. You would effectively need a
container, like a namespace, to unify the two.
Log entries could be 24(?) bytes each, concatenated one
after another, making appending easy and reliable. Or,
at a minor space cost (but with potentially better
portability/extensibility), each log file could even
be a colon-delimited, line-based ASCII file (please
don't anyone suggest an XML file!)
version1:start1:span1
version2:start2:span2
...
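
A minimal sketch of reading and writing the colon-delimited layout above, in Python. It assumes all three fields are decimal integers; the names LogEntry, parse_log and format_entry are made up for illustration and are not part of GlusterFS.

```python
from typing import List, NamedTuple


class LogEntry(NamedTuple):
    version: int  # file version the change belongs to
    start: int    # byte offset where the changed region begins
    span: int     # length of the changed region in bytes


def parse_log(text: str) -> List[LogEntry]:
    """Parse 'version:start:span' lines, skipping blank lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        version, start, span = (int(field) for field in line.split(":"))
        entries.append(LogEntry(version, start, span))
    return entries


def format_entry(entry: LogEntry) -> str:
    """Render one entry as a single line ready to append to the log."""
    return f"{entry.version}:{entry.start}:{entry.span}\n"
```

Appending is then just writing format_entry(...) to a file opened in append mode, which keeps the log easy to handle from shell scripts as well.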
If it's fixed length pointers (or in fact fixed length records), I'd go
with packed binary format for efficiency and speed. These will have to be
written to on every write. There would also need to be a header that
states where the roll-over point is. Effectively, the log would be an RRD.
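
To make the packed binary idea concrete, here is a rough Python sketch of a fixed-size circular log. The layout is my assumption, not an agreed format: a 16-byte header holding the capacity and the number of records written so far (the roll-over point is that count modulo the capacity), followed by 24-byte records of three little-endian 64-bit integers (version, start, span).

```python
import struct

HEADER = struct.Struct("<QQ")   # capacity in records, total records ever written
RECORD = struct.Struct("<QQQ")  # version, start offset, span length (24 bytes)


def create_log(f, capacity):
    """Initialise an empty log with room for `capacity` records."""
    f.seek(0)
    f.write(HEADER.pack(capacity, 0))
    f.write(b"\0" * (capacity * RECORD.size))


def append(f, version, start, span):
    """Write one record at the roll-over point, then advance the header."""
    f.seek(0)
    capacity, written = HEADER.unpack(f.read(HEADER.size))
    slot = written % capacity  # oldest slot gets overwritten once full
    f.seek(HEADER.size + slot * RECORD.size)
    f.write(RECORD.pack(version, start, span))
    f.seek(0)
    f.write(HEADER.pack(capacity, written + 1))


def read_all(f):
    """Return the surviving records, oldest first."""
    f.seek(0)
    capacity, written = HEADER.unpack(f.read(HEADER.size))
    count = min(written, capacity)
    first = written % capacity if written > capacity else 0
    records = []
    for i in range(count):
        slot = (first + i) % capacity
        f.seek(HEADER.size + slot * RECORD.size)
        records.append(RECORD.unpack(f.read(RECORD.size)))
    return records
```

Any seekable binary file object works here, for example one opened with mode "r+b". A real implementation would also have to worry about crash consistency between the record write and the header update, which this sketch ignores.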
Having a separate log file for each real file also
makes it easy to code up some optimizations, for
example: it would be easy to lookup the size of the
log and the size of the real file. As soon as the log
becomes bigger than the real file it is no longer
worth keeping as is! It also makes it real easy to
just delete the log if the real file is deleted.
Maybe have the default log be about 0.5% of the file in powers of 2, and
not used for files below a certain size. Maybe grow/shrink it when it
would exceed one step in powers of 2 from its intended 0.5% size. This
would mean that as the file grows, the log increases, but the log
extension gets exponentially rarer. Log truncation could be left until
a suitable roll-over point. If you are syncing inodes, then that is
typically 4KB, and a log entry would be, as you said, 24 bytes. That makes
a log entry for a changed inode block about 1/170, which is about right
for the 0.5% ball park.
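
For reference, 24 / 4096 is about 0.59%, i.e. roughly 1/170. A sketch of the sizing rule in Python follows. The 1 MiB cut-off for small files is an arbitrary placeholder, and reading "one step" as "resize only once the log is more than one power of 2 away from its target" is my interpretation of the rule above.

```python
def target_log_size(file_size, min_file_size=1 << 20):
    """About 0.5% of the file, rounded up to a power of 2; 0 means no log."""
    if file_size < min_file_size:
        return 0
    wanted = max(file_size // 200, 1)  # 0.5% of the file, at least 1 byte
    return 1 << (wanted - 1).bit_length()


def resized_log_size(current, file_size):
    """Lazy resize: keep the current log unless it is off by more than one step."""
    target = target_log_size(file_size)
    if target == 0 or current == 0:
        return target
    if current * 2 < target or current > target * 2:
        return target
    return current
```

With these numbers a 1 MiB file gets an 8 KiB log, and the log is left alone while the file doubles to 2 MiB; it is only reallocated once the file has grown far enough that the target is at least four times the current size.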
Another nice optimizer could make intelligent
decisions about which log files to delete when the
shadow volume starts to fill up. By simply examining
the size of each log versus the size of the real file
one can set an upper bound on how much transfer data
the log could be saving (a real estimate would require
adding all the spans together in the log file taking
into account overlapping sections).
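
The "real estimate" mentioned above amounts to merging overlapping spans before adding them up. A small Python sketch, assuming each log entry has been reduced to a (start, span) pair in bytes; the function names are illustrative only.

```python
def dirty_bytes(spans):
    """Total bytes covered by (start, span) pairs, counting overlaps once."""
    total = 0
    cur_start = cur_end = None
    for start, span in sorted(spans):
        end = start + span
        if cur_end is None or start > cur_end:
            # Gap before this span: close off the previous merged region.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged region.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def transfer_saved(file_size, spans):
    """Bytes a log-driven resync avoids sending compared with a full copy."""
    return max(file_size - dirty_bytes(spans), 0)
```

A pruner could rank logs by transfer_saved and delete the ones that save the least first.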
Sure - if you want to keep volumes separate. Or you could just make sure
that your log volume is always at least 1/170 of the data volume it's
shadowing. Possibly a bit more for a safety margin with the lazy log
resizing - around 1% ought to suffice for most sane cases.
Finally, it would
allow an admin to prune the shadow volume manually of
whichever logs he chooses to prune. An ASCII file
would make it easy to script various pruners.
I think that starts getting potentially dangerous. I think just having the
logs volume at about 1% of the data volume would be better. Of course, if
you keep both on the same physical volume, it won't matter.
It would be nice to design the shadow volume so that
it can be removed from the picture at any time without
corrupting anything.
You already covered that with the sparse shadow volume tree. If there's no
log, you resync the whole file.
It would also be nice to ensure
that the journal translator can handle an out of space
condition. This way each server is not required to
even have the same size journal volume if any at all.
Note that this gets into the chicken and the egg problem - the log files
would still need to be syncable directly using the current method - or
you'd need a journal for your journalling volume. But if the journal is
typically < 1% of the file, that's probably cheap enough that it won't
matter too much. You could also probably set the upper limit on the volume
size, because past a certain point the file changes will be limited by
disk speeds, so from there on a bigger file doesn't imply more log space
is required.
A (shadow volume) log should, ideally, also keep
additional sanity check information such as file
metadata (timestamps, size) for cross-check of
whether something went weird and the file was
changed underneath GlusterFS, and if it has, flush
out the log and force a full resync on the file.
Hmm, this seems like an additional layer that might be
nice (and perhaps an XML log would be appropriate
here), but I would put it in a separate inline
translator so that it is not required. The nice part
is that if the protocol is extended to handle the
journal layer, adding another separate layer like this
would probably be easy!
For the sake of an extra few bytes in the log entry (8 byte time stamp + 8
byte file size), I think it is probably worthwhile having it for
crosscheck.
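
As a rough illustration of the crosscheck, here is a Python sketch with the two extra fields appended to the 24-byte entry, giving 40 bytes per record. The exact layout and the use of nanosecond mtime are my assumptions; the function names are made up.

```python
import os
import struct

# version, start, span, mtime in nanoseconds, file size: 40 bytes per record
RECORD = struct.Struct("<QQQqQ")


def make_record(path, version, start, span):
    """Build a log record stamped with the file's current mtime and size."""
    st = os.stat(path)
    return RECORD.pack(version, start, span, st.st_mtime_ns, st.st_size)


def log_is_trustworthy(path, last_record):
    """True if the file still matches the metadata in the newest record.

    A mismatch suggests the file was changed underneath the translator,
    so the log should be flushed and the whole file resynced.
    """
    _, _, _, mtime_ns, size = RECORD.unpack(last_record)
    st = os.stat(path)
    return st.st_mtime_ns == mtime_ns and st.st_size == size
```

This only catches changes that alter the size or the timestamp, so it is a sanity check rather than a guarantee.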
Thanks again for your patience, I know it's not easy
listening to back seat designers :)
I second that apology. :-)
Gordan