--- Gordan Bobic <gordan@xxxxxxxxxx> wrote:
> Martin Fick wrote:
> >
> >> A better solution would be to maintain a list of
> >> dirty blocks and use it during selfheal.
> >
> > Agreed, but why not make it infinitely granular and
> > keep a list of dirty file spans instead of blocks?
> > This should be extremely space efficient.
>
> Is this complication and extra effort really worth
> the benefit over the straight rolling hash rsync
> approach? It seems to me that applying the
> rsync method at read-time would be a fairly minor
> mod that would solve 99% of the problem. No extra
> book-keeping would be required, only a
> change from copying the whole file to rsyncing the
> file.

If the rsync solution truly is minor, great, I am all for it! However, I do not share your optimism that the rsync method is minor, and I think that the journal method is comparatively very minor, way less error prone, and more efficient to boot.

There are two parts common to both solutions, and I will claim that both parts are easier with the journal method. The journal method does have an additional third task, the actual journal logging (and cleanup), which I believe is actually very simple. The two parts in common are: 1) determining which parts of files need to be transferred, and 2) the protocol extensions to communicate/transfer these parts.

For #1 in the journal case, simply consult the journal and you have the answer, extremely simple! In the rsync case you must calculate this. I do not believe that the rsync algorithm could be characterized as simple, but I will grant you that at least there is existing code out there that could be used to do it. Even with this benefit, porting that existing solution to glusterfs surely couldn't be easier than looking up extents in a file?

As for #2, a common protocol might even be of use for this, allowing the potential for both solutions in the future! This part should be about the same complexity for both solutions, certainly not more complex for the journal method.

As for performance, I do not believe that it is even close. Part 2 should be the same. For part 1, the biggest benefit to both methods is achieved on large files. On large files, as Garreth has pointed out, the rsync method would mean large disk io/CPU usage on both servers plus a decent amount of network io comparing hashes. The journal method, however, would require no CPU, no network io (which is likely to be much more scarce than disk io), and fairly minor disk io on only one server reading a list of dirty spans/extents from a file (this can be optimized/limited so that it is always less than rsync, which must read the whole file). Unless I am missing something, the journal method in the big file case blows away the performance of the rsync method, and in every case (including the small file case) is still faster/less resource intensive?

This leaves the additional logging/cleanup task of the journaling method. I address your specific logging performance concerns further down in this message. I believe that logging will have a negligible, if even measurable, impact on performance if done right, since there are many performance enhancements that could be made to it. For example: only log changes to files above a certain file size, or perform asynchronous logging so that it does not impact the normal write path... The disk io for each change is independent of the change size and is probably only around 24 bytes.
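To make that 24 byte figure concrete, here is a rough sketch of what a single journal record could look like (the struct and field names are just made up for illustration, they are not from any existing glusterfs code): three 8-byte fields, regardless of how large the write itself was.

#include <stdint.h>

/* Hypothetical record describing one dirty span of a file.
 * Three 8-byte fields = 24 bytes per change, no matter how much
 * data the write actually touched. */
struct dirty_span_record {
        uint64_t offset;   /* byte offset where the write began  */
        uint64_t length;   /* number of bytes written            */
        uint64_t version;  /* file version at the time of change */
};

Healing a file is then just a matter of reading these records back and copying only the spans they describe.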
If changes are logged without syncing to disk (since failure is acceptable), many seeks can be avoided. Logging could even be memory cached for a while, so that a record could potentially be deleted before ever being written to disk (see cleanup below), avoiding any disk io!

> Journal per se wouldn't work, because that implies
> fixed size and write-ahead logging. What would be
> required here is more like the snapshot style undo
> logging.

No, no need to keep the old data around. We only need to remember the start and span of each changed section along with the file version of the change! This is much easier/more space efficient than snapshots. Excuse me for being ignorant of the actual sizes of these three parameters, but they can't be larger than 8 bytes each, can they? 8*3 = 24 bytes. A 100MB journal filesystem could store almost 50 thousand different file changes!

> The problem with this is that you have to:
>
> 1) Categorically establish whether each server is
> connected and up to date for the file being
> checked, and only log if the server has
> disconnected. This involves overhead.

Again, no, there is no need to add overhead to the logging; leave that to the cleanup path. The journal translator (perhaps journal is not the best word, but until a better one is suggested...) can be invisible to the AFR layer during logging. The journal layer can simply log every byte range that is changed, along with the version, without knowing whether any servers are down. As shown above, the disk overhead is minor, or optimizable to be minor, and that can be a journal layer decision, not an AFR layer decision. This means that a client side AFR could see literally no overhead for the logging (but potentially some minor overhead for cleanup).

As for cleaning up unused versions, this can happen in several efficient ways. The AFR translator could inform the journal of previously successful writes on other nodes after the fact, in the next (potentially unrelated) message packet (again, we are talking a few bytes here), so that the change can be quickly flushed (potentially before it was ever even written to disk!). Cleaning up logs that are no longer needed because of a heal can be done during the healing itself. Again, this means a few bytes inside already required message packets. All this keeps the logging and cleanup network overhead to almost nil, effectively a few more bytes piggybacking on already existing message packets.

> 2) For each server that is down at the time, each
> other server would have to start writing the
> snapshot style undo logs (which would have to be
> per server) for all the files being changed. This
> effectively multiplies the disk write-traffic by
> the number of offline servers on all the working
> up to date servers.

No need for snapshot logging (see above), so the small amount of writing needs to occur only once on each up server, not once for each down server. There are no multiplying/scaling issues here.

> The problem that arises then is that the fast(er)
> resyncs on small changes come at the cost of
> massive slowdown in operation when you have
> multiple downed servers. As the number of servers
> grows, this rapidly stops being a workable
> solution.

No snapshot assumption, no massive slowdown. One write per up server no matter how many down servers. This scales nicely since the writes are all on separate servers.
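To show how little machinery the logging and cleanup paths actually need, here is a rough in-memory sketch (again, every name is invented purely for illustration, and acking by version is just one possible scheme): a record is appended on each write with no fsync, and records are simply dropped once the AFR layer piggybacks confirmation that the write reached every replica, possibly before they ever touch disk.

#include <stdint.h>
#include <stdlib.h>

/* Same 24-byte record as in the sketch above. */
struct dirty_span_record {
        uint64_t offset;
        uint64_t length;
        uint64_t version;
};

/* Per-file, in-memory staging area for records that have not yet
 * been flushed to the on-disk journal. */
struct span_log {
        struct dirty_span_record *recs;
        size_t                    used;
        size_t                    cap;
};

/* Called on every write: append one record. No fsync is issued;
 * as argued above, not syncing these records to disk is acceptable. */
static int
journal_log (struct span_log *log, uint64_t off, uint64_t len, uint64_t ver)
{
        if (log->used == log->cap) {
                size_t                    ncap = log->cap ? log->cap * 2 : 64;
                struct dirty_span_record *tmp;

                tmp = realloc (log->recs, ncap * sizeof (*tmp));
                if (!tmp)
                        return -1;
                log->recs = tmp;
                log->cap  = ncap;
        }
        log->recs[log->used].offset  = off;
        log->recs[log->used].length  = len;
        log->recs[log->used].version = ver;
        log->used++;
        return 0;
}

/* Called when a later message packet confirms that the other replicas
 * have applied writes up to 'acked_version': the matching records are
 * simply discarded, possibly before they were ever written to disk. */
static void
journal_ack (struct span_log *log, uint64_t acked_version)
{
        size_t i, kept = 0;

        for (i = 0; i < log->used; i++)
                if (log->recs[i].version > acked_version)
                        log->recs[kept++] = log->recs[i];

        log->used = kept;
}

Flushing whatever survives to the on-disk journal, and the size/age thresholds mentioned earlier, would sit behind these two calls, outside the normal write path.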
I realize I may not convince you of all of this, that you guys have probably spent a lot of time thinking about this, and that there are surely other issues which I have not thought of. Are there any other known/perceived issues? In spite of my pigheadedness and refusal to drop the issue easily, I appreciate that you are taking the time to discuss potential problems with what I believe would be a good solution.

As a point of reference, surely since other projects such as DRBD have implemented similar logging solutions (and not the rsync solution), they at least must believe it to be superior. :) Although I would argue that drbd could easily be modified to benefit greatly from using rsync itself when it needs to do a full sync! Perhaps I will even suggest that to the drbd list. :)

Thanks,

-Martin

P.S. Since I believe the impact of the logging part (without the cleanup or re-sync modifications) will be negligible, and since I believe it is actually the easiest part to implement, it could easily be prototyped to determine its actual impact! Would you be convinced if those numbers turned out to be negligible? :)