[PATCH 21/8] cleanup flush.c part 2

flush.c is now down to only: total: 3 errors, 2 warnings, 3704 lines checked
Signed-off-by: Dushan Tcholich <dusanc@xxxxxxxxx>
--- flush.c.orig	2007-10-24 11:02:28.000000000 +0200
+++ flush.c	2007-10-24 11:53:21.000000000 +0200
@@ -138,242 +138,261 @@
 */
 
 /* FLUSH_PREP_ONCE_PER_TRANSACTION: Within a single transaction a node is never
-   flushprepped twice (unless an explicit call to flush_unprep is made as described in
-   detail below).  For example a node is dirtied, allocated, and then early-flushed to
-   disk and set clean.  Before the transaction commits, the page is dirtied again and, due
-   to memory pressure, the node is flushed again.  The flush algorithm will not relocate
-   the node to a new disk location, it will simply write it to the same, previously
-   relocated position again.
+   flushprepped twice (unless an explicit call to flush_unprep is made as
+   described in detail below). For example a node is dirtied, allocated, and
+   then early-flushed to disk and set clean. Before the transaction commits, the
+   page is dirtied again and, due to memory pressure, the node is flushed again.
+   The flush algorithm will not relocate the node to a new disk location, it
+   will simply write it to the same, previously relocated position again.
 */
 
-/* THE BOTTOM-UP VS. TOP-DOWN ISSUE: This code implements a bottom-up algorithm where we
-   start at a leaf node and allocate in parent-first order by iterating to the right.  At
-   each step of the iteration, we check for the right neighbor.  Before advancing to the
-   right neighbor, we check if the current position and the right neighbor share the same
-   parent.  If they do not share the same parent, the parent is allocated before the right
-   neighbor.
-
-   This process goes recursively up the tree and squeeze nodes level by level as long as
-   the right neighbor and the current position have different parents, then it allocates
-   the right-neighbors-with-different-parents on the way back down.  This process is
-   described in more detail in flush_squalloc_changed_ancestor and the recursive function
-   squalloc_one_changed_ancestor.  But the purpose here is not to discuss the
-   specifics of the bottom-up approach as it is to contrast the bottom-up and top-down
-   approaches.
-
-   The top-down algorithm was implemented earlier (April-May 2002).  In the top-down
-   approach, we find a starting point by scanning left along each level past dirty nodes,
-   then going up and repeating the process until the left node and the parent node are
-   clean.  We then perform a parent-first traversal from the starting point, which makes
-   allocating in parent-first order trivial.  After one subtree has been allocated in this
-   manner, we move to the right, try moving upward, then repeat the parent-first
-   traversal.
-
-   Both approaches have problems that need to be addressed.  Both are approximately the
-   same amount of code, but the bottom-up approach has advantages in the order it acquires
-   locks which, at the very least, make it the better approach.  At first glance each one
-   makes the other one look simpler, so it is important to remember a few of the problems
-   with each one.
-
-   Main problem with the top-down approach: When you encounter a clean child during the
-   parent-first traversal, what do you do?  You would like to avoid searching through a
-   large tree of nodes just to find a few dirty leaves at the bottom, and there is not an
-   obvious solution.  One of the advantages of the top-down approach is that during the
-   parent-first traversal you check every child of a parent to see if it is dirty.  In
-   this way, the top-down approach easily handles the main problem of the bottom-up
-   approach: unallocated children.
-
-   The unallocated children problem is that before writing a node to disk we must make
-   sure that all of its children are allocated.  Otherwise, the writing the node means
-   extra I/O because the node will have to be written again when the child is finally
-   allocated.
-
-   WE HAVE NOT YET ELIMINATED THE UNALLOCATED CHILDREN PROBLEM.  Except for bugs, this
-   should not cause any file system corruption, it only degrades I/O performance because a
-   node may be written when it is sure to be written at least one more time in the same
-   transaction when the remaining children are allocated.  What follows is a description
-   of how we will solve the problem.
+/* THE BOTTOM-UP VS. TOP-DOWN ISSUE: This code implements a bottom-up algorithm
+   where we start at a leaf node and allocate in parent-first order by iterating
+   to the right. At each step of the iteration, we check for the right neighbor.
+   Before advancing to the right neighbor, we check if the current position and
+   the right neighbor share the same parent. If they do not share the same
+   parent, the parent is allocated before the right neighbor.
+
+   This process goes recursively up the tree and squeeze nodes level by level as
+   long as the right neighbor and the current position have different parents,
+   then it allocates the right-neighbors-with-different-parents on the way back
+   down. This process is described in more detail in
+   flush_squalloc_changed_ancestor and the recursive function
+   squalloc_one_changed_ancestor. But the purpose here is not so much to
+   discuss the specifics of the bottom-up approach as to contrast the
+   bottom-up and top-down approaches.
+
+   The top-down algorithm was implemented earlier (April-May 2002). In the
+   top-down approach, we find a starting point by scanning left along each level
+   past dirty nodes, then going up and repeating the process until the left node
+   and the parent node are clean. We then perform a parent-first traversal from
+   the starting point, which makes allocating in parent-first order trivial.
+   After one subtree has been allocated in this manner, we move to the right,
+   try moving upward, then repeat the parent-first traversal.
+
+   Both approaches have problems that need to be addressed. Both are
+   approximately the same amount of code, but the bottom-up approach has
+   advantages in the order it acquires locks which, at the very least, make it
+   the better approach. At first glance each one makes the other one look
+   simpler, so it is important to remember a few of the problems with each one.
+
+   Main problem with the top-down approach: When you encounter a clean child
+   during the parent-first traversal, what do you do? You would like to avoid
+   searching through a large tree of nodes just to find a few dirty leaves at
+   the bottom, and there is not an obvious solution. One of the advantages of
+   the top-down approach is that during the parent-first traversal you check
+   every child of a parent to see if it is dirty. In this way, the top-down
+   approach easily handles the main problem of the bottom-up approach:
+   unallocated children.
+
+   The unallocated children problem is that before writing a node to disk we
+   must make sure that all of its children are allocated. Otherwise, writing
+   the node means extra I/O because the node will have to be written again when
+   the child is finally allocated.
+
+   WE HAVE NOT YET ELIMINATED THE UNALLOCATED CHILDREN PROBLEM. Except for bugs,
+   this should not cause any file system corruption; it only degrades I/O
+   performance because a node may be written when it is sure to be written at
+   least one more time in the same transaction when the remaining children are
+   allocated. What follows is a description of how we will solve the problem.
 */
 
-/* HANDLING UNALLOCATED CHILDREN: During flush we may allocate a parent node then,
-   proceeding in parent first order, allocate some of its left-children, then encounter a
-   clean child in the middle of the parent.  We do not allocate the clean child, but there
-   may remain unallocated (dirty) children to the right of the clean child.  If we were to
-   stop flushing at this moment and write everything to disk, the parent might still
-   contain unallocated children.
-
-   We could try to allocate all the descendents of every node that we allocate, but this
-   is not necessary.  Doing so could result in allocating the entire tree: if the root
-   node is allocated then every unallocated node would have to be allocated before
-   flushing.  Actually, we do not have to write a node just because we allocate it.  It is
-   possible to allocate but not write a node during flush, when it still has unallocated
-   children.  However, this approach is probably not optimal for the following reason.
-
-   The flush algorithm is designed to allocate nodes in parent-first order in an attempt
-   to optimize reads that occur in the same order.  Thus we are read-optimizing for a
-   left-to-right scan through all the leaves in the system, and we are hoping to
-   write-optimize at the same time because those nodes will be written together in batch.
-   What happens, however, if we assign a block number to a node in its read-optimized
-   order but then avoid writing it because it has unallocated children?  In that
-   situation, we lose out on the write-optimization aspect because a node will have to be
-   written again to the its location on the device, later, which likely means seeking back
-   to that location.
+/* HANDLING UNALLOCATED CHILDREN: During flush we may allocate a parent node,
+   then, proceeding in parent-first order, allocate some of its left-children,
+   then encounter a clean child in the middle of the parent. We do not allocate
+   the clean child, but there may remain unallocated (dirty) children to the
+   right of the clean child. If we were to stop flushing at this moment and
+   write everything to disk, the parent might still contain unallocated
+   children.
+
+   We could try to allocate all the descendants of every node that we allocate,
+   but this is not necessary. Doing so could result in allocating the entire
+   tree: if the root node is allocated then every unallocated node would have to
+   be allocated before flushing. Actually, we do not have to write a node just
+   because we allocate it. It is possible to allocate but not write a node
+   during flush, when it still has unallocated children. However, this approach
+   is probably not optimal for the following reason.
+
+   The flush algorithm is designed to allocate nodes in parent-first order in an
+   attempt to optimize reads that occur in the same order. Thus we are
+   read-optimizing for a left-to-right scan through all the leaves in the
+   system, and we are hoping to write-optimize at the same time because those
+   nodes will be written together in batch. What happens, however, if we assign
+   a block number to a node in its read-optimized order but then avoid writing
+   it because it has unallocated children? In that situation, we lose out on the
+   write-optimization aspect because a node will have to be written again to
+   its location on the device, later, which likely means seeking back to that
+   location.
 
    So there are tradeoffs. We can choose either:
 
    A. Allocate all unallocated children to preserve both write-optimization and
-   read-optimization, but this is not always desirable because it may mean having to
-   allocate and flush very many nodes at once.
+   read-optimization, but this is not always desirable because it may mean
+   having to allocate and flush very many nodes at once.
 
-   B. Defer writing nodes with unallocated children, keep their read-optimized locations,
-   but sacrifice write-optimization because those nodes will be written again.
-
-   C. Defer writing nodes with unallocated children, but do not keep their read-optimized
-   locations.  Instead, choose to write-optimize them later, when they are written.  To
-   facilitate this, we "undo" the read-optimized allocation that was given to the node so
-   that later it can be write-optimized, thus "unpreparing" the flush decision.  This is a
-   case where we disturb the FLUSH_PREP_ONCE_PER_TRANSACTION rule described above.  By a
-   call to flush_unprep() we will: if the node was wandered, unset the JNODE_OVRWR bit;
-   if the node was relocated, unset the JNODE_RELOC bit, non-deferred-deallocate its block
-   location, and set the JNODE_CREATED bit, effectively setting the node back to an
-   unallocated state.
-
-   We will take the following approach in v4.0: for twig nodes we will always finish
-   allocating unallocated children (A).  For nodes with (level > TWIG) we will defer
-   writing and choose write-optimization (C).
-
-   To summarize, there are several parts to a solution that avoids the problem with
-   unallocated children:
-
-   FIXME-ZAM: Still no one approach is implemented to eliminate the "UNALLOCATED CHILDREN"
-   problem because there was an experiment which was done showed that we have 1-2 nodes
-   with unallocated children for thousands of written nodes.  The experiment was simple
-   like coping / deletion of linux kernel sources.  However the problem can arise in more
-   complex tests.  I think we have jnode_io_hook to insert a check for unallocated
-   children and see what kind of problem we have.
-
-   1. When flush reaches a stopping point (e.g., a clean node), it should continue calling
-   squeeze-and-allocate on any remaining unallocated children.  FIXME: Difficulty to
-   implement: should be simple -- amounts to adding a while loop to jnode_flush, see
-   comments in that function.
-
-   2. When flush reaches flush_empty_queue(), some of the (level > TWIG) nodes may still
-   have unallocated children.  If the twig level has unallocated children it is an
-   assertion failure.  If a higher-level node has unallocated children, then it should be
-   explicitly de-allocated by a call to flush_unprep().  FIXME: Difficulty to implement:
-   should be simple.
-
-   3. (CPU-Optimization) Checking whether a node has unallocated children may consume more
-   CPU cycles than we would like, and it is possible (but medium complexity) to optimize
-   this somewhat in the case where large sub-trees are flushed.  The following observation
-   helps: if both the left- and right-neighbor of a node are processed by the flush
-   algorithm then the node itself is guaranteed to have all of its children allocated.
-   However, the cost of this check may not be so expensive after all: it is not needed for
-   leaves and flush can guarantee this property for twigs.  That leaves only (level >
-   TWIG) nodes that have to be checked, so this optimization only helps if at least three
-   (level > TWIG) nodes are flushed in one pass, and the savings will be very small unless
-   there are many more (level > TWIG) nodes.  But if there are many (level > TWIG) nodes
-   then the number of blocks being written will be very large, so the savings may be
-   insignificant.  That said, the idea is to maintain both the left and right edges of
-   nodes that are processed in flush.  When flush_empty_queue() is called, a relatively
-   simple test will tell whether the (level > TWIG) node is on the edge.  If it is on the
-   edge, the slow check is necessary, but if it is in the interior then it can be assumed
-   to have all of its children allocated.  FIXME: medium complexity to implement, but
-   simple to verify given that we must have a slow check anyway.
-
-   4. (Optional) This part is optional, not for v4.0--flush should work independently of
-   whether this option is used or not.  Called RAPID_SCAN, the idea is to amend the
-   left-scan operation to take unallocated children into account.  Normally, the left-scan
-   operation goes left as long as adjacent nodes are dirty up until some large maximum
-   value (FLUSH_SCAN_MAXNODES) at which point it stops and begins flushing.  But scan-left
-   may stop at a position where there are unallocated children to the left with the same
-   parent.  When RAPID_SCAN is enabled, the ordinary scan-left operation stops after
-   FLUSH_RELOCATE_THRESHOLD, which is much smaller than FLUSH_SCAN_MAXNODES, then procedes
-   with a rapid scan.  The rapid scan skips all the interior children of a node--if the
-   leftmost child of a twig is dirty, check its left neighbor (the rightmost child of the
-   twig to the left).  If the left neighbor of the leftmost child is also dirty, then
-   continue the scan at the left twig and repeat.  This option will cause flush to
-   allocate more twigs in a single pass, but it also has the potential to write many more
-   nodes than would otherwise be written without the RAPID_SCAN option.  RAPID_SCAN
-   was partially implemented, code removed August 12, 2002 by JMACD.
+   B. Defer writing nodes with unallocated children, keep their read-optimized
+   locations, but sacrifice write-optimization because those nodes will be
+   written again.
+
+   C. Defer writing nodes with unallocated children, but do not keep their
+   read-optimized locations. Instead, choose to write-optimize them later, when
+   they are written. To facilitate this, we "undo" the read-optimized allocation
+   that was given to the node so that later it can be write-optimized, thus
+   "unpreparing" the flush decision. This is a case where we disturb the
+   FLUSH_PREP_ONCE_PER_TRANSACTION rule described above. By a call to
+   flush_unprep() we will: if the node was wandered, unset the JNODE_OVRWR bit;
+   if the node was relocated, unset the JNODE_RELOC bit, non-deferred-deallocate
+   its block location, and set the JNODE_CREATED bit, effectively setting the
+   node back to an unallocated state.
+
+   We will take the following approach in v4.0: for twig nodes we will always
+   finish allocating unallocated children (A).  For nodes with (level > TWIG)
+   we will defer writing and choose write-optimization (C).
+
+   To summarize, there are several parts to a solution that avoids the problem
+   with unallocated children:
+
+   FIXME-ZAM: Still, no single approach has been implemented to eliminate the
+   "UNALLOCATED CHILDREN" problem, because an experiment showed that we have
+   only 1-2 nodes with unallocated children for thousands of written nodes.
+   The experiment was simple, like copying/deletion of Linux kernel sources.
+   However, the problem can arise in more complex tests. I think we have
+   jnode_io_hook to insert a check for unallocated children and see what kind
+   of problem we have.
+
+   1. When flush reaches a stopping point (e.g. a clean node) it should continue
+   calling squeeze-and-allocate on any remaining unallocated children.
+   FIXME: Difficulty to implement: should be simple -- amounts to adding a while
+   loop to jnode_flush, see comments in that function.
+
+   2. When flush reaches flush_empty_queue(), some of the (level > TWIG) nodes
+   may still have unallocated children. If the twig level has unallocated
+   children it is an assertion failure. If a higher-level node has unallocated
+   children, then it should be explicitly de-allocated by a call to
+   flush_unprep().
+   FIXME: Difficulty to implement: should be simple.
+
+   3. (CPU-Optimization) Checking whether a node has unallocated children may
+   consume more CPU cycles than we would like, and it is possible (but medium
+   complexity) to optimize this somewhat in the case where large sub-trees are
+   flushed. The following observation helps: if both the left- and
+   right-neighbor of a node are processed by the flush algorithm then the node
+   itself is guaranteed to have all of its children allocated. However, the cost
+   of this check may not be so expensive after all: it is not needed for leaves
+   and flush can guarantee this property for twigs. That leaves only (level >
+   TWIG) nodes that have to be checked, so this optimization only helps if at
+   least three (level > TWIG) nodes are flushed in one pass, and the savings
+   will be very small unless there are many more (level > TWIG) nodes. But if
+   there are many (level > TWIG) nodes then the number of blocks being written
+   will be very large, so the savings may be insignificant. That said, the idea
+   is to maintain both the left and right edges of nodes that are processed in
+   flush.  When flush_empty_queue() is called, a relatively simple test will
+   tell whether the (level > TWIG) node is on the edge. If it is on the edge,
+   the slow check is necessary, but if it is in the interior then it can be
+   assumed to have all of its children allocated. FIXME: medium complexity to
+   implement, but simple to verify given that we must have a slow check anyway.
+
+   4. (Optional) This part is optional, not for v4.0--flush should work
+   independently of whether this option is used or not. Called RAPID_SCAN, the
+   idea is to amend the left-scan operation to take unallocated children into
+   account. Normally, the left-scan operation goes left as long as adjacent
+   nodes are dirty up until some large maximum value (FLUSH_SCAN_MAXNODES) at
+   which point it stops and begins flushing. But scan-left may stop at a
+   position where there are unallocated children to the left with the same
+   parent. When RAPID_SCAN is enabled, the ordinary scan-left operation stops
+   after FLUSH_RELOCATE_THRESHOLD, which is much smaller than
+   FLUSH_SCAN_MAXNODES, then proceeds with a rapid scan. The rapid scan skips
+   all the interior children of a node--if the leftmost child of a twig is
+   dirty, check its left neighbor (the rightmost child of the twig to the left).
+   If the left neighbor of the leftmost child is also dirty, then continue the
+   scan at the left twig and repeat.  This option will cause flush to allocate
+   more twigs in a single pass, but it also has the potential to write many more
+   nodes than would otherwise be written without the RAPID_SCAN option.
+   RAPID_SCAN was partially implemented, code removed August 12, 2002 by JMACD.
 */
 
-/* FLUSH CALLED ON NON-LEAF LEVEL.  Most of our design considerations assume that the
-   starting point for flush is a leaf node, but actually the flush code cares very little
-   about whether or not this is true.  It is possible that all the leaf nodes are flushed
-   and dirty parent nodes still remain, in which case jnode_flush() is called on a
-   non-leaf argument.  Flush doesn't care--it treats the argument node as if it were a
-   leaf, even when it is not.  This is a simple approach, and there may be a more optimal
-   policy but until a problem with this approach is discovered, simplest is probably best.
-
-   NOTE: In this case, the ordering produced by flush is parent-first only if you ignore
-   the leaves.  This is done as a matter of simplicity and there is only one (shaky)
-   justification.  When an atom commits, it flushes all leaf level nodes first, followed
-   by twigs, and so on.  With flushing done in this order, if flush is eventually called
-   on a non-leaf node it means that (somehow) we reached a point where all leaves are
-   clean and only internal nodes need to be flushed.  If that it the case, then it means
-   there were no leaves that were the parent-first preceder/follower of the parent.  This
-   is expected to be a rare case, which is why we do nothing special about it.  However,
-   memory pressure may pass an internal node to flush when there are still dirty leaf
-   nodes that need to be flushed, which could prove our original assumptions
-   "inoperative".  If this needs to be fixed, then scan_left/right should have
-   special checks for the non-leaf levels.  For example, instead of passing from a node to
-   the left neighbor, it should pass from the node to the left neighbor's rightmost
-   descendent (if dirty).
+/* FLUSH CALLED ON NON-LEAF LEVEL. Most of our design considerations assume that
+   the starting point for flush is a leaf node, but actually the flush code
+   cares very little about whether or not this is true.  It is possible that all
+   the leaf nodes are flushed and dirty parent nodes still remain, in which case
+   jnode_flush() is called on a non-leaf argument. Flush doesn't care--it treats
+   the argument node as if it were a leaf, even when it is not. This is a simple
+   approach, and there may be a more optimal policy but until a problem with
+   this approach is discovered, simplest is probably best.
+
+   NOTE: In this case, the ordering produced by flush is parent-first only if
+   you ignore the leaves. This is done as a matter of simplicity and there is
+   only one (shaky) justification. When an atom commits, it flushes all leaf
+   level nodes first, followed by twigs, and so on. With flushing done in this
+   order, if flush is eventually called on a non-leaf node it means that
+   (somehow) we reached a point where all leaves are clean and only internal
+   nodes need to be flushed. If that is the case, then it means there were no
+   leaves that were the parent-first preceder/follower of the parent. This is
+   expected to be a rare case, which is why we do nothing special about it.
+   However, memory pressure may pass an internal node to flush when there are
+   still dirty leaf nodes that need to be flushed, which could prove our
+   original assumptions "inoperative". If this needs to be fixed, then
+   scan_left/right should have special checks for the non-leaf levels. For
+   example, instead of passing from a node to the left neighbor, it should pass
+   from the node to the left neighbor's rightmost descendant (if dirty).
 
 */
 
-/* UNIMPLEMENTED AS YET: REPACKING AND RESIZING.  We walk the tree in 4MB-16MB chunks, dirtying everything and putting
-   it into a transaction.  We tell the allocator to allocate the blocks as far as possible towards one end of the
-   logical device--the left (starting) end of the device if we are walking from left to right, the right end of the
-   device if we are walking from right to left.  We then make passes in alternating directions, and as we do this the
-   device becomes sorted such that tree order and block number order fully correlate.
+/* UNIMPLEMENTED AS YET: REPACKING AND RESIZING. We walk the tree in 4MB-16MB
+   chunks, dirtying everything and putting it into a transaction. We tell the
+   allocator to allocate the blocks as far as possible towards one end of the
+   logical device--the left (starting) end of the device if we are walking from
+   left to right, the right end of the device if we are walking from right to
+   left.  We then make passes in alternating directions, and as we do this the
+   device becomes sorted such that tree order and block number order fully
+   correlate.
 
-   Resizing is done by shifting everything either all the way to the left or all the way
-   to the right, and then reporting the last block.
+   Resizing is done by shifting everything either all the way to the left or all
+   the way to the right, and then reporting the last block.
 */
 
-/* RELOCATE DECISIONS: The code makes a decision to relocate in several places.  This
-   descibes the policy from the highest level:
+/* RELOCATE DECISIONS: The code makes a decision to relocate in several places.
+   This describes the policy from the highest level:
 
-   The FLUSH_RELOCATE_THRESHOLD parameter: If we count this many consecutive nodes on the
-   leaf level during flush-scan (right, left), then we unconditionally decide to relocate
-   leaf nodes.
+   The FLUSH_RELOCATE_THRESHOLD parameter: If we count this many consecutive
+   nodes on the leaf level during flush-scan (right, left), then we
+   unconditionally decide to relocate leaf nodes.
 
    Otherwise, there are two contexts in which we make a decision to relocate:
 
    1. The REVERSE PARENT-FIRST context: Implemented in reverse_relocate_test().
-   During the initial stages of flush, after scan-right completes, we want to ask the
-   question: should we relocate this leaf node and thus dirty the parent node.  Then if
-   the node is a leftmost child its parent is its own parent-first preceder, thus we repeat
-   the question at the next level up, and so on.  In these cases we are moving in the
-   reverse-parent first direction.
-
-   There is another case which is considered the reverse direction, which comes at the end
-   of a twig in reverse_relocate_end_of_twig().  As we finish processing a twig we may
-   reach a point where there is a clean twig to the right with a dirty leftmost child.  In
-   this case, we may wish to relocate the child by testing if it should be relocated
-   relative to its parent.
-
-   2. The FORWARD PARENT-FIRST context: Testing for forward relocation is done in
-   allocate_znode.  What distinguishes the forward parent-first case from the
-   reverse-parent first case is that the preceder has already been allocated in the
-   forward case, whereas in the reverse case we don't know what the preceder is until we
-   finish "going in reverse".  That simplifies the forward case considerably, and there we
-   actually use the block allocator to determine whether, e.g., a block closer to the
-   preceder is available.
+   During the initial stages of flush, after scan-right completes, we want to
+   ask the question: should we relocate this leaf node and thus dirty the
+   parent node? Then, if the node is a leftmost child, its parent is its own
+   parent-first preceder, so we repeat the question at the next level up, and
+   so on. In these cases we are moving in the reverse parent-first direction.
+
+   There is another case which is considered the reverse direction, which comes
+   at the end of a twig in reverse_relocate_end_of_twig(). As we finish
+   processing a twig we may reach a point where there is a clean twig to the
+   right with a dirty leftmost child. In this case, we may wish to relocate the
+   child by testing if it should be relocated relative to its parent.
+
+   2. The FORWARD PARENT-FIRST context: Testing for forward relocation is done
+   in allocate_znode. What distinguishes the forward parent-first case from the
+   reverse parent-first case is that the preceder has already been allocated in
+   the forward case, whereas in the reverse case we don't know what the preceder
+   is until we finish "going in reverse". That simplifies the forward case
+   considerably, and there we actually use the block allocator to determine
+   whether, e.g., a block closer to the preceder is available.
 */
 
-/* SQUEEZE_LEFT_EDGE: Unimplemented idea for future consideration.  The idea is, once we
-   finish scan-left and find a starting point, if the parent's left neighbor is dirty then
-   squeeze the parent's left neighbor and the parent.  This may change the
-   flush-starting-node's parent.  Repeat until the child's parent is stable.  If the child
-   is a leftmost child, repeat this left-edge squeezing operation at the next level up.
-   Note that we cannot allocate extents during this or they will be out of parent-first
-   order.  There is also some difficult coordinate maintenence issues.  We can't do a tree
-   search to find coordinates again (because we hold locks), we have to determine them
-   from the two nodes being squeezed.  Looks difficult, but has potential to increase
-   space utilization. */
+/* SQUEEZE_LEFT_EDGE: Unimplemented idea for future consideration. The idea is,
+   once we finish scan-left and find a starting point, if the parent's left
+   neighbor is dirty then squeeze the parent's left neighbor and the parent.
+   This may change the flush-starting-node's parent. Repeat until the child's
+   parent is stable. If the child is a leftmost child, repeat this left-edge
+   squeezing operation at the next level up. Note that we cannot allocate
+   extents during this or they will be out of parent-first order. There are
+   also some difficult coordinate maintenance issues. We can't do a tree search
+   to find coordinates again (because we hold locks); we have to determine them
+   from the two nodes being squeezed. Looks difficult, but has potential to
+   increase space utilization. */
 
 /* Flush-scan helper functions. */
 static void scan_init(flush_scan * scan);
@@ -393,7 +412,8 @@ static int alloc_pos_and_ancestors(flush
 static int alloc_one_ancestor(const coord_t *coord, flush_pos_t *pos);
 static int set_preceder(const coord_t *coord_in, flush_pos_t *pos);
 
-/* Main flush algorithm.  Note on abbreviation: "squeeze and allocate" == "squalloc". */
+/* Main flush algorithm.
+   Note on abbreviation: "squeeze and allocate" == "squalloc". */
 static int squalloc(flush_pos_t *pos);
 
 /* Flush squeeze implementation. */
@@ -423,7 +443,7 @@ static int jnode_lock_parent_coord(jnode
 				   load_count * parent_zh,
 				   znode_lock_mode mode, int try);
 static int neighbor_in_slum(znode * node, lock_handle * right_lock, sideof side,
-			    znode_lock_mode mode, int check_dirty, int expected);
+			   znode_lock_mode mode, int check_dirty, int expected);
 static int znode_same_parents(znode * a, znode * b);
 
 static int znode_check_flushprepped(znode * node)
@@ -447,9 +467,9 @@ assert("nikita-3435",							\
 	    extent_is_unallocated(&scan->parent_coord),			\
 	    extent_unit_index(&scan->parent_coord) == index_jnode(scan->node)))
 
-/* This flush_cnt variable is used to track the number of concurrent flush operations,
-   useful for debugging.  It is initialized in txnmgr.c out of laziness (because flush has
-   no static initializer function...) */
+/* This flush_cnt variable is used to track the number of concurrent flush
+   operations, useful for debugging. It is initialized in txnmgr.c out of
+   laziness (because flush has no static initializer function...) */
 ON_DEBUG(atomic_t flush_cnt;
     )
 
@@ -550,8 +570,8 @@ static int prepare_flush_pos(flush_pos_t
 		if (ret)
 			goto done;
 		if (!item_is_extent(&parent_coord)) {
-			/* file was converted to tail, org became HB, we found internal
-			   item */
+			/* file was converted to tail, org became HB, we found
+			   internal item */
 			ret = -EAGAIN;
 			goto done;
 		}
@@ -561,96 +581,107 @@ static int prepare_flush_pos(flush_pos_t
 		pos->child = jref(org);
 		if (extent_is_unallocated(&parent_coord)
 		    && extent_unit_index(&parent_coord) != index_jnode(org)) {
-			/* @org is not first child of its parent unit. This may happen
-			   because longerm lock of its parent node was released between
-			   scan_left and scan_right. For now work around this having flush to repeat */
+			/* @org is not first child of its parent unit. This may
+			   happen because longterm lock of its parent node was
+			   released between scan_left and scan_right. For now
+			   work around this by having flush repeat. */
 			ret = -EAGAIN;
 		}
 	}
 
-      done:
+done:
 	done_load_count(&load);
 	done_lh(&lock);
 	return ret;
 }
 
 /* TODO LIST (no particular order): */
-/* I have labelled most of the legitimate FIXME comments in this file with letters to
-   indicate which issue they relate to.  There are a few miscellaneous FIXMEs with
-   specific names mentioned instead that need to be inspected/resolved. */
+/* I have labelled most of the legitimate FIXME comments in this file with
+   letters to indicate which issue they relate to. There are a few miscellaneous
+   FIXMEs with specific names mentioned instead that need to be
+   inspected/resolved. */
 /* B. There is an issue described in reverse_relocate_test having to do with an
-   imprecise is_preceder? check having to do with partially-dirty extents.  The code that
-   sets preceder hints and computes the preceder is basically untested.  Careful testing
-   needs to be done that preceder calculations are done correctly, since if it doesn't
-   affect correctness we will not catch this stuff during regular testing. */
-/* C. EINVAL, E_DEADLOCK, E_NO_NEIGHBOR, ENOENT handling.  It is unclear which of these are
-   considered expected but unlikely conditions.  Flush currently returns 0 (i.e., success
-   but no progress, i.e., restart) whenever it receives any of these in jnode_flush().
-   Many of the calls that may produce one of these return values (i.e.,
-   longterm_lock_znode, reiser4_get_parent, reiser4_get_neighbor, ...) check some of these
-   values themselves and, for instance, stop flushing instead of resulting in a restart.
-   If any of these results are true error conditions then flush will go into a busy-loop,
-   as we noticed during testing when a corrupt tree caused find_child_ptr to return
-   ENOENT.  It needs careful thought and testing of corner conditions.
+   imprecise is_preceder? check having to do with partially-dirty extents. The
+   code that sets preceder hints and computes the preceder is basically
+   untested. Careful testing needs to be done to verify that preceder
+   calculations are done correctly, since errors that do not affect correctness
+   will not be caught during regular testing. */
+/* C. EINVAL, E_DEADLOCK, E_NO_NEIGHBOR, ENOENT handling. It is unclear which of
+   these are considered expected but unlikely conditions. Flush currently
+   returns 0 (i.e., success but no progress, i.e., restart) whenever it receives
+   any of these in jnode_flush(). Many of the calls that may produce one of
+   these return values (i.e., longterm_lock_znode, reiser4_get_parent,
+   reiser4_get_neighbor, ...) check some of these values themselves and, for
+   instance, stop flushing instead of resulting in a restart. If any of these
+   results are true error conditions then flush will go into a busy-loop, as we
+   noticed during testing when a corrupt tree caused find_child_ptr to return
+   ENOENT. It needs careful thought and testing of corner conditions.
 */
-/* D. Atomicity of flush_prep against deletion and flush concurrency.  Suppose a created
-   block is assigned a block number then early-flushed to disk.  It is dirtied again and
-   flush is called again.  Concurrently, that block is deleted, and the de-allocation of
-   its block number does not need to be deferred, since it is not part of the preserve set
-   (i.e., it didn't exist before the transaction).  I think there may be a race condition
-   where flush writes the dirty, created block after the non-deferred deallocated block
-   number is re-allocated, making it possible to write deleted data on top of non-deleted
-   data.  Its just a theory, but it needs to be thought out. */
+/* D. Atomicity of flush_prep against deletion and flush concurrency. Suppose a
+   created block is assigned a block number then early-flushed to disk. It is
+   dirtied again and flush is called again. Concurrently, that block is deleted,
+   and the de-allocation of its block number does not need to be deferred, since
+   it is not part of the preserve set (i.e., it didn't exist before the
+   transaction). I think there may be a race condition where flush writes the
+   dirty, created block after the non-deferred deallocated block number is
+   re-allocated, making it possible to write deleted data on top of non-deleted
+   data. It's just a theory, but it needs to be thought out. */
 /* F. bio_alloc() failure is not handled gracefully. */
 /* G. Unallocated children. */
-/* H. Add a WANDERED_LIST to the atom to clarify the placement of wandered blocks. */
+/* H. Add a WANDERED_LIST to the atom to clarify the placement of wandered
+   blocks. */
 /* I. Rename flush-scan to scan-point, (flush-pos to flush-point?) */
 
 /* JNODE_FLUSH: MAIN ENTRY POINT */
-/* This is the main entry point for flushing a jnode and its dirty neighborhood (dirty
-   neighborhood is named "slum").  Jnode_flush() is called if reiser4 has to write dirty
-   blocks to disk, it happens when Linux VM decides to reduce number of dirty pages or as
-   a part of transaction commit.
-
-   Our objective here is to prep and flush the slum the jnode belongs to. We want to
-   squish the slum together, and allocate the nodes in it as we squish because allocation
-   of children affects squishing of parents.
-
-   The "argument" @node tells flush where to start.  From there, flush finds the left edge
-   of the slum, and calls squalloc (in which nodes are squeezed and allocated).  To find a
-   "better place" to start squalloc first we perform a flush_scan.
-
-   Flush-scanning may be performed in both left and right directions, but for different
-   purposes.  When scanning to the left, we are searching for a node that precedes a
-   sequence of parent-first-ordered nodes which we will then flush in parent-first order.
-   During flush-scanning, we also take the opportunity to count the number of consecutive
-   leaf nodes.  If this number is past some threshold (FLUSH_RELOCATE_THRESHOLD), then we
-   make a decision to reallocate leaf nodes (thus favoring write-optimization).
-
-   Since the flush argument node can be anywhere in a sequence of dirty leaves, there may
-   also be dirty nodes to the right of the argument.  If the scan-left operation does not
-   count at least FLUSH_RELOCATE_THRESHOLD nodes then we follow it with a right-scan
-   operation to see whether there is, in fact, enough nodes to meet the relocate
-   threshold.  Each right- and left-scan operation uses a single flush_scan object.
-
-   After left-scan and possibly right-scan, we prepare a flush_position object with the
-   starting flush point or parent coordinate, which was determined using scan-left.
-
-   Next we call the main flush routine, squalloc, which iterates along the
-   leaf level, squeezing and allocating nodes (and placing them into the flush queue).
+/* This is the main entry point for flushing a jnode and its dirty neighborhood
+   (dirty neighborhood is named "slum"). Jnode_flush() is called when reiser4
+   has to write dirty blocks to disk, which happens when the Linux VM decides
+   to reduce the number of dirty pages or as a part of transaction commit.
+
+   Our objective here is to prep and flush the slum the jnode belongs to. We
+   want to squish the slum together, and allocate the nodes in it as we squish
+   because allocation of children affects squishing of parents.
+
+   The "argument" @node tells flush where to start. From there, flush finds the
+   left edge of the slum, and calls squalloc (in which nodes are squeezed and
+   allocated). To find a "better place" to start squalloc, we first perform a
+   flush_scan.
+
+   Flush-scanning may be performed in both left and right directions, but for
+   different purposes. When scanning to the left, we are searching for a node
+   that precedes a sequence of parent-first-ordered nodes which we will then
+   flush in parent-first order. During flush-scanning, we also take the
+   opportunity to count the number of consecutive leaf nodes. If this number is
+   past some threshold (FLUSH_RELOCATE_THRESHOLD), then we make a decision to
+   reallocate leaf nodes (thus favoring write-optimization).
+
+   Since the flush argument node can be anywhere in a sequence of dirty leaves,
+   there may also be dirty nodes to the right of the argument. If the scan-left
+   operation does not count at least FLUSH_RELOCATE_THRESHOLD nodes then we
+   follow it with a right-scan operation to see whether there are, in fact,
+   enough nodes to meet the relocate threshold. Each right- and left-scan
+   operation uses a single flush_scan object.
+
+   After left-scan and possibly right-scan, we prepare a flush_position object
+   with the starting flush point or parent coordinate, which was determined
+   using scan-left.
+
+   Next we call the main flush routine, squalloc, which iterates along the leaf
+   level, squeezing and allocating nodes (and placing them into the flush
+   queue).
 
    After squalloc returns we take extra steps to ensure that all the children
    of the final twig node are allocated--this involves repeating squalloc
    until we finish at a twig with no unallocated children.
 
-   Finally, we call flush_empty_queue to submit write-requests to disk.  If we encounter
-   any above-twig nodes during flush_empty_queue that still have unallocated children, we
-   flush_unprep them.
-
-   Flush treats several "failure" cases as non-failures, essentially causing them to start
-   over.  E_DEADLOCK is one example.  FIXME:(C) EINVAL, E_NO_NEIGHBOR, ENOENT: these should
-   probably be handled properly rather than restarting, but there are a bunch of cases to
-   audit.
+   Finally, we call flush_empty_queue to submit write-requests to disk. If we
+   encounter any above-twig nodes during flush_empty_queue that still have
+   unallocated children, we flush_unprep them.
+
+   Flush treats several "failure" cases as non-failures, essentially causing
+   them to start over. E_DEADLOCK is one example.
+   FIXME:(C) EINVAL, E_NO_NEIGHBOR, ENOENT: these should probably be handled
+   properly rather than restarting, but there are a bunch of cases to audit.
 */
 
 static int
