On Fri, May 06, 2022 at 11:01:23PM -0700, Alli wrote: > On Fri, 2022-05-06 at 19:45 +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > > > We can't use the same algorithm for replacing an existing attribute > > when logging attributes. The existing algorithm is essentially: > > > > 1. create new attr w/ INCOMPLETE > > 2. atomically flip INCOMPLETE flags between old + new attribute > > 3. remove old attr which is marked w/ INCOMPLETE > > > > This algorithm guarantees that we see either the old or new > > attribute, and if we fail after the atomic flag flip, we don't have > > to recover the removal of the old attr because we never see > > INCOMPLETE attributes in lookups. > > > > For logged attributes, however, this does not work. The logged > > attribute intents do not track the work that has been done as the > > transaction rolls, and hence the only recovery mechanism we have is > > "run the replace operation from scratch". > > > > This is further exacerbated by the attempt to avoid needing the > > INCOMPLETE flag to create an atomic swap. This means we can create > > a second active attribute of the same name before we remove the > > original. If we fail at any point after the create but before the > > removal has completed, we end up with duplicate attributes in > > the attr btree and recovery only tries to replace one of them. > > > > There are several other failure modes where we can leave partially > > allocated remote attributes that expose stale data, partially free > > remote attributes that enable UAF based stale data exposure, etc. > > > > TO fix this, we need a different algorithm for replace operations > > when LARP is enabled. Luckily, it's not that complex if we take the > > right first step. That is, the first thing we log is the attri > > intent with the new name/value pair and mark the old attr as > > INCOMPLETE in the same transaction. > > > > From there, we then remove the old attr and keep relogging the > > new name/value in the intent, such that we always know that we have > > to create the new attr in recovery. Once the old attr is removed, > > we then run a normal ATTR_CREATE operation relogging the intent as > > we go. If the new attr is local, then it gets created in a single > > atomic transaction that also logs the final intent done. If the new > > attr is remote, the we set INCOMPLETE on the new attr while we > > allocate and set the remote value, and then we clear the INCOMPLETE > > flag at in the last transaction taht logs the final intent done. > > > > If we fail at any point in this algorithm, log recovery will always > > see the same state on disk: the new name/value in the intent, and > > either an INCOMPLETE attr or no attr in the attr btree. If we find > > an INCOMPLETE attr, we run the full replace starting with removing > > the INCOMPLETE attr. If we don't find it, then we simply create the > > new attr. > > > > Notably, recovery of a failed create that has an INCOMPLETE flag set > > is now the same - we start with the lookup of the INCOMPLETE attr, > > and if that exists then we do the full replace recovery process, > > otherwise we just create the new attr. > > > > Hence changing the way we do the replace operation when LARP is > > enabled allows us to use the same log recovery algorithm for both > > the ATTR_CREATE and ATTR_REPLACE operations. This is also the same > > algorithm we use for runtime ATTR_REPLACE operations (except for the > > step setting up the initial conditions). > > > > The result is that: > > > > - ATTR_CREATE uses the same algorithm regardless of whether LARP is > > enabled or not > > - ATTR_REPLACE with larp=0 is identical to the old algorithm > > - ATTR_REPLACE with larp=1 runs an unmodified attr removal algorithm > > from the larp=0 code and then runs the unmodified ATTR_CREATE > > code. > > - log recovery when larp=1 runs the same ATTR_REPLACE algorithm as > > it uses at runtime. > > > > Because the state machine is now quite clean, changing the algorithm > > is really just a case of changing the initial state and how the > > states link together for the ATTR_REPLACE case. Hence it's not a > > huge amoutn of code for what is a fairly substantial rework > > of the attr logging and recovery algorithm.... > > > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > This looks mostly good, though when I run it with the new tests, I seem > to run into a failed filesystem check at the end. "bad attribute count > 0 in attr block 0". I suspect we still dont have the removal of the > fork quite right. It sounds like you're still working with things > though, I'll keep looking too. Yeah, that's a strange one. I didn't get it with the based branch testing (based on 5.18-rc+for-next) but over the weekend where it got merged with 5.18-rc5 it appears that the error has manifested in several test runs. I'll dig into it this morning. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx