On Fri, Apr 05, 2019 at 01:54:47PM -0700, Darrick J. Wong wrote: > On Fri, Apr 05, 2019 at 09:07:39AM -0400, Brian Foster wrote: > > On Thu, Apr 04, 2019 at 11:01:33AM -0700, Darrick J. Wong wrote: > > > On Thu, Apr 04, 2019 at 07:50:11AM -0400, Brian Foster wrote: > > > > On Mon, Apr 01, 2019 at 10:11:12AM -0700, Darrick J. Wong wrote: > > > > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > > > > > > > > > Now that we have the ability to track sick metadata in-core, make scrub > > > > > and repair update those health assessments after doing work. > > > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > > > > --- > > > > > fs/xfs/Makefile | 1 > > > > > fs/xfs/scrub/health.c | 180 +++++++++++++++++++++++++++++++++++++++++++++++++ > > > > > fs/xfs/scrub/health.h | 12 +++ > > > > > fs/xfs/scrub/scrub.c | 8 ++ > > > > > fs/xfs/scrub/scrub.h | 11 +++ > > > > > 5 files changed, 212 insertions(+) > > > > > create mode 100644 fs/xfs/scrub/health.c > > > > > create mode 100644 fs/xfs/scrub/health.h > > > > > > > > > > > > > > ... > > > > > diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c > > > > > new file mode 100644 > > > > > index 000000000000..dd9986500801 > > > > > --- /dev/null > > > > > +++ b/fs/xfs/scrub/health.c > > > > > @@ -0,0 +1,180 @@ > > > > ... > > > > > +/* Update filesystem health assessments based on what we found and did. */ > > > > > +void > > > > > +xchk_update_health( > > > > > + struct xfs_scrub *sc, > > > > > + bool already_fixed) > > > > > +{ > > > > > + /* > > > > > + * If the scrubber finds errors, we mark sick whatever's mentioned in > > > > > + * sick_mask, no matter whether this is a first scan or an evaluation > > > > > + * of repair effectiveness. > > > > > + * > > > > > + * If there is no direct corruption and we're called after a repair, > > > > > + * clear whatever's in heal_mask because that's what we fixed. > > > > > + * > > > > > + * Otherwise, there's no direct corruption and we didn't repair > > > > > + * anything, so mark whatever's in sick_mask as healthy. > > > > > + */ > > > > > + if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) > > > > > + xchk_mark_sick(sc, sc->sick_mask); > > > > > + else if (already_fixed) > > > > > + xchk_mark_healthy(sc, sc->heal_mask); > > > > > + else > > > > > + xchk_mark_healthy(sc, sc->sick_mask); > > > > > +} > > > > > > > > Hmm, I think I follow what we're doing here but it's a bit confusing > > > > without the additional context of where these bits will be set/cleared > > > > at the lower scrub layers (or at least without an example). Some > > > > questions on that below... > > > > > > > > ... > > > > > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c > > > > > index 1b2344d00525..b1519dfc5811 100644 > > > > > --- a/fs/xfs/scrub/scrub.c > > > > > +++ b/fs/xfs/scrub/scrub.c > > > > > @@ -40,6 +40,7 @@ > > > > > #include "scrub/trace.h" > > > > > #include "scrub/btree.h" > > > > > #include "scrub/repair.h" > > > > > +#include "scrub/health.h" > > > > > > > > > > /* > > > > > * Online Scrub and Repair > > > > > @@ -468,6 +469,7 @@ xfs_scrub_metadata( > > > > > { > > > > > struct xfs_scrub sc; > > > > > struct xfs_mount *mp = ip->i_mount; > > > > > + unsigned int heal_mask; > > > > > bool try_harder = false; > > > > > bool already_fixed = false; > > > > > int error = 0; > > > > > @@ -488,6 +490,7 @@ xfs_scrub_metadata( > > > > > error = xchk_validate_inputs(mp, sm); > > > > > if (error) > > > > > goto out; > > > > > + heal_mask = xchk_health_mask_for_scrub_type(sm->sm_type); > > > > > > > > > > xchk_experimental_warning(mp); > > > > > > > > > > @@ -499,6 +502,8 @@ xfs_scrub_metadata( > > > > > sc.ops = &meta_scrub_ops[sm->sm_type]; > > > > > sc.try_harder = try_harder; > > > > > sc.sa.agno = NULLAGNUMBER; > > > > > + sc.heal_mask = heal_mask; > > > > > + sc.sick_mask = xchk_health_mask_for_scrub_type(sm->sm_type); > > > > > > > > Ok, so we initialize the heal/sick masks based on the scrub type that > > > > was requested on the first pass through... > > > > > > > > > error = sc.ops->setup(&sc, ip); > > > > > if (error) > > > > > goto out_teardown; > > > > > @@ -519,6 +524,8 @@ xfs_scrub_metadata( > > > > > } else if (error) > > > > > goto out_teardown; > > > > > > > > > > + xchk_update_health(&sc, already_fixed); > > > > > + > > > > > > > > ... then update the in-core fs health state based on the sick mask. Is > > > > it possible for the scrub operation to set more sick mask bits based on > > > > what it finds? > > > > > > Theoretically, yes, but in practice none of the current scrubbers need > > > to touch sick_mask. > > > > > > heal_mask, OTOH, will be adjusted by the free space / inode repair > > > functions since they rebuild multiple structures. > > > > > > > Ok.. > > > > > > More specifically, I'm wondering why the masks wouldn't start as zero > > > > and toggle based on finding/fixing corruption(s). > > > > > > sick_mask is also the mask we feed to xfs_*_mark_healthy if the scan > > > returns clean, which is why we set the default value before dispatching > > > the scrub. > > > > > > > Or if the sick mask value is essentially fixed, whether we need to > > > > store it in the xfs_scrub context... > > > > > > We could probably get away with generating it in xchk_update_health at > > > the end, but it feels weird to have heal_mask in the scrub context but > > > sick_mask gets auto-generated. > > > > > > > Ok.. hmm. Both feel a little weird to me, but this is really just an > > aesthetic/factoring thing so I'll think about it a bit more and come > > back to it. > > > > > > > > > > > if ((sc.sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) && !already_fixed) { > > > > > bool needs_fix; > > > > > > > > > > @@ -551,6 +558,7 @@ xfs_scrub_metadata( > > > > > xrep_failure(mp); > > > > > goto out; > > > > > } > > > > > + heal_mask = sc.heal_mask; > > > > > > > > And if we end up doing a repair, we presumably can repair multiple > > > > things and so we track that separately and persist the heal mask across > > > > a potential retry. > > > > > > Right. > > > > > > > What about the case where we don't retry, but scrub finds something > > > > and then immediately repairs it? > > > > > > The repair jumps back to retry_op if either (a) we couldn't get all the > > > resources we needed and therefore sc.try_harder = true and we need to > > > start over; or (b) repair thinks it fixed a thing, so we need to scrub > > > the thing again to see if it's really fixed... > > > > > > > Should we update the fs state after both detecting and clearing the > > > > problem, or does that happen elsewhere? > > > > > > ...so if scrub immediately repairs a thing, we preserve heal_mask, jump > > > back to the scrub, and if the scrub says clean we'll mark heal mask > > > healthy. > > > > > > If the repair has to retry then the we'll call the repair function > > > again, which (presumably) will set (again) the heal_mask appropriately, > > > and then we have the same post-repair state updating as above. > > > > > > Does that make sense? :) > > > > > > > Ah, Ok. I didn't realize that a successful repair looped back to the > > scrub code (and thus the health update). Yes, that makes more sense. > > > > > > Also, if repair can potentially clear multiple bits, what's the > > > > possibility of a repair clearing one failure and then failing on > > > > another, causing the broader repair op to return an error or jump into > > > > this retry? > > > > > > Scrub doesn't touch the fs health state at all until after the ->scrub > > > or ->repair function succeeds. If the scrub or the repair functions > > > fail for any non-retry reason, we back out to userspace without updating > > > anything. It's as if we'd never called the failed function. > > > > > > > Right.. what I was getting at above is seeing whether we'd actually > > update partial repair state in-core. E.g., suppose things A and B are > > faulted in-core and it's one of these cases where repair can fix A and B > > at the same time. If it fixes thing A and fails on thing B, it sounds > > like we'd not clear the in-core fault state on A even though it's > > technically repaired. > > Hmm. If the repair function returns a runtime error (having fixed A but > not B) then yes, we won't clear the incore fault state on A (or B) even > though we fixed A. Something weird happened, so we shouldn't be too > hasty to clear things. A subsequent re-scrub of A will clear the fault > on A, though. > Ok. Indeed, it doesn't seem that unreasonable to me for an operational error to fail to clear health state for something that was repaired. > OTOH... if the A/B repair function returns 0 having fixed A but left B > corrupt, the rescan will see that A is fine and (incorrectly) clear both > A and B. I would say that's a bug, so maybe I should rethink the need > for sick_mask and heal_mask. > That one sounds more dodgy. ;P > That said, a normal xfs_scrub run will check (or have already checked) B > and noticed that it was corrupt, so it will circle back and try to fix B > separately, so in a sense we don't really need heal_mask at all. > Ok.. > > > Maybe some worked examples will help? > > > > > > Let's say both inode btrees are corrupt. We run xfs_scrub -n, > > > xchk_inobt will record the corruption, and (assuming it hits no runtime > > > errors) once we return to xfs_scrub_metadata, it'll set > > > XFS_SICK_AG_INOBT. Presumably xfs_scrub will also call the finobt scrub > > > and SICK_AG_FINOBT will also get set. > > > > > > If we run xfs_scrub without the -n, xchk_inobt will record the > > > corruption and set SICK_AG_INOBT per above. Then it'll run xrep_inobt, > > > which will set heal_mask to SICK_AG_INOBT | SICK_AG_FINOBT. If the > > > repair fails with a non-retry runtime error, we exit to userspace and > > > ignore heal_mask. > > > > > > > Ok, this sounds like the case I'm theorizing about above (where suppose > > repair fixed the inobt and then failed on the finobt, but hasn't cleared > > faults for either..). > > > > > If instead the repair succeeds, we scan the inobt again. If that comes > > > up clear then we use heal_mask to clear SICK_AG_INOBT | SICK_AG_FINOBT. > > > xfs_scrub will call again later to repair the finobt, but the initial > > > finobt scan will see no errors in the finobt, clear SICK_AG_FINOBT > > > (which isn't set) and exit. > > > > > > > So it sounds like the state would have to be cleared by a subsequent > > scrub request. The scan would find thing A healthy and mark it so > > regardless, to clear any potential previous faults that might have > > already been repaired. Right? > > Right. > > > > If the inobt repair function is buggy and says it repaired the inode > > > btrees but leaves corruptions, then the rescan of the inobt will notice > > > and set SICK_AG_INOBT (which is already set) and exit. Similarly, when > > > xfs_scrub calls back about the finobt, it will notice the corrupt > > > finobt, try to set SICK_AG_FINOBT (also already set), try to fix it, and > > > the rescan of the finobt will notice that the finobt is still corrupt > > > and try to set SICK_AG_FINOBT (which is still set). > > > > > > The end result (I think) is that we always set the sick bits if a scan > > > shows problems, and we only clear the sick bits for things if we can > > > prove that the things are no longer sick. Does that help? > > > > > > > Yes, thanks for the explanation. I think the confusion is mostly due to > > not being able to fully see how these scrub states are managed, > > particularly the bits that warranted the creation of separate masks in > > the first place. > > You've convinced me that this patch is too convoluted to understand, so > I think I want to simplify it some more. First, I'd rename the field > to "sick_mask_update" and change the behavior so that we: > > 1. Set sick_mask_update to the default XFS_SICK flag for this scrub > type (call it A). (We already do this) > > 2. If the scrubber returns an error code, we exit making no changes to > the incore sick state. > > 3. If the scrubber finds that A is clean, clear the incore sick flags > that are set in s_m_u and exit. > > 4. If the scrubber finds that A is corrupt, set the incore sick flags > that are set in s_m_u. > > a. If the user doesn't want to repair, then we exit, having > previously set incore sick flags. > > 5. Now we know that A is corrupt and the user wants to repair. > If repair returns an error code, we exit with that error code, having > made no further changes to the incore sick state. > > 6. If repair rebuilds both A & B correctly and the re-scrub of A is > clean, we'll clear the incore sick flags using s_m_u. This should > clear A. > > 7. If repair rebuilds both A & B and screws up A, the re-scrub will find > it corrupt and leave the sick flags as they are, which is to say that > A is marked sick. > > 8. If repair rebuilds A correctly but leaves B corrupt, the re-scrub of > A will be clean and we'll clear the incore sick flags using s_m_u. > This should clear A, even though B is corrupt. > > 9. No matter whether we encountered scenarios 6, 7, or 8, if xfs_scrub > previously scrubbed B and found it corrupt, it will call again to > repair B, which will set the incore sick state appropriately. If > xfs_scrub has not yet scrubbed B then it will call later to scrub B, > which will set the incore sick state appropriately. > > I hope that's easier to understand... > It sounds like the primary difference here is trading off the ability to clear both A and B flags at the same time during a scrub+repair of A, and rather rely on the separate scrub of B to detect that B is no longer corrupt. That sounds much more straightforward to me provided it works well enough with the userspace tool (i.e., xfs_scrub will eventually mark B healthy before it returns either way). It simplifies the tracking and if we consider the normal sequence for a corrupted thing should be scrub(A) -> setcorrupt(A) -> repair(A) -> scrub(A) -> sethealthy(A), then clearing in-core sick state of B at the end kind of violates the model where we'd expect another scrub(B) to take place first. Brian > > This does still have me wondering if separate masks are necessary, if we > > perhaps had more selective health update logic, for example. I think it > > might be better to either bundle this patch with whatever other changes > > actually make use of the separate masks, or alternatively to simplify > > the current logic and just defer the separate mask thing until those > > more complex repair algorithms come along.. > > --D > > > Brian > > > > > > ISTM that it might be possible to skip clearing one fail state bit so > > > > long as the original thing remained corrupted, but I feel like I'm > > > > still missing some context on the bigger picture scrub tracking... > > > > > > Yeah, the state machine is pretty squirrely. :/ > > > > > > --D > > > > > > > Brian > > > > > > > > > goto retry_op; > > > > > } > > > > > } > > > > > diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h > > > > > index 22f754fba8e5..05f1ad242a35 100644 > > > > > --- a/fs/xfs/scrub/scrub.h > > > > > +++ b/fs/xfs/scrub/scrub.h > > > > > @@ -62,6 +62,17 @@ struct xfs_scrub { > > > > > struct xfs_inode *ip; > > > > > void *buf; > > > > > uint ilock_flags; > > > > > + > > > > > + /* Metadata to be marked sick if scrub finds errors. */ > > > > > + unsigned int sick_mask; > > > > > + > > > > > + /* > > > > > + * Metadata to be marked healthy if repair fixes errors. Some repair > > > > > + * functions can fix multiple data structures at once, so we have to > > > > > + * treat sick and heal masks separately. > > > > > + */ > > > > > + unsigned int heal_mask; > > > > > + > > > > > bool try_harder; > > > > > bool has_quotaofflock; > > > > > > > > > >