On Thu, Nov 21, 2024 at 06:53, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Sun, Nov 17, 2024 at 09:34:53AM +0800, Stephen Zhang wrote:
> > On Mon, Nov 11, 2024 at 10:04, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Nov 08, 2024 at 09:34:17AM +0800, Stephen Zhang wrote:
> > > > On Mon, Nov 4, 2024 at 20:15, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > > On Mon, Nov 04, 2024 at 05:25:38PM +0800, Stephen Zhang wrote:
> > > > > > On Mon, Nov 4, 2024 at 11:32, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > > > > On Mon, Nov 04, 2024 at 09:44:34AM +0800, zhangshida wrote:
> > > > > >
> > > > > > [snip unnecessary stereotyping, accusations and repeated information]
> > > > >
> > > > > AFAICT, this "reserve AG space for inodes" behaviour that you are
> > > > > trying to achieve is effectively what the inode32 allocator already
> > > > > implements. By forcing inode allocation into the AGs below 1TB and
> > > > > preventing data from being allocated in those AGs until allocation
> > > > > in all the AGs above starts failing, it effectively provides the
> > > > > same functionality but without the constraints of a global first-fit
> > > > > allocation policy.
> > > > >
> > > > > We can do this with any AG by setting it up to prefer metadata,
> > > > > but given we already have the inode32 allocator we can run some
> > > > > tests to see if setting the metadata-preferred flag makes the
> > > > > existing allocation policies do what is needed.
> > > > >
> > > > > That is, mkfs a new 2TB filesystem with the same 344AG geometry as
> > > > > above, mount it with -o inode32 and run the workload that fragments
> > > > > all the free space. What we should see is that AGs in the upper TB
> > > > > of the filesystem should fill almost to full before any significant
> > > > > amount of allocation occurs in the AGs in the first TB of space.
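[For concreteness, the experiment asked for above could be set up along
the following lines; $dev/$mnt and the observation commands here are my
own assumptions, not from Dave's mail:

    mkfs.xfs -d size=2t,agcount=344 $dev
    mount -o inode32 $dev $mnt
    # ... run the free-space-fragmenting workload against $mnt ...
    # then compare per-AG free space, e.g. the first and the last AG:
    xfs_db -r -c "freesp -s -a 0" -c "freesp -s -a 343" $dev
]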
> > > Have you performed this experiment yet?
> > >
> > > I did not ask it idly, and I certainly did not ask it with the intent
> > > that we might implement inode32 with AFs. It is fundamentally
> > > impossible to implement inode32 with the proposed AF feature.
> > >
> > > The inode32 policy -requires- top-down data fill so that AG 0 is the
> > > *last to fill* with user data. The AF first-fit proposal guarantees
> > > bottom-up fill where AG 0 is the *first to fill* with user data.
> > >
> > > For example:
> > >
> > > > So for the inode32 algorithm:
> > > > 1. I need to specify a preferred ag, like ag 0:
> > > > |----------------------------
> > > > | ag 0 | ag 1 | ag 2 | ag 3 |
> > > > +----------------------------
> > > > 2. Someday space will be used up to 100%; then we have to growfs to ag 7:
> > > > +------+------+------+------+------+------+------+------+
> > > > | full | full | full | full | ag 4 | ag 5 | ag 6 | ag 7 |
> > > > +------+------+------+------+------+------+------+------+
> > > > 3. Specify another ag for inodes again.
> > > > 4. Repeat 1-3.
> > >
> > > Let's assume that AGs are 512GB each and so AGs 0 and 1 fill the
> > > entire lower 1TB of the filesystem. Hence if we get to all AGs full,
> > > the entire inode32 inode allocation space is full.
> > >
> > > Even if we grow the filesystem at this point, we still *cannot*
> > > allocate more inodes in the inode32 space. That space (AGs 0-1) is
> > > full even after the growfs. Hence we will still give ENOSPC, and
> > > that is -correct behaviour- because the inode32 policy requires this
> > > behaviour.
> > >
> > > IOWs, growfs and changing the AF bounds cannot fix ENOSPC on inode32
> > > when the inode space is exhausted. Only physically moving data out
> > > of the lower AGs can fix that problem...
> > >
> > > > for the AF algorithm:
> > > > mount -o af1=1 $dev $mnt
> > > > and we are done.
> > > > |<-----+ af 0 +----->|<af 1>|
> > > > |----------------------------
> > > > | ag 0 | ag 1 | ag 2 | ag 3 |
> > > > +----------------------------
> > > > Because the AF is a number relative to ag_count, after a growfs it
> > > > will become:
> > > > |<-----+ af 0 +--------------------------------->|<af 1>|
> > > > +------+------+------+------+------+------+------+------+
> > > > | full | full | full | full | ag 4 | ag 5 | ag 6 | ag 7 |
> > > > +------+------+------+------+------+------+------+------+
> > > > So just set it once, and it runs forever.
> > >
> > > That is actually the general solution to the original problem being
> > > reported. I realised this about half way through reading your
> > > original proposal. This is why I pointed out inode32 and the
> > > preferred-metadata mechanism in the AG allocator policies.
> > >
> > > That is, a general solution should only require the highest AG to be
> > > marked as metadata-preferred. All data allocation will then skip
> > > over the highest AG until there is no space left in any of the
> > > lower AGs. This behaviour will be enforced by the existing AG
> > > iteration allocation algorithms without any change being needed.
> > >
> > > Then when we grow the fs, we set the new highest AG to be metadata
> > > preferred, and that space will now be reserved for inodes until all
> > > other space is consumed.
> > >
> > > Do you now understand why I asked you to test whether the inode32
> > > mount option kept the data out of the lower AGs until the higher AGs
> > > were completely filled? It's because I wanted confirmation that the
> > > metadata-preferred flag would do what we need to implement a
> > > general solution for the problematic workload.
> >
> > Hi, I have tested the inode32 mount option. To my surprise, the
> > inode32 / metadata-preferred mechanism (referred to as inode32 for
> > the rest of this reply) doesn't implement the desired behavior that
> > the AF rule [1] does:
> > Lower AFs/AGs will do anything they can for allocation before going
> > to HIGHER/RESERVED AFs/AGs. [1]
>
> This isn't important or relevant to the experiment I asked you to
> perform and report the results of.
>
> I asked you to observe and report the filesystem fill pattern in
> your environment when metadata-preferred AGs are enabled. It isn't
> important whether inode32 exactly solves your problem; what I want
> to know is whether the underlying mechanism has sufficient control
> to provide a general solution that is always enabled.
>
> This is foundational engineering process: check that your hypotheses
> work as you expect before building more stuff on top of them, i.e.
> perform experiments to confirm your ideas will work before doing
> anything else.
>
> If you answer a request for an experiment to be run with "theory
> tells me it won't work" then you haven't understood why you were
> asked to run an experiment in the first place.
>

If I understand your reply correctly, then maybe my wording was the
problem. What I reported before was:

1. I have tested the inode32 option with metadata-preferred AGs
enabled (yes, I did check that the AG is set with
XFS_AGSTATE_PREFERS_METADATA). With the alternating-punching pattern
(sketched below), I observed that the preferred AG still gets
fragmented quickly, but the AF does not.
(That's what I meant in the first sentence of my previous reply...)

2. Then I tried to explain why it doesn't work in theory.
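For reference, the alternating-punching pattern in (1) is along these
lines (a minimal sketch; the file size, punch granularity and file name
are illustrative, not the exact ones from my test):

    # preallocate a contiguous file, then punch every other 4k block so
    # the freed space degenerates into single-block extents (assuming a
    # 4k fs block size):
    xfs_io -f -c "falloc 0 64m" $mnt/frag
    for ((off = 0; off < 67108864; off += 8192)); do
            xfs_io -c "fpunch $off 4096" $mnt/frag
    done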
Sorry for any misunderstanding caused by my unclear reply.

Cheers,
Shida

> If you can't run requested experiments or don't understand why an
> expert might be asking for that experiment to be run, then say so.
> I can explain in more detail, but I don't like to waste time on
> ideas that I can't confirm have a solid basis in reality...
>
> -Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
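P.S. Regarding "the AF is a number relative to ag_count" above: the
point of the relative numbering is that the fence needs no re-tuning
across grows. A minimal usage sketch (af1= is the mount option proposed
in this series; the AG numbers assume the 4-AG/8-AG diagrams above):

    mount -o af1=1 $dev $mnt   # AF 1 = the last AG, i.e. ag 3
    xfs_growfs $mnt            # now 8 AGs: AF 1 slides to ag 7,
                               # with no remount or option change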