On Thu, Nov 21, 2024 at 06:53, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Sun, Nov 17, 2024 at 09:34:53AM +0800, Stephen Zhang wrote:
> > On Mon, Nov 11, 2024 at 10:04, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Nov 08, 2024 at 09:34:17AM +0800, Stephen Zhang wrote:
> > > > On Mon, Nov 4, 2024 at 20:15, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > > On Mon, Nov 04, 2024 at 05:25:38PM +0800, Stephen Zhang wrote:
> > > > > > On Mon, Nov 4, 2024 at 11:32, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > > > > On Mon, Nov 04, 2024 at 09:44:34AM +0800, zhangshida wrote:
> > > > > >
> > > > > > [snip unnecessary stereotyping, accusations and repeated information]
> > > > >
> > > > > AFAICT, this "reserve AG space for inodes" behaviour that you are
> > > > > trying to achieve is effectively what the inode32 allocator already
> > > > > implements. By forcing inode allocation into the AGs below 1TB and
> > > > > preventing data from being allocated in those AGs until allocation
> > > > > in all the AGs above starts failing, it effectively provides the
> > > > > same functionality but without the constraints of a global first-fit
> > > > > allocation policy.
> > > > >
> > > > > We can do this with any AG by setting it up to prefer metadata,
> > > > > but given we already have the inode32 allocator we can run some
> > > > > tests to see if setting the metadata-preferred flag makes the
> > > > > existing allocation policies do what is needed.
> > > > >
> > > > > That is, mkfs a new 2TB filesystem with the same 344AG geometry as
> > > > > above, mount it with -o inode32 and run the workload that fragments
> > > > > all the free space. What we should see is that AGs in the upper TB
> > > > > of the filesystem should fill almost to full before any significant
> > > > > amount of allocation occurs in the AGs in the first TB of space.
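[For concreteness, the experiment asked for above could be set up along
the following lines; $dev/$mnt and the observation commands here are my
own assumptions, not from Dave's mail:

    mkfs.xfs -d size=2t,agcount=344 $dev
    mount -o inode32 $dev $mnt
    # ... run the free-space-fragmenting workload against $mnt ...
    # then compare per-AG free space, e.g. the first and the last AG:
    xfs_db -r -c "freesp -s -a 0" -c "freesp -s -a 343" $dev
]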
> > > Have you performed this experiment yet?
> > >
> > > I did not ask it idly, and I certainly did not ask it with the intent
> > > that we might implement inode32 with AFs. It is fundamentally
> > > impossible to implement inode32 with the proposed AF feature.
> > >
> > > The inode32 policy -requires- top-down data fill so that AG 0 is the
> > > *last to fill* with user data. The AF first-fit proposal guarantees
> > > bottom-up fill where AG 0 is the *first to fill* with user data.
> > >
> > > For example:
> > >
> > > > So for the inode32 algorithm:
> > > > 1. I need to specify a preferred ag, like ag 0:
> > > > |----------------------------
> > > > | ag 0 | ag 1 | ag 2 | ag 3 |
> > > > +----------------------------
> > > > 2. Someday space will be used up to 100%; then we have to growfs to ag 7:
> > > > +------+------+------+------+------+------+------+------+
> > > > | full | full | full | full | ag 4 | ag 5 | ag 6 | ag 7 |
> > > > +------+------+------+------+------+------+------+------+
> > > > 3. Specify another ag for inodes again.
> > > > 4. Repeat 1-3.
> > >
> > > Let's assume that AGs are 512GB each and so AGs 0 and 1 fill the
> > > entire lower 1TB of the filesystem. Hence if we get to all AGs full,
> > > the entire inode32 inode allocation space is full.
> > >
> > > Even if we grow the filesystem at this point, we still *cannot*
> > > allocate more inodes in the inode32 space. That space (AGs 0-1) is
> > > full even after the growfs. Hence we will still give ENOSPC, and
> > > that is -correct behaviour- because the inode32 policy requires this
> > > behaviour.
> > >
> > > IOWs, growfs and changing the AF bounds cannot fix ENOSPC on inode32
> > > when the inode space is exhausted. Only physically moving data out
> > > of the lower AGs can fix that problem...
> > >
> > > > for the AF algorithm:
> > > > mount -o af1=1 $dev $mnt
> > > > and we are done.
> > > > |<-----+ af 0 +----->|<af 1>|
> > > > |----------------------------
> > > > | ag 0 | ag 1 | ag 2 | ag 3 |
> > > > +----------------------------
> > > > Because the AF is a number relative to ag_count, after a growfs it
> > > > will become:
> > > > |<-----+ af 0 +--------------------------------->|<af 1>|
> > > > +------+------+------+------+------+------+------+------+
> > > > | full | full | full | full | ag 4 | ag 5 | ag 6 | ag 7 |
> > > > +------+------+------+------+------+------+------+------+
> > > > So just set it once, and it runs forever.
> > >
> > > That is actually the general solution to the original problem being
> > > reported. I realised this about half way through reading your
> > > original proposal. This is why I pointed out inode32 and the
> > > preferred-metadata mechanism in the AG allocator policies.
> > >
> > > That is, a general solution should only require the highest AG to be
> > > marked as metadata-preferred. All data allocation will then skip
> > > over the highest AG until there is no space left in any of the
> > > lower AGs. This behaviour will be enforced by the existing AG
> > > iteration allocation algorithms without any change being needed.
> > >
> > > Then when we grow the fs, we set the new highest AG to be metadata
> > > preferred, and that space will now be reserved for inodes until all
> > > other space is consumed.
> > >
> > > Do you now understand why I asked you to test whether the inode32
> > > mount option kept the data out of the lower AGs until the higher AGs
> > > were completely filled? It's because I wanted confirmation that the
> > > metadata-preferred flag would do what we need to implement a
> > > general solution for the problematic workload.
> >
> > Hi, I have tested the inode32 mount option. To my surprise, the
> > inode32 / metadata-preferred mechanism (referred to as inode32 for
> > the rest of this reply) doesn't implement the desired behavior that
> > the AF rule [1] does:
> > Lower AFs/AGs will do anything they can for allocation before going
> > to HIGHER/RESERVED AFs/AGs. [1]
>
> This isn't important or relevant to the experiment I asked you to
> perform and report the results of.
>
> I asked you to observe and report the filesystem fill pattern in
> your environment when metadata-preferred AGs are enabled. It isn't
> important whether inode32 exactly solves your problem; what I want
> to know is whether the underlying mechanism has sufficient control
> to provide a general solution that is always enabled.
>
> This is foundational engineering process: check that your hypotheses
> work as you expect before building more stuff on top of them, i.e.
> perform experiments to confirm your ideas will work before doing
> anything else.
>
> If you answer a request for an experiment to be run with "theory
> tells me it won't work" then you haven't understood why you were
> asked to run an experiment in the first place.
>

If I understand your reply correctly, then maybe my wording was the
problem. What I reported before was:

1. I have tested the inode32 option with metadata-preferred AGs
enabled (yes, I did check that the AG is set with
XFS_AGSTATE_PREFERS_METADATA). With the alternating-punching pattern
(sketched below), I observed that the preferred AG still gets
fragmented quickly, but the AF does not.
(That's what I meant in the first sentence of my previous reply...)

2. Then I tried to explain why it doesn't work in theory.
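For reference, the alternating-punching pattern in (1) is along these
lines (a minimal sketch; the file size, punch granularity and file name
are illustrative, not the exact ones from my test):

    # preallocate a contiguous file, then punch every other 4k block so
    # the freed space degenerates into single-block extents (assuming a
    # 4k fs block size):
    xfs_io -f -c "falloc 0 64m" $mnt/frag
    for ((off = 0; off < 67108864; off += 8192)); do
            xfs_io -c "fpunch $off 4096" $mnt/frag
    done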
Sorry for any misunderstanding caused by my unclear reply.

Cheers,
Shida

> If you can't run requested experiments or don't understand why an
> expert might be asking for that experiment to be run, then say so.
> I can explain in more detail, but I don't like to waste time on
> ideas that I can't confirm have a solid basis in reality...
>
> -Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
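P.S. Regarding "the AF is a number relative to ag_count" above: the
point of the relative numbering is that the fence needs no re-tuning
across grows. A minimal usage sketch (af1= is the mount option proposed
in this series; the AG numbers assume the 4-AG/8-AG diagrams above):

    mount -o af1=1 $dev $mnt   # AF 1 = the last AG, i.e. ag 3
    xfs_growfs $mnt            # now 8 AGs: AF 1 slides to ag 7,
                               # with no remount or option change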