Hi Dave, On Wed, Nov 25, 2020 at 09:16:23AM +1100, Dave Chinner wrote: > On Tue, Nov 24, 2020 at 11:51:30PM +0800, Gao Xiang wrote: > > Move the main loop out into a separated function, so we can save > > many extra xfs_perag_put()s and gotoes to make the logic cleaner. > > > > Also it can make the modification of perag protection by some lock > > for shrinking in the future somewhat easier. > > > > Signed-off-by: Gao Xiang <hsiangkao@xxxxxxxxxx> > > --- > > It tries to kill multiple goto exits... which makes the logic hard > > to read and modify. > > > > not quite sure the name of __xfs_dialloc(), cannot think of some > > better name since xfs_dialloc_ag is used... > > > > fs/xfs/libxfs/xfs_ialloc.c | 166 ++++++++++++++++++++----------------- > > 1 file changed, 88 insertions(+), 78 deletions(-) > > > > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c > > index 5c8b0210aad3..937455c50570 100644 > > --- a/fs/xfs/libxfs/xfs_ialloc.c > > +++ b/fs/xfs/libxfs/xfs_ialloc.c > > @@ -1681,6 +1681,83 @@ xfs_dialloc_ag( > > return error; > > } > > > > +/* > > + * Return 0 for successfully allocating some inodes in this AG; > > + * 1 for skipping to allocating in the next AG; > > + * < 0 for error code. > > + */ > > +static int > > +__xfs_dialloc( > > FWIW, we try to avoid "__" prefixes when factoring code out. The new > function should be obvious in it's function to have a proper name. > > In this case, we are going to try to allocate an inode in the > provided AG. So xfs_dialloc_try_ag() would seem appropriate. Ok, Thanks! I was asking for a real name :) > > Also, in that case, an -EAGAIN or -EBUSY error might be more > appropriate for a "cannot allocate in this AG" soft failure rather > than having a tri-state return value. We should not get either of > those errors from inode allocation, so such an error would indicate > to the caller that is should just try the next AG... I thought using some error code as well, yet my perference finally about this could be a functional feature (e.g. non-RCU / completely retry by using -ECHILD / -ESTALE in lookup code). For such 2 level internal seperate functions, I'd like to use some tri-state return value in order to avoid breaking some potential or upcoming functional features. Yeah, but it's just my own thought, anyway. I will switch to use some error code in the next version. > > However, I think there's a higher layer cleanup than needs to be > done here first.... > > > @@ -1711,7 +1788,6 @@ xfs_dialloc( > > xfs_ino_t *inop) > > { > > struct xfs_mount *mp = tp->t_mountp; > > - struct xfs_buf *agbp; > > xfs_agnumber_t agno; > > int error; > > bool noroom = false; > > @@ -1726,8 +1802,9 @@ xfs_dialloc( > > * continue where we left off before. In this case, we > > * know that the allocation group has free inodes. > > */ > > - agbp = *IO_agbp; > > - goto out_alloc; > > + error = xfs_dialloc_ag(tp, *IO_agbp, parent, inop); > > + *IO_agbp = NULL; > > + return error; > > } > > This whole "ialloc_context/IO_agbp" thing has always been a nasty > wart in this code. I think getting rid that wart then makes this > change a lot cleaner and more obvious. > > Completely untested patch below for you to start from... Yeah, I didn't think much more about it to really change the logic. My starting point was that I always feel a bit more uncomfortable about randomly add some perag lock/unlock for such like function (with random multiple exits, and xfs_alloc.c extent allocation almost the same...). Hard to review and maybe potential unbalanced bugs here.... Ok, Thanks for the patch/direction! let me play with it for a while, and if you don't care I'll make it together with the series (although the whole progress could be somewhat slow though...) Thanks, Gao Xiang > > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx > > > xfs: rework inode allocation interface > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > Get rid of the nasty, confusing ialloc_context and messy ENOSPC and > failure handling around xfs_ialloc() and xfs_dialloc() by separating > free inode chunk allocation and inode allocation into two individual > high level operations. > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > --- > fs/xfs/libxfs/xfs_ialloc.c | 132 +++++++++++++----------- > fs/xfs/libxfs/xfs_ialloc.h | 42 ++++---- > fs/xfs/xfs_inode.c | 245 +++++++++++---------------------------------- > 3 files changed, 151 insertions(+), 268 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c > index 974e71bc4a3a..011813934ee1 100644 > --- a/fs/xfs/libxfs/xfs_ialloc.c > +++ b/fs/xfs/libxfs/xfs_ialloc.c > @@ -1570,7 +1570,7 @@ xfs_dialloc_ag_update_inobt( > * The caller selected an AG for us, and made sure that free inodes are > * available. > */ > -STATIC int > +int > xfs_dialloc_ag( > struct xfs_trans *tp, > struct xfs_buf *agbp, > @@ -1682,35 +1682,70 @@ xfs_dialloc_ag( > return error; > } > > +static int > +xfs_dialloc_roll( > + struct xfs_trans **tpp, > + struct xfs_buf *agibp) > +{ > + struct xfs_trans *tp = *tpp; > + void *dqinfo = NULL; > + unsigned int tflags = 0; > + int error; > + > + /* > + * Hold to on to the agibp across the commit so no other allocation can > + * come in and take the free inodes we just allocated for our caller. > + */ > + xfs_trans_bhold(tp, agibp); > + > + /* > + * We want the quota changes to be associated with the next transaction, > + * NOT this one. So, detach the dqinfo from this and attach it to the > + * next transaction. > + */ > + if (tp->t_dqinfo) { > + dqinfo = tp->t_dqinfo; > + tp->t_dqinfo = NULL; > + tflags = tp->t_flags & XFS_TRANS_DQ_DIRTY; > + tp->t_flags &= ~(XFS_TRANS_DQ_DIRTY); > + } > + > + error = xfs_trans_roll(&tp); > + > + /* Re-attach the quota info that we detached from prev trx. */ > + if (dqinfo) { > + tp->t_dqinfo = dqinfo; > + tp->t_flags |= tflags; > + } > + > + *tpp = tp; > + if (error) { > + xfs_buf_relse(agibp); > + return error; > + } > + xfs_trans_bjoin(tp, agibp); > + return 0; > +} > + > /* > - * Allocate an inode on disk. > + * Select and prepare an AG for inode allocation. > * > - * Mode is used to tell whether the new inode will need space, and whether it > - * is a directory. > + * Mode is used to tell whether the new inode is a directory and hence where to > + * locate it. > * > - * This function is designed to be called twice if it has to do an allocation > - * to make more free inodes. On the first call, *IO_agbp should be set to NULL. > - * If an inode is available without having to performn an allocation, an inode > - * number is returned. In this case, *IO_agbp is set to NULL. If an allocation > - * needs to be done, xfs_dialloc returns the current AGI buffer in *IO_agbp. > - * The caller should then commit the current transaction, allocate a > - * new transaction, and call xfs_dialloc() again, passing in the previous value > - * of *IO_agbp. IO_agbp should be held across the transactions. Since the AGI > - * buffer is locked across the two calls, the second call is guaranteed to have > - * a free inode available. > - * > - * Once we successfully pick an inode its number is returned and the on-disk > - * data structures are updated. The inode itself is not read in, since doing so > - * would break ordering constraints with xfs_reclaim. > + * This function will ensure that the selected AG has free inodes available to > + * allocate from. The selected AGI will be returned locked to the caller, and it > + * will allocate more free inodes if required. If no free inodes are found or > + * can be allocated, no AGI will be returned. > */ > int > -xfs_dialloc( > - struct xfs_trans *tp, > +xfs_dialloc_select_ag( > + struct xfs_trans **tpp, > xfs_ino_t parent, > umode_t mode, > - struct xfs_buf **IO_agbp, > - xfs_ino_t *inop) > + struct xfs_buf **IO_agbp) > { > + struct xfs_trans *tp = *tpp; > struct xfs_mount *mp = tp->t_mountp; > struct xfs_buf *agbp; > xfs_agnumber_t agno; > @@ -1722,25 +1757,10 @@ xfs_dialloc( > struct xfs_ino_geometry *igeo = M_IGEO(mp); > int okalloc = 1; > > - if (*IO_agbp) { > - /* > - * If the caller passes in a pointer to the AGI buffer, > - * continue where we left off before. In this case, we > - * know that the allocation group has free inodes. > - */ > - agbp = *IO_agbp; > - goto out_alloc; > - } > - > - /* > - * We do not have an agbp, so select an initial allocation > - * group for inode allocation. > - */ > + *IO_agbp = NULL; > start_agno = xfs_ialloc_ag_select(tp, parent, mode); > - if (start_agno == NULLAGNUMBER) { > - *inop = NULLFSINO; > + if (start_agno == NULLAGNUMBER) > return 0; > - } > > /* > * If we have already hit the ceiling of inode blocks then clear > @@ -1773,7 +1793,7 @@ xfs_dialloc( > if (!pag->pagi_init) { > error = xfs_ialloc_pagi_init(mp, tp, agno); > if (error) > - goto out_error; > + break; > } > > /* > @@ -1788,11 +1808,12 @@ xfs_dialloc( > */ > error = xfs_ialloc_read_agi(mp, tp, agno, &agbp); > if (error) > - goto out_error; > + break; > > if (pag->pagi_freecount) { > xfs_perag_put(pag); > - goto out_alloc; > + *IO_agbp = agbp; > + return 0; > } > > if (!okalloc) > @@ -1802,46 +1823,39 @@ xfs_dialloc( > error = xfs_ialloc_ag_alloc(tp, agbp, &ialloced); > if (error) { > xfs_trans_brelse(tp, agbp); > - > if (error != -ENOSPC) > - goto out_error; > - > - xfs_perag_put(pag); > - *inop = NULLFSINO; > - return 0; > + break; > } > > if (ialloced) { > /* > - * We successfully allocated some inodes, return > - * the current context to the caller so that it > - * can commit the current transaction and call > - * us again where we left off. > + * We successfully allocated some inodes, so roll the > + * transaction and return the locked AGI buffer to the > + * caller so they can allocate one of the free inodes we > + * just prepared for them. > */ > ASSERT(pag->pagi_freecount > 0); > xfs_perag_put(pag); > > + error = xfs_dialloc_roll(&tp, agbp); > + *tpp = tp; > + if (error) > + return error; > *IO_agbp = agbp; > - *inop = NULLFSINO; > return 0; > } > > nextag_relse_buffer: > - xfs_trans_brelse(tp, agbp); > + xfs_trans_brelse(*tpp, agbp); > nextag: > xfs_perag_put(pag); > if (++agno == mp->m_sb.sb_agcount) > agno = 0; > if (agno == start_agno) { > - *inop = NULLFSINO; > return noroom ? -ENOSPC : 0; > } > } > > -out_alloc: > - *IO_agbp = NULL; > - return xfs_dialloc_ag(tp, agbp, parent, inop); > -out_error: > xfs_perag_put(pag); > return error; > } > diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h > index 72b3468b97b1..d8de4b9f6603 100644 > --- a/fs/xfs/libxfs/xfs_ialloc.h > +++ b/fs/xfs/libxfs/xfs_ialloc.h > @@ -37,30 +37,26 @@ xfs_make_iptr(struct xfs_mount *mp, struct xfs_buf *b, int o) > * Mode is used to tell whether the new inode will need space, and whether > * it is a directory. > * > - * To work within the constraint of one allocation per transaction, > - * xfs_dialloc() is designed to be called twice if it has to do an > - * allocation to make more free inodes. If an inode is > - * available without an allocation, agbp would be set to the current > - * agbp and alloc_done set to false. > - * If an allocation needed to be done, agbp would be set to the > - * inode header of the allocation group and alloc_done set to true. > - * The caller should then commit the current transaction and allocate a new > - * transaction. xfs_dialloc() should then be called again with > - * the agbp value returned from the previous call. > - * > - * Once we successfully pick an inode its number is returned and the > - * on-disk data structures are updated. The inode itself is not read > - * in, since doing so would break ordering constraints with xfs_reclaim. > - * > - * *agbp should be set to NULL on the first call, *alloc_done set to FALSE. > + * There are two phases to inode allocation: selecting an AG and ensuring > + * that it contains free inodes, followed by allocating one of the free > + * inodes. xfs_dialloc_select_ag() does the former and returns a locked AGI > + * to the caller, ensuring that followup call to xfs_dialloc_ag() will > + * have free inodes to allocate from. xfs_dialloc_ag() will return the inode > + * number of the free inode we allocated. > */ > -int /* error */ > -xfs_dialloc( > - struct xfs_trans *tp, /* transaction pointer */ > - xfs_ino_t parent, /* parent inode (directory) */ > - umode_t mode, /* mode bits for new inode */ > - struct xfs_buf **agbp, /* buf for a.g. inode header */ > - xfs_ino_t *inop); /* inode number allocated */ > +int > +xfs_dialloc_select_ag( > + struct xfs_trans **tpp, > + xfs_ino_t parent, > + umode_t mode, > + struct xfs_buf **agibp); > + > +int > +xfs_dialloc_ag( > + struct xfs_trans *tp, > + struct xfs_buf *agbp, > + xfs_ino_t parent, > + xfs_ino_t *inop); > > /* > * Free disk inode. Carefully avoids touching the incore inode, all > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > index 2bfbcf28b1bd..2a06ca4dee2b 100644 > --- a/fs/xfs/xfs_inode.c > +++ b/fs/xfs/xfs_inode.c > @@ -761,68 +761,28 @@ xfs_inode_inherit_flags2( > } > > /* > - * Allocate an inode on disk and return a copy of its in-core version. > - * The in-core inode is locked exclusively. Set mode, nlink, and rdev > - * appropriately within the inode. The uid and gid for the inode are > - * set according to the contents of the given cred structure. > - * > - * Use xfs_dialloc() to allocate the on-disk inode. If xfs_dialloc() > - * has a free inode available, call xfs_iget() to obtain the in-core > - * version of the allocated inode. Finally, fill in the inode and > - * log its initial contents. In this case, ialloc_context would be > - * set to NULL. > - * > - * If xfs_dialloc() does not have an available inode, it will replenish > - * its supply by doing an allocation. Since we can only do one > - * allocation within a transaction without deadlocks, we must commit > - * the current transaction before returning the inode itself. > - * In this case, therefore, we will set ialloc_context and return. > - * The caller should then commit the current transaction, start a new > - * transaction, and call xfs_ialloc() again to actually get the inode. > - * > - * To ensure that some other process does not grab the inode that > - * was allocated during the first call to xfs_ialloc(), this routine > - * also returns the [locked] bp pointing to the head of the freelist > - * as ialloc_context. The caller should hold this buffer across > - * the commit and pass it back into this routine on the second call. > - * > - * If we are allocating quota inodes, we do not have a parent inode > - * to attach to or associate with (i.e. pip == NULL) because they > - * are not linked into the directory structure - they are attached > - * directly to the superblock - and so have no parent. > + * Initialise a newly allocated inode and return the in-core inode to the > + * caller locked exclusively. > */ > static int > xfs_ialloc( > - xfs_trans_t *tp, > - xfs_inode_t *pip, > - umode_t mode, > - xfs_nlink_t nlink, > - dev_t rdev, > - prid_t prid, > - xfs_buf_t **ialloc_context, > - xfs_inode_t **ipp) > -{ > - struct xfs_mount *mp = tp->t_mountp; > - xfs_ino_t ino; > - xfs_inode_t *ip; > - uint flags; > - int error; > - struct timespec64 tv; > - struct inode *inode; > + struct xfs_trans *tp, > + struct xfs_inode *pip, > + umode_t mode, > + xfs_nlink_t nlink, > + dev_t rdev, > + prid_t prid, > + struct xfs_inode **ipp) > +{ > + struct xfs_mount *mp = tp->t_mountp; > + struct xfs_inode *ip; > + xfs_ino_t ino; > + uint flags; > + int error; > + struct timespec64 tv; > + struct inode *inode; > > - /* > - * Call the space management code to pick > - * the on-disk inode to be allocated. > - */ > - error = xfs_dialloc(tp, pip ? pip->i_ino : 0, mode, > - ialloc_context, &ino); > - if (error) > - return error; > - if (*ialloc_context || ino == NULLFSINO) { > - *ipp = NULL; > - return 0; > - } > - ASSERT(*ialloc_context == NULL); > + *ipp = NULL; > > /* > * Protect against obviously corrupt allocation btree records. Later > @@ -837,12 +797,10 @@ xfs_ialloc( > } > > /* > - * Get the in-core inode with the lock held exclusively. > - * This is because we're setting fields here we need > - * to prevent others from looking at until we're done. > + * Get the in-core inode with the lock held exclusively to prevent > + * others from looking at until we're done. > */ > - error = xfs_iget(mp, tp, ino, XFS_IGET_CREATE, > - XFS_ILOCK_EXCL, &ip); > + error = xfs_iget(mp, tp, ino, XFS_IGET_CREATE, XFS_ILOCK_EXCL, &ip); > if (error) > return error; > ASSERT(ip != NULL); > @@ -932,142 +890,57 @@ xfs_ialloc( > } > > /* > - * Allocates a new inode from disk and return a pointer to the > - * incore copy. This routine will internally commit the current > - * transaction and allocate a new one if the Space Manager needed > - * to do an allocation to replenish the inode free-list. > - * > - * This routine is designed to be called from xfs_create and > - * xfs_create_dir. > + * Allocates a new inode from disk and return a pointer to the incore copy. This > + * routine will internally commit the current transaction and allocate a new one > + * if we needed to allocate more on-disk free inodes to perform the requested > + * operation. > * > + * If we are allocating quota inodes, we do not have a parent inode to attach to > + * or associate with (i.e. dp == NULL) because they are not linked into the > + * directory structure - they are attached directly to the superblock - and so > + * have no parent. > */ > int > xfs_dir_ialloc( > - xfs_trans_t **tpp, /* input: current transaction; > - output: may be a new transaction. */ > - xfs_inode_t *dp, /* directory within whose allocate > - the inode. */ > - umode_t mode, > - xfs_nlink_t nlink, > - dev_t rdev, > - prid_t prid, /* project id */ > - xfs_inode_t **ipp) /* pointer to inode; it will be > - locked. */ > -{ > - xfs_trans_t *tp; > - xfs_inode_t *ip; > - xfs_buf_t *ialloc_context = NULL; > - int code; > - void *dqinfo; > - uint tflags; > - > - tp = *tpp; > - ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); > + struct xfs_trans **tpp, > + struct xfs_inode *dp, > + umode_t mode, > + xfs_nlink_t nlink, > + dev_t rdev, > + prid_t prid, > + struct xfs_inode **ipp) > +{ > + struct xfs_buf *agibp = NULL; > + xfs_ino_t pino = dp ? dp->i_ino : 0; > + xfs_ino_t ino; > + int error; > > - /* > - * xfs_ialloc will return a pointer to an incore inode if > - * the Space Manager has an available inode on the free > - * list. Otherwise, it will do an allocation and replenish > - * the freelist. Since we can only do one allocation per > - * transaction without deadlocks, we will need to commit the > - * current transaction and start a new one. We will then > - * need to call xfs_ialloc again to get the inode. > - * > - * If xfs_ialloc did an allocation to replenish the freelist, > - * it returns the bp containing the head of the freelist as > - * ialloc_context. We will hold a lock on it across the > - * transaction commit so that no other process can steal > - * the inode(s) that we've just allocated. > - */ > - code = xfs_ialloc(tp, dp, mode, nlink, rdev, prid, &ialloc_context, > - &ip); > + ASSERT(*tpp->t_flags & XFS_TRANS_PERM_LOG_RES); > > /* > - * Return an error if we were unable to allocate a new inode. > - * This should only happen if we run out of space on disk or > - * encounter a disk error. > + * Call the space management code to pick the on-disk inode to be > + * allocated. > */ > - if (code) { > - *ipp = NULL; > - return code; > - } > - if (!ialloc_context && !ip) { > + error = xfs_dialloc_select_ag(tpp, pino, mode, &agibp); > + if (error) > + return error; > + if (!agibp) { > *ipp = NULL; > - return -ENOSPC; > + return 0; > } > > - /* > - * If the AGI buffer is non-NULL, then we were unable to get an > - * inode in one operation. We need to commit the current > - * transaction and call xfs_ialloc() again. It is guaranteed > - * to succeed the second time. > - */ > - if (ialloc_context) { > - /* > - * Normally, xfs_trans_commit releases all the locks. > - * We call bhold to hang on to the ialloc_context across > - * the commit. Holding this buffer prevents any other > - * processes from doing any allocations in this > - * allocation group. > - */ > - xfs_trans_bhold(tp, ialloc_context); > - > - /* > - * We want the quota changes to be associated with the next > - * transaction, NOT this one. So, detach the dqinfo from this > - * and attach it to the next transaction. > - */ > - dqinfo = NULL; > - tflags = 0; > - if (tp->t_dqinfo) { > - dqinfo = (void *)tp->t_dqinfo; > - tp->t_dqinfo = NULL; > - tflags = tp->t_flags & XFS_TRANS_DQ_DIRTY; > - tp->t_flags &= ~(XFS_TRANS_DQ_DIRTY); > - } > - > - code = xfs_trans_roll(&tp); > - > - /* > - * Re-attach the quota info that we detached from prev trx. > - */ > - if (dqinfo) { > - tp->t_dqinfo = dqinfo; > - tp->t_flags |= tflags; > - } > - > - if (code) { > - xfs_buf_relse(ialloc_context); > - *tpp = tp; > - *ipp = NULL; > - return code; > - } > - xfs_trans_bjoin(tp, ialloc_context); > - > - /* > - * Call ialloc again. Since we've locked out all > - * other allocations in this allocation group, > - * this call should always succeed. > - */ > - code = xfs_ialloc(tp, dp, mode, nlink, rdev, prid, > - &ialloc_context, &ip); > - > - /* > - * If we get an error at this point, return to the caller > - * so that the current transaction can be aborted. > - */ > - if (code) { > - *tpp = tp; > - *ipp = NULL; > - return code; > - } > - ASSERT(!ialloc_context && ip); > + /* Allocate an inode from the selected AG */ > + error = xfs_dialloc_ag(*tpp, agibp, pino, &ino); > + if (error) > + return error; > + ASSERT(ino != NULLFSINO); > > + /* Initialise the newly allocated inode. */ > + error = xfs_ialloc(*tpp, dp, mode, nlink, rdev, prid, ipp); > + if (error) { > + *ipp = NULL; > + return error; > } > - > - *ipp = ip; > - *tpp = tp; > - > return 0; > } > >