Dmitry Monakhov <dmonakhov@xxxxxxxxxx> writes: Sorry forgot to add linux-ext4@xxxxxxxxxxxxxxx list to CC > Currently each quota modification result in write_dquot() > and later dquot_commit(). This means what each quota modification > function must wait for dqio_mutex. Which is *huge* performance > penalty on big SMP systems. ASAIU The core idea of this implementation > is to guarantee that each quota modification will be written to journal > atomically. But in fact this is not always true, because dquot may be > changed after dquot modification, but before it was committed to disk. > > | Task 1 | Task 2 | > | alloc_space(nr) | | > | ->spin_lock(dq_data_lock) | | > | ->curspace += nr | | > | ->spin_unlock(dq_data_lock) | | > | ->mark_dirty() | free_space(nr) | > | -->write_dquot() | ->spin_lock(dq_data_lock) | > | --->dquot_commit() | ->curspace -= nr | > | ---->commit_dqblk() | ->spin_unlock(dq_data_lock) | > | ----->spin_lock(dq_data_lock) | | > | ----->mem2disk_dqblk(ddq, dquot) | <<< Copy updated value | > | ----->spin_unlock(dq_data_lock) | | > | ----->quota_write() | | > Quota corruption not happens only because quota modification caller > started journal already. And ext3/4 allow only one running quota > at a time. Let's exploit this fact and avoid writing quota each time. > Act similar to dirty_for_io in general write_back path in page-cache. > If we have found what other task already started on copy and write the > dquot then we skip actual quota_write stage. And let that task do the job. > Side effect: Task which skip real quota_write() will not get an error > (if any). But this is not big deal because: > 1) Any error has global consequences (RO_remount, err_print, etc). > 2) Real IO is differed till the journall_commit. > > Signed-off-by: Dmitry Monakhov <dmonakhov@xxxxxxxxxx> > --- > fs/ext4/super.c | 37 ++++++++++++++++++++++++++++++------- > fs/quota/dquot.c | 16 +++++++++++++--- > include/linux/quotaops.h | 1 + > 3 files changed, 44 insertions(+), 10 deletions(-) > > diff --git a/fs/ext4/super.c b/fs/ext4/super.c > index 3929091..164217e 100644 > --- a/fs/ext4/super.c > +++ b/fs/ext4/super.c > @@ -3777,14 +3777,37 @@ static int ext4_release_dquot(struct dquot *dquot) > > static int ext4_mark_dquot_dirty(struct dquot *dquot) > { > - /* Are we journaling quotas? */ > - if (EXT4_SB(dquot->dq_sb)->s_qf_names[USRQUOTA] || > - EXT4_SB(dquot->dq_sb)->s_qf_names[GRPQUOTA]) { > - dquot_mark_dquot_dirty(dquot); > - return ext4_write_dquot(dquot); > - } else { > + int ret, err; > + handle_t *handle; > + struct inode *inode; > + > + /* Are we not journaling quotas? */ > + if (!EXT4_SB(dquot->dq_sb)->s_qf_names[USRQUOTA] && > + !EXT4_SB(dquot->dq_sb)->s_qf_names[GRPQUOTA]) > return dquot_mark_dquot_dirty(dquot); > - } > + > + /* journaling quotas case */ > + inode = dquot_to_inode(dquot); > + handle = ext4_journal_start(inode, > + EXT4_QUOTA_TRANS_BLOCKS(dquot->dq_sb)); > + if (IS_ERR(handle)) > + return PTR_ERR(handle); > + if (!__dquot_mark_dquot_dirty(dquot)) > + ret = dquot_commit(dquot); > + else > + /* > + * Dquot was already dirty. This means that other task already > + * started a transaction but not clear dirty bit yet (see > + * dquot_commit). Since the only one running transaction is > + * possible at a time. Then that task belongs to the same > + * transaction. We don'n have to actually write dquot changes > + * because that task will write it for for us. > + */ > + ret = 0; > + err = ext4_journal_stop(handle); > + if (!ret) > + ret = err; > + return ret; > } > > static int ext4_write_info(struct super_block *sb, int type) > diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c > index 6979224..e03eea0 100644 > --- a/fs/quota/dquot.c > +++ b/fs/quota/dquot.c > @@ -311,14 +311,24 @@ static inline int mark_dquot_dirty(struct dquot *dquot) > { > return dquot->dq_sb->dq_op->mark_dirty(dquot); > } > - > -int dquot_mark_dquot_dirty(struct dquot *dquot) > +/* mark dquot dirty in atomic meaner, and return old dirty state */ > +int __dquot_mark_dquot_dirty(struct dquot *dquot) > { > + int ret = 1; > spin_lock(&dq_list_lock); > - if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags)) > + if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags)) { > list_add(&dquot->dq_dirty, &sb_dqopt(dquot->dq_sb)-> > info[dquot->dq_type].dqi_dirty_list); > + ret = 0; > + } > spin_unlock(&dq_list_lock); > + return ret; > +} > +EXPORT_SYMBOL(__dquot_mark_dquot_dirty); > + > +int dquot_mark_dquot_dirty(struct dquot *dquot) > +{ > + __dquot_mark_dquot_dirty(dquot); > return 0; > } > EXPORT_SYMBOL(dquot_mark_dquot_dirty); > diff --git a/include/linux/quotaops.h b/include/linux/quotaops.h > index ae7ef25..c950f7d 100644 > --- a/include/linux/quotaops.h > +++ b/include/linux/quotaops.h > @@ -56,6 +56,7 @@ int dquot_commit(struct dquot *dquot); > int dquot_acquire(struct dquot *dquot); > int dquot_release(struct dquot *dquot); > int dquot_commit_info(struct super_block *sb, int type); > +int __dquot_mark_dquot_dirty(struct dquot *dquot); > int dquot_mark_dquot_dirty(struct dquot *dquot); > > int vfs_quota_on(struct super_block *sb, int type, int format_id, -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html