Re: [PATCH] ext4: journalled quota speed-up

Jan Kara <jack@xxxxxxx> writes:

>   Hi,
>
> On Thu 17-12-09 04:15:34, Dmitry Monakhov wrote:
>>  Currently each quota modification results in write_dquot()
>>  and later dquot_commit().  This means that each quota modification
>>  function must wait for dqio_mutex, which is a *huge* performance
>>  penalty on big SMP systems. AFAIU, the core idea of this implementation
>>  is to guarantee that each quota modification is written to the journal
>>  atomically. But in fact this is not always true, because the dquot may
>>  be changed again after a modification, but before it is committed to disk.
>> 
>>  | Task 1                           | Task 2                      |
>>  | alloc_space(nr)                  |                             |
>>  | ->spin_lock(dq_data_lock)        |                             |
>>  | ->curspace += nr                 |                             |
>>  | ->spin_unlock(dq_data_lock)      |                             |
>>  | ->mark_dirty()                   | free_space(nr)              |
>>  | -->write_dquot()                 | ->spin_lock(dq_data_lock)   |
>>  | --->dquot_commit()               | ->curspace -= nr            |
>>  | ---->commit_dqblk()              | ->spin_unlock(dq_data_lock) |
>>  | ----->spin_lock(dq_data_lock)    |                             |
>>  | ----->mem2disk_dqblk(ddq, dquot) | <<< Copy updated value      |
>>  | ----->spin_unlock(dq_data_lock)  |                             |
>>  | ----->quota_write()              |                             |
>> Quota corruption does not happen only because the quota modification
>> caller has already started a journal transaction, and ext3/4 allow
>> only one running transaction at a time.
>   It was actually designed to work this way. Ext3/4 handles, for
> example, inodes like this as well... You're right that it might be good
> to document this fact, as in the future someone might come up with a
> filesystem with a different scheme.
>
>> Let's exploit this fact and avoid writing the quota each time, acting
>> similarly to the dirty-for-io logic in the generic writeback path of
>> the page cache. If we find that another task has already started
>> copying and writing the dquot, we skip the actual quota_write stage
>> and let that task do the job.
>> Side effect: a task which skips the real quota_write() will not see an
>> error (if any). But this is no big deal because:
>>  1) Any error has global consequences (RO remount, error printout, etc).
>>  2) Real IO is deferred till the journal commit.
> ...
>> +	if (!__dquot_mark_dquot_dirty(dquot))
>> +		ret = dquot_commit(dquot);
>   Hmm, but dquot_commit() clears the dirty bit just after acquiring
> dqio_mutex, so if I get it right, you save something only if someone is
> already contending on dqio_mutex. So isn't this just working around
> contention on that mutex (and in a not very nice way)? For the case
> where the dquot is already active (DQ_ACTIVE_B set), we can certainly
> make the locking much more lightweight - essentially just a per-dquot
> lock for the time until we copy the data from the dquot to the buffer_head...
Yes, this patch aims only to decrease contention on dqio_mutex,
because it is the biggest performance consumer. The patch is not
complete; it is just a proof of concept. According to my measurements,
quota overhead scales roughly as (1 / nr_concurrent_tasks), which is
still not perfect. And of course, avoiding dqio_mutex entirely would be
really good.
>   Also we could make writing the dquot more lightweight (have an ext4
> specific part of the dquot and cache there the buffer head which
> carries the dquot). Even more, we could track whether the dquot is
> already part of the transaction and save the get_write_access etc. for
> that case (but that already starts to be ugly)...
While thinking about this, I came to the idea of introducing a callback
which dquot_write() may attach to the journal, to be called at
the last journal_stop() :)
But your idea about BH tracking is much better.
>
>> +	else
>> +		/*
>> +		 * Dquot was already dirty. This means that another task has
>> +		 * already started a transaction but has not cleared the dirty
>> +		 * bit yet (see dquot_commit). Since only one running
>> +		 * transaction is possible at a time, that task belongs to the
>> +		 * same transaction. We don't have to actually write the dquot
>> +		 * changes because that task will write them for us.
>> +		 */
>> +		ret = 0;
>> +	err = ext4_journal_stop(handle);
>> +	if (!ret)
>> +		ret = err;
>> +	return ret;
>>  }
>>  
>>  static int ext4_write_info(struct super_block *sb, int type)
>> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
>> index 6979224..e03eea0 100644
>> --- a/fs/quota/dquot.c
>> +++ b/fs/quota/dquot.c
>> @@ -311,14 +311,24 @@ static inline int mark_dquot_dirty(struct dquot *dquot)
>>  {
>>  	return dquot->dq_sb->dq_op->mark_dirty(dquot);
>>  }
>> -
>> -int dquot_mark_dquot_dirty(struct dquot *dquot)
>> +/* Mark dquot dirty in an atomic manner, and return the old dirty state */
>> +int __dquot_mark_dquot_dirty(struct dquot *dquot)
>>  {
>> +	int ret = 1;
>>  	spin_lock(&dq_list_lock);
>> -	if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags))
>> +	if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags)) {
>>  		list_add(&dquot->dq_dirty, &sb_dqopt(dquot->dq_sb)->
>>  				info[dquot->dq_type].dqi_dirty_list);
>> +		ret = 0;
>> +	}
>>  	spin_unlock(&dq_list_lock);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(__dquot_mark_dquot_dirty);
>> +
>> +int dquot_mark_dquot_dirty(struct dquot *dquot)
>> +{
>> +	__dquot_mark_dquot_dirty(dquot);
>>  	return 0;
>>  }
>   Please just do the change do dquot_mark_dquot_dirty(). There's no need
> for the new function.
>
> 								Honza
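Ok - folded into dquot_mark_dquot_dirty() itself as you suggest, the
change would look something like this (untested sketch, with the
function now returning the old dirty state):

```c
int dquot_mark_dquot_dirty(struct dquot *dquot)
{
	int ret = 1;

	spin_lock(&dq_list_lock);
	if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags)) {
		list_add(&dquot->dq_dirty, &sb_dqopt(dquot->dq_sb)->
				info[dquot->dq_type].dqi_dirty_list);
		ret = 0;
	}
	spin_unlock(&dq_list_lock);
	return ret;
}
```

I'll rework the patch along these lines and resend.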