Re: [patch 3/4 v3]swap: make swap discard async

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Wed, 20 Mar 2013 13:51:31 -0700

On Thu, 21 Feb 2013 10:18:00 +0800 Shaohua Li <shli@xxxxxxxxxx> wrote:

> swap can do cluster discard for SSD, which is good, but there are some problems
> here:
> 1. swap do the discard just before page reclaim gets a swap entry and writes
> the disk sectors. This is useless for high end SSD, because an overwrite to a
> sector implies a discard to original nand flash too. A discard + overwrite ==
> overwrite.
> 2. the purpose of doing discard is to improve SSD firmware garbage collection.
> Doing discard just before write doesn't help, because the interval between
> discard and write is too short. Doing discard async and just after a swap entry
> is freed can make the interval longer, so SSD firmware has more time to do gc.
> 3. block discard is a sync API, which will delay scan_swap_map() significantly.
> 4. Write and discard command can be executed parallel in PCIe SSD. Making
> swap discard async can make execution more efficiently.
> 
> This patch makes swap discard async, and move discard to where swap entry is
> freed. Idealy we should do discard for any freed sectors, but some SSD discard
> is very slow. This patch still does discard for a whole cluster. 
> 
> My test does a several round of 'mmap, write, unmap', which will trigger a lot
> of swap discard. In a fusionio card, with this patch, the test runtime is
> reduced to 18% of the time without it, so around 5.5x faster.
> 
>  
> @@ -202,6 +196,71 @@ static int wait_for_discard(void *word)
>  	do { info = (cluster_flag(info) << 24) | (n); } while (0)
>  #define cluster_is_free(info) (cluster_flag(info) & CLUSTER_FLAG_FREE)
>  
> +static int swap_cluster_check_discard(struct swap_info_struct *si,
> +		unsigned int idx)

This is a poor name.  Check what?  And return what value if the
unspecified-thing-we're-checking meets the unspecified conditions?

Consider this simplified example:

	bool check_temperature(...);

versus

	bool temperature_too_high(...);

The latter is a better name, because it tells the reader what the
function's return value means.

The fact that both swap_cluster_check_discard() and its caller are
totally uncommented just worsens things.

> +{
> +
> +	if (!(si->flags & SWP_DISCARDABLE))
> +		return 0;
> +	/*
> +	 * If scan_swap_map() can't find a free cluster, it will check
> +	 * si->swap_map directly. To make sure discarding cluster isn't taken,

"discarding cluster isn't taken" is unclear.  Please rephrase that?

> +	 * mark the swap entries bad (occupied). It will be cleared after
> +	 * discard
> +	 */
> +	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> +			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> +
> +	if (si->discard_cluster_head == CLUSTER_NULL) {
> +		si->discard_cluster_head = idx;
> +		si->discard_cluster_tail = idx;
> +	} else {
> +		cluster_set_next(si->cluster_info[si->discard_cluster_tail],
> +			idx);
> +		si->discard_cluster_tail = idx;
> +	}
> +
> +	schedule_work(&si->discard_work);
> +	return 1;
> +}
> +
> +static void swap_discard_work(struct work_struct *work)
> +{
> +	struct swap_info_struct *si = container_of(work,
> +		struct swap_info_struct, discard_work);
> +	unsigned int *info = si->cluster_info;
> +	unsigned int idx;

This is a nicer layout:

	struct swap_info_struct *si;
	unsigned int *info;
	unsigned int idx;

	si = container_of(work, struct swap_info_struct, discard_work);
	info = si->cluster_info;

> +	spin_lock(&si->lock);
> +	while (si->discard_cluster_head != CLUSTER_NULL) {
> +		idx = si->discard_cluster_head;
> +
> +		si->discard_cluster_head = cluster_next(info[idx]);
> +		if (si->discard_cluster_tail == idx) {
> +			si->discard_cluster_tail = CLUSTER_NULL;
> +			si->discard_cluster_head = CLUSTER_NULL;
> +		}
> +		spin_unlock(&si->lock);
> +
> +		discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> +				SWAPFILE_CLUSTER);
> +
> +		spin_lock(&si->lock);
> +		cluster_set_flag(info[idx], CLUSTER_FLAG_FREE);
> +		if (si->free_cluster_head == CLUSTER_NULL) {
> +			si->free_cluster_head = idx;
> +			si->free_cluster_tail = idx;
> +		} else {
> +			cluster_set_next(info[si->free_cluster_tail], idx);
> +			si->free_cluster_tail = idx;
> +		}
> +		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> +				0, SWAPFILE_CLUSTER);
> +	}
> +
> +	spin_unlock(&si->lock);
> +}
> +
>  static inline void inc_cluster_info_page(struct swap_info_struct *p,
>  	unsigned int *cluster_info, unsigned long page_nr)
>  {
> @@ -238,6 +297,9 @@ static inline void dec_cluster_info_page
>  		cluster_count(cluster_info[idx]) - 1);
>  
>  	if (cluster_count(cluster_info[idx]) == 0) {
> +		if (swap_cluster_check_discard(p, idx))
> +			return;

Please add a comment.  Explain why this function returns under these
conditions.

>  		cluster_set_flag(cluster_info[idx], CLUSTER_FLAG_FREE);
>  		if (p->free_cluster_head == CLUSTER_NULL) {
>  			p->free_cluster_head = idx;
> @@ -270,7 +332,6 @@ static unsigned long scan_swap_map(struc
>  	unsigned long scan_base;
>  	unsigned long last_in_cluster = 0;
>  	int latency_ration = LATENCY_LIMIT;
> -	int found_free_cluster = 0;
>  
>  	/*
>  	 * We try to cluster swap pages by allocating them sequentially
> @@ -291,34 +352,28 @@ static unsigned long scan_swap_map(struc
>  			si->cluster_nr = SWAPFILE_CLUSTER - 1;
>  			goto checks;
>  		}
> -		if (si->flags & SWP_DISCARDABLE) {
> -			/*
> -			 * Start range check on racing allocations, in case
> -			 * they overlap the cluster we eventually decide on
> -			 * (we scan without swap_lock to allow preemption).
> -			 * It's hardly conceivable that cluster_nr could be
> -			 * wrapped during our scan, but don't depend on it.
> -			 */
> -			if (si->lowest_alloc)
> -				goto checks;
> -			si->lowest_alloc = si->max;
> -			si->highest_alloc = 0;
> -		}
>  check_cluster:
>  		if (si->free_cluster_head != CLUSTER_NULL) {
>  			offset = si->free_cluster_head * SWAPFILE_CLUSTER;
>  			last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
>  			si->cluster_next = offset;
>  			si->cluster_nr = SWAPFILE_CLUSTER - 1;
> -			found_free_cluster = 1;
>  			goto checks;
>  		} else if (si->cluster_info) {
> +			if (si->discard_cluster_head != CLUSTER_NULL) {

Add a comment here explaining what's going on.

> +				spin_unlock(&si->lock);
> +				schedule_work(&si->discard_work);
> +				flush_work(&si->discard_work);

That's a slow way of calling swap_discard work().  Can't we just call
it directly, after a bit of refactoring?

> +				spin_lock(&si->lock);
> +				goto check_cluster;
> +			}
> +
>  			/*
>  			 * Checking free cluster is fast enough, we can do the
>  			 * check every time
>  			 */
>  			si->cluster_nr = 0;
> -			si->lowest_alloc = 0;
>  			goto checks;
>  		}
>
> ...
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>