On Tue, Sep 09 2008, Hugh Dickins wrote:
> Hi David,
>
> I notice 2.6.27-rc5-mm1 and next trees include "discard" support,
> so far being used only by fat.  Is that really intended for 2.6.28?
>
> Here's a proposed patch to use discard on swap.  But I know nothing of
> the tradeoffs, nor what really goes on in such a device, and have none
> to try, so this patch may be wide of the mark.
>
> I started off with the discard_swap() in sys_swapon(): it seemed obvious
> to discard all the old swap contents (except the swap header) at that
> point (though it would be unfortunate if someone mistakenly does a fresh
> boot of a kernel when they mean to resume from swap disk: discarding
> would remove the option to back out before swap gets written).

I like the idea; it's certainly within the expected use of this.

> But I've no idea how long that is liable to take on a device which really
> supports discard: discarding an entire partition all at once sounds good
> for giving the device the greatest freedom to reorganize itself,
> but would that happen quickly enough?

That, I think, is something we will still have to wait and see about!  I
don't know what the timings on this are; presumably the device firmware
could go about its business in the background and let the command complete
quickly.  These devices do a lot of background work anyway, so I'd be
surprised if it wasn't quick.  And if discard is meant for filesystem
usage, it needs to be quick as well.

> To do that (and to avoid duplicating the same loop within swapfile.c),
> I'd like to change the nr_sects argument to blkdev_issue_discard() from
> unsigned to sector_t - unsigned long would be large enough for swap, but
> sector_t makes more general sense.  And if that change is made, it's
> probably right to change sb_issue_discard()'s nr_sects to match?

I guess that is needed if we are going to pass full device sizes in there,
so that's fine with me.

> I also got worried that blkdev_issue_discard() might get stuck in its
> loop for a very long time, requests getting freed from the queue before
> it quite fills up: so inserted an occasional cond_resched().

It needs experimentation, but yes, I agree with the notion of sprinkling
some cond_resched() calls in there.

> It seems odd to me that the data-less blkdev_issue_discard() is limited
> at all by max_hw_sectors; but I'm guessing there's a good reason, safety
> perhaps, which has forced you to that.

The discard request needs to be turned into a hardware command at some
point, and for that we still need to fit the offset and size in there.  So
we are still limited to 32MB commands on SATA with LBA48, even though we
are not moving any data.  Suboptimal, but...

> Where else should swap be discarded?  I don't want to add anything into
> the swap freeing path, and the locking there would be problematic.  And
> I take it that doing a discard of each single page just before rewriting
> it would just be a totally pointless waste of time.
>
> But the swap allocation algorithm in scan_swap_map() does already try
> to locate a free cluster of swap pages (256 of them, not tunable today
> but could easily be: 1MB on x86).  So I think it makes sense to discard
> the old swap when we find a free cluster (and forget about discarding
> when we don't find one).  That's what I've implemented - but then,
> does it still make any sense also to discard all of swap at swapon?

I think so - it's akin to mkfs doing a full discard before making the
file system, while the file system should still notify the device whenever
it later frees a block.  It's a balance, of course: don't bother if you
are about to reuse the block again immediately.
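
(As an aside, the page-to-sector conversion the patch leans on throughout
is just a shift by PAGE_SHIFT - 9, since blkdev_issue_discard() deals in
512-byte sectors while swap deals in pages.  A throwaway userspace sketch
of the arithmetic for one free cluster - assuming 4K pages as on x86, and
purely illustrative:

#include <stdio.h>

#define PAGE_SHIFT              12      /* assuming 4K pages, as on x86 */
#define SWAPFILE_CLUSTER        256     /* pages per cluster, as in swapfile.c */

int main(void)
{
        unsigned long start_page = 1;   /* e.g. skipping the swap header page */
        unsigned long start_block = start_page << (PAGE_SHIFT - 9);
        unsigned long nr_blocks =
                (unsigned long)SWAPFILE_CLUSTER << (PAGE_SHIFT - 9);

        /* 256 pages * 4K = 1MB, i.e. 2048 sectors of 512 bytes */
        printf("discard %lu sectors starting at sector %lu\n",
               nr_blocks, start_block);
        return 0;
}

So a cluster discard is a 1MB command on x86, comfortably below the 32MB
per-command limit mentioned above, whereas the swapon-time discard of a
whole partition gets chopped into many such commands by the loop in
blkdev_issue_discard().)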

> I think that discarding when allocating swap cannot safely use GFP_KERNEL
> for its bio_alloc()s: swap_writepage() uses GFP_NOIO, so should be good.
> That involves an added gfp_t arg to blkdev_issue_discard() - unless I
> copy it to make a swap_issue_discard(), which seems a bad idea.

Agreed, you need GFP_NOIO there.  So passing in the gfp_t, as the patch
does, is just fine; in fact everything but the ioctl path needs that.

> Here's the proposed patch, or combination of patches: the blkdev and
> swap parts should certainly be separated.  Advice welcome - thanks!

I'll snatch up the blk bits and put them in for-2.6.28.  OK if I add your
SOB to that?
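
Before the patch itself, one way to picture how the two call sites differ
under the proposed prototype - purely an illustrative sketch, with
discard_range() being a made-up helper (the real callers are in the patch
below):

#include <linux/blkdev.h>
#include <linux/types.h>

/*
 * Illustrative only: the two gfp flavours a caller of the proposed
 * blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask) would use.
 */
static int discard_range(struct block_device *bdev, sector_t start_block,
                         sector_t nr_blocks, bool swap_alloc_path)
{
        /*
         * The swapon-time discard runs in ordinary process context, so
         * GFP_KERNEL is fine there; the cluster discard in scan_swap_map()
         * can run while we are trying to free memory, so it must not
         * recurse into I/O: GFP_NOIO.
         */
        gfp_t gfp = swap_alloc_path ? GFP_NOIO : GFP_KERNEL;

        return blkdev_issue_discard(bdev, start_block, nr_blocks, gfp);
}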

>
> Not-yet-Signed-off-by: Hugh Dickins <hugh@xxxxxxxxxxx>
> ---
>
>  block/blk-barrier.c    |   11 +-
>  include/linux/blkdev.h |    9 +-
>  include/linux/swap.h   |   15 ++-
>  mm/swapfile.c          |  168 +++++++++++++++++++++++++++++++++++++--
>  4 files changed, 183 insertions(+), 20 deletions(-)
>
> --- 2.6.27-rc5-mm1/block/blk-barrier.c  2008-09-05 10:05:33.000000000 +0100
> +++ linux/block/blk-barrier.c           2008-09-09 20:00:26.000000000 +0100
> @@ -332,15 +332,17 @@ static void blkdev_discard_end_io(struct
>   * @bdev:       blockdev to issue discard for
>   * @sector:     start sector
>   * @nr_sects:   number of sectors to discard
> + * @gfp_mask:   memory allocation flags (for bio_alloc)
>   *
>   * Description:
>   *    Issue a discard request for the sectors in question. Does not wait.
>   */
> -int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> -                        unsigned nr_sects)
> +int blkdev_issue_discard(struct block_device *bdev,
> +                        sector_t sector, sector_t nr_sects, gfp_t gfp_mask)
>  {
>         struct request_queue *q;
>         struct bio *bio;
> +       unsigned int iters = 0;
>         int ret = 0;
>
>         if (bdev->bd_disk == NULL)
> @@ -354,7 +356,10 @@ int blkdev_issue_discard(struct block_de
>                 return -EOPNOTSUPP;
>
>         while (nr_sects && !ret) {
> -               bio = bio_alloc(GFP_KERNEL, 0);
> +               if ((++iters & 7) == 0)
> +                       cond_resched();
> +
> +               bio = bio_alloc(gfp_mask, 0);
>                 if (!bio)
>                         return -ENOMEM;
>
> --- 2.6.27-rc5-mm1/include/linux/blkdev.h       2008-09-05 10:05:51.000000000 +0100
> +++ linux/include/linux/blkdev.h                2008-09-09 20:00:26.000000000 +0100
> @@ -16,6 +16,7 @@
>  #include <linux/bio.h>
>  #include <linux/module.h>
>  #include <linux/stringify.h>
> +#include <linux/gfp.h>
>  #include <linux/bsg.h>
>  #include <linux/smp.h>
>
> @@ -875,15 +876,15 @@ static inline struct request *blk_map_qu
>  }
>
>  extern int blkdev_issue_flush(struct block_device *, sector_t *);
> -extern int blkdev_issue_discard(struct block_device *, sector_t sector,
> -                               unsigned nr_sects);
> +extern int blkdev_issue_discard(struct block_device *,
> +                               sector_t sector, sector_t nr_sects, gfp_t);
>
>  static inline int sb_issue_discard(struct super_block *sb,
> -                                  sector_t block, unsigned nr_blocks)
> +                                  sector_t block, sector_t nr_blocks)
>  {
>         block <<= (sb->s_blocksize_bits - 9);
>         nr_blocks <<= (sb->s_blocksize_bits - 9);
> -       return blkdev_issue_discard(sb->s_bdev, block, nr_blocks);
> +       return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_KERNEL);
>  }
>
>  /*
> --- 2.6.27-rc5-mm1/include/linux/swap.h 2008-09-05 10:05:52.000000000 +0100
> +++ linux/include/linux/swap.h          2008-09-09 20:00:26.000000000 +0100
> @@ -119,8 +119,9 @@ struct swap_extent {
>
>  enum {
>         SWP_USED        = (1 << 0),     /* is slot in swap_info[] used? */
> -       SWP_WRITEOK     = (1 << 1),     /* ok to write to this swap?    */
> -       SWP_ACTIVE      = (SWP_USED | SWP_WRITEOK),
> +       SWP_WRITEOK     = (1 << 1),     /* ok to write to this swap? */
> +       SWP_DISCARDABLE = (1 << 2),     /* blkdev supports discard */
> +       SWP_DISCARDING  = (1 << 3),     /* now discarding a free cluster */
>                                         /* add others here before... */
>         SWP_SCANNING    = (1 << 8),     /* refcount in scan_swap_map */
>  };
> @@ -134,22 +135,24 @@ enum {
>   * The in-memory structure used to track swap areas.
>   */
>  struct swap_info_struct {
> -       unsigned int flags;
> +       unsigned long flags;
>         int prio;                       /* swap priority */
> +       int next;                       /* next entry on swap list */
>         struct file *swap_file;
>         struct block_device *bdev;
>         struct list_head extent_list;
>         struct swap_extent *curr_swap_extent;
> -       unsigned old_block_size;
> -       unsigned short * swap_map;
> +       unsigned short *swap_map;
>         unsigned int lowest_bit;
>         unsigned int highest_bit;
> +       unsigned int lowest_alloc;
> +       unsigned int highest_alloc;
>         unsigned int cluster_next;
>         unsigned int cluster_nr;
>         unsigned int pages;
>         unsigned int max;
>         unsigned int inuse_pages;
> -       int next;                       /* next entry on swap list */
> +       unsigned int old_block_size;
>  };
>
>  struct swap_list_t {
> --- 2.6.27-rc5-mm1/mm/swapfile.c        2008-09-05 10:05:54.000000000 +0100
> +++ linux/mm/swapfile.c                 2008-09-09 20:00:26.000000000 +0100
> @@ -83,13 +83,93 @@ void swap_unplug_io_fn(struct backing_de
>         up_read(&swap_unplug_sem);
>  }
>
> +/*
> + * swapon tell device that all the old swap contents can be discarded,
> + * to allow the swap device to optimize its wear-levelling.
> + */
> +static int discard_swap(struct swap_info_struct *si)
> +{
> +       struct swap_extent *se;
> +       int err = 0;
> +
> +       list_for_each_entry(se, &si->extent_list, list) {
> +               sector_t start_block = se->start_block << (PAGE_SHIFT - 9);
> +               pgoff_t nr_blocks = se->nr_pages << (PAGE_SHIFT - 9);
> +
> +               if (se->start_page == 0) {
> +                       /* Do not discard the swap header page! */
> +                       start_block += 1 << (PAGE_SHIFT - 9);
> +                       nr_blocks -= 1 << (PAGE_SHIFT - 9);
> +                       if (!nr_blocks)
> +                               continue;
> +               }
> +
> +               err = blkdev_issue_discard(si->bdev, start_block,
> +                                               nr_blocks, GFP_KERNEL);
> +               if (err)
> +                       break;
> +
> +               cond_resched();
> +       }
> +       return err;             /* That will often be -EOPNOTSUPP */
> +}
> +
> +/*
> + * swap allocation tell device that a cluster of swap can now be discarded,
> + * to allow the swap device to optimize its wear-levelling.
> + */
> +static void discard_swap_cluster(struct swap_info_struct *si,
> +                                pgoff_t start_page, pgoff_t nr_pages)
> +{
> +       struct swap_extent *se = si->curr_swap_extent;
> +       int found_extent = 0;
> +
> +       while (nr_pages) {
> +               struct list_head *lh;
> +
> +               if (se->start_page <= start_page &&
> +                   start_page < se->start_page + se->nr_pages) {
> +                       pgoff_t offset = start_page - se->start_page;
> +                       sector_t start_block = se->start_block + offset;
> +                       pgoff_t nr_blocks = se->nr_pages - offset;
> +
> +                       if (nr_blocks > nr_pages)
> +                               nr_blocks = nr_pages;
> +                       start_page += nr_blocks;
> +                       nr_pages -= nr_blocks;
> +
> +                       if (!found_extent++)
> +                               si->curr_swap_extent = se;
> +
> +                       start_block <<= PAGE_SHIFT - 9;
> +                       nr_blocks <<= PAGE_SHIFT - 9;
> +                       if (blkdev_issue_discard(si->bdev, start_block,
> +                                                       nr_blocks, GFP_NOIO))
> +                               break;
> +               }
> +
> +               lh = se->list.next;
> +               if (lh == &si->extent_list)
> +                       lh = lh->next;
> +               se = list_entry(lh, struct swap_extent, list);
> +       }
> +}
> +
> +static int wait_for_discard(void *word)
> +{
> +       schedule();
> +       return 0;
> +}
> +
>  #define SWAPFILE_CLUSTER       256
>  #define LATENCY_LIMIT          256
>
>  static inline unsigned long scan_swap_map(struct swap_info_struct *si)
>  {
> -       unsigned long offset, last_in_cluster;
> +       unsigned long offset;
> +       unsigned long last_in_cluster = 0;
>         int latency_ration = LATENCY_LIMIT;
> +       int found_free_cluster = 0;
>
>         /*
>          * We try to cluster swap pages by allocating them sequentially
> @@ -102,10 +182,24 @@ static inline unsigned long scan_swap_ma
>          */
>
>         si->flags += SWP_SCANNING;
> -       if (unlikely(!si->cluster_nr)) {
> -               si->cluster_nr = SWAPFILE_CLUSTER - 1;
> -               if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER)
> +       if (unlikely(!si->cluster_nr--)) {
> +               if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
> +                       si->cluster_nr = SWAPFILE_CLUSTER - 1;
>                         goto lowest;
> +               }
> +               if (si->flags & SWP_DISCARDABLE) {
> +                       /*
> +                        * Start range check on racing allocations, in case
> +                        * they overlap the cluster we eventually decide on
> +                        * (we scan without swap_lock to allow preemption).
> +                        * It's hardly conceivable that cluster_nr could be
> +                        * wrapped during our scan, but don't depend on it.
> +                        */
> +                       if (si->lowest_alloc)
> +                               goto lowest;
> +                       si->lowest_alloc = si->max;
> +                       si->highest_alloc = 0;
> +               }
>                 spin_unlock(&swap_lock);
>
>                 offset = si->lowest_bit;
> @@ -118,6 +212,8 @@ static inline unsigned long scan_swap_ma
>                         else if (offset == last_in_cluster) {
>                                 spin_lock(&swap_lock);
>                                 si->cluster_next = offset-SWAPFILE_CLUSTER+1;
> +                               si->cluster_nr = SWAPFILE_CLUSTER - 1;
> +                               found_free_cluster = 1;
>                                 goto cluster;
>                         }
>                         if (unlikely(--latency_ration < 0)) {
> @@ -125,11 +221,13 @@ static inline unsigned long scan_swap_ma
>                                 latency_ration = LATENCY_LIMIT;
>                         }
>                 }
> +
>                 spin_lock(&swap_lock);
> +               si->cluster_nr = SWAPFILE_CLUSTER - 1;
> +               si->lowest_alloc = 0;
>                 goto lowest;
>         }
>
> -       si->cluster_nr--;
>  cluster:
>         offset = si->cluster_next;
>         if (offset > si->highest_bit)
> @@ -151,6 +249,60 @@ checks: if (!(si->flags & SWP_WRITEOK))
>         si->swap_map[offset] = 1;
>         si->cluster_next = offset + 1;
>         si->flags -= SWP_SCANNING;
> +
> +       if (si->lowest_alloc) {
> +               /*
> +                * Only set when SWP_DISCARDABLE, and there's a scan
> +                * for a free cluster in progress or just completed.
> +                */
> +               if (found_free_cluster) {
> +                       /*
> +                        * To optimize wear-levelling, discard the
> +                        * old data of the cluster, taking care not to
> +                        * discard any of its pages that have already
> +                        * been allocated by racing tasks (offset has
> +                        * already stepped over any at the beginning).
> +                        */
> +                       if (offset < si->highest_alloc &&
> +                           si->lowest_alloc <= last_in_cluster)
> +                               last_in_cluster = si->lowest_alloc - 1;
> +                       si->flags |= SWP_DISCARDING;
> +                       spin_unlock(&swap_lock);
> +
> +                       if (offset < last_in_cluster)
> +                               discard_swap_cluster(si, offset,
> +                                       last_in_cluster - offset + 1);
> +
> +                       spin_lock(&swap_lock);
> +                       si->lowest_alloc = 0;
> +                       si->flags &= ~SWP_DISCARDING;
> +
> +                       smp_mb();       /* wake_up_bit advises this */
> +                       wake_up_bit(&si->flags, ilog2(SWP_DISCARDING));
> +
> +               } else if (si->flags & SWP_DISCARDING) {
> +                       /*
> +                        * Delay using pages allocated by racing tasks
> +                        * until the whole discard has been issued. We
> +                        * could defer that delay until swap_writepage,
> +                        * but it's easier to keep this self-contained.
> +                        */
> +                       spin_unlock(&swap_lock);
> +                       wait_on_bit(&si->flags, ilog2(SWP_DISCARDING),
> +                               wait_for_discard, TASK_UNINTERRUPTIBLE);
> +                       spin_lock(&swap_lock);
> +               } else {
> +                       /*
> +                        * Note pages allocated by racing tasks while
> +                        * scan for a free cluster is in progress, so
> +                        * that its final discard can exclude them.
> +                        */
> +                       if (offset < si->lowest_alloc)
> +                               si->lowest_alloc = offset;
> +                       if (offset > si->highest_alloc)
> +                               si->highest_alloc = offset;
> +               }
> +       }
>         return offset;
>  }
>
> @@ -1253,7 +1405,7 @@ asmlinkage long sys_swapoff(const char _
>         spin_lock(&swap_lock);
>         for (type = swap_list.head; type >= 0; type = swap_info[type].next) {
>                 p = swap_info + type;
> -               if ((p->flags & SWP_ACTIVE) == SWP_ACTIVE) {
> +               if (p->flags & SWP_WRITEOK) {
>                         if (p->swap_file->f_mapping == mapping)
>                                 break;
>                 }
> @@ -1687,6 +1839,8 @@ asmlinkage long sys_swapon(const char __
>                 error = -EINVAL;
>                 goto bad_swap;
>         }
> +       if (discard_swap(p) == 0)
> +               p->flags |= SWP_DISCARDABLE;
>
>         mutex_lock(&swapon_mutex);
>         spin_lock(&swap_lock);
> @@ -1696,7 +1850,7 @@ asmlinkage long sys_swapon(const char __
>         else
>                 p->prio = --least_priority;
>         p->swap_map = swap_map;
> -       p->flags = SWP_ACTIVE;
> +       p->flags |= SWP_WRITEOK;
>         nr_swap_pages += nr_good_pages;
>         total_swap_pages += nr_good_pages;

-- 
Jens Axboe