Hi, Thanks for the patch! Though I haven't read the code in detail, I have a few minor comments. See below. On Fri, Nov 16, 2012 at 4:34 PM, srimugunthan dhandapani <srimugunthan.dhandapani@xxxxxxxxx> wrote:
> Hi all,
>
> Due to fundamental limits like per-chip size and interface speed,
> all large-capacity flash devices are made of multiple chips or banks.
> The presence of multiple chips readily offers parallel read or write support.
> Unlike in an SSD, on a raw flash card this parallelism is visible to
> the software layer, and there are many opportunities
> for exploiting it.
>
> The presented LFTL is meant for flash cards with multiple banks and
> larger minimum write sizes.
> LFTL mostly reuses code from mtd_blkdevs.c and mtdblock.c.
> The LFTL was tested on a 512GB raw flash card which has no firmware
> for wear levelling or garbage collection.
>
> The following are the important points regarding LFTL:
>
> 1. Multiqueued/multithreaded design (thanks to Joern Engel for a
> mail discussion):
> mtd_blkdevs.c dequeues block I/O requests from the request queue
> provided by the block layer in a single kthread.
> This design, with I/O requests dequeued from a single queue by a single
> thread, is a bottleneck for flash cards that support hundreds of MB/sec.
> We use a multiqueued and multithreaded design.
> We bypass the block layer by registering a new make_request function;
> LFTL maintains several queues of its own, and incoming block I/O requests are
> put in one of these queues. For every queue there is an associated kthread
> that processes requests from that queue. The number of "FTL IO kthreads"
> is currently #defined as 64.
>
> 2. Block allocation and garbage collection:
> The block allocation algorithm spreads blocks across the banks, so
> the multiple FTL IO kthreads can write to different banks
> simultaneously.
With this design we were able to get up to 300MB/sec
> (although some more bandwidth is possible from the device).
> Further, the block allocation tries to exclude any bank that is being
> garbage collected and tries to allocate the block on a bank that is
> idle. Similar to I/O, garbage collection is performed from several
> co-running threads. Garbage collection is started on a bank where
> there is no I/O happening. By scheduling the writes and garbage collection
> on different banks, we can largely keep garbage-collection latency
> out of the write path.
>
>
> 3. Buffering within the FTL:
> The LFTL exposes a block size of 4K. The flash card that we used has
> a minimum write size of 32K. So we use several caching buffers inside
> the FTL to accumulate the 4K writes into a single 32K page and then flush to the flash.
> The number of caching buffers is #defined as 256 but can be made
> tunable. The per-buffer size is set to the page size of 32K.
> First the write requests are absorbed in these multiple FTL
> cache buffers before they are flushed to the flash device.
> We also flush buffers that have not been modified for a considerable
> amount of time from a separate background thread.
>
> 4. Adaptive garbage collection:
> The amount of garbage collection happening at a particular
> instant is controlled by the number of active garbage-collecting threads.
> The number of active garbage-collecting threads is varied according to the
> current I/O load on the device. The current I/O load level is inferred from
> the number of active FTL IO kthreads. During idle times (number of active FTL
> IO kthreads = 0), the number of active garbage-collecting threads is
> increased to the maximum. When all FTL IO threads are active, we reduce
> the number of active garbage-collecting threads to at most one.
>
>
> 5. Checkpointing:
> Similar to yaffs2, LFTL checkpoints the important data structures
> during module unload, and during reload they are used to reconstruct the FTL
> data structures.
The initial device scan is also parallelised: searching for the
> checkpoint block during module load happens from multiple
> kthreads. The checkpoint information is stored in a chained fashion, with one
> checkpoint block pointing to the next. The first checkpoint block is
> guaranteed to be in the first or last few blocks of any bank.
> So the parallel scanning threads only need to find the first checkpoint block,
> and we subsequently follow the linked chain.
>
>
> With LFTL we were able to get up to 300MB/sec for a sequential workload.
> The device can actually support more than 300MB/sec, and the LFTL still
> needs improvement (and more testing).
>
> Despite the code being far from perfect, I am sending the patch to
> get some initial feedback.
> Thanks in advance for your input on improving the FTL.
> -mugunthan
>
> Signed-off-by: srimugunthan <srimugunthan.dhandapani@xxxxxxxxx>
> ---
>
> diff --git a/drivers/mtd/Kconfig b/drivers/mtd/Kconfig
> index 4be8373..c68b5d2 100644
> --- a/drivers/mtd/Kconfig
> +++ b/drivers/mtd/Kconfig
> @@ -237,6 +237,15 @@ config NFTL
> hardware, although under the terms of the GPL you're obviously
> permitted to copy, modify and distribute the code as you wish. Just
> not use it.
> + > +config LFTL > + tristate "LFTL (FTL for parallel IO flash card) support" > + > + ---help--- > + This provides support for the NAND Flash Translation Layer which is > + meant for large capacity Raw flash cards with parallel I/O > + capability > + > > config NFTL_RW > bool "Write support for NFTL" > diff --git a/drivers/mtd/Makefile b/drivers/mtd/Makefile > index 39664c4..8d36339 100644 > --- a/drivers/mtd/Makefile > +++ b/drivers/mtd/Makefile > @@ -19,6 +19,7 @@ obj-$(CONFIG_MTD_BLOCK) += mtdblock.o > obj-$(CONFIG_MTD_BLOCK_RO) += mtdblock_ro.o > obj-$(CONFIG_FTL) += ftl.o > obj-$(CONFIG_NFTL) += nftl.o > +obj-$(CONFIG_LFTL) += lftl.o > obj-$(CONFIG_INFTL) += inftl.o > obj-$(CONFIG_RFD_FTL) += rfd_ftl.o > obj-$(CONFIG_SSFDC) += ssfdc.o > diff --git a/drivers/mtd/lftl.c b/drivers/mtd/lftl.c > new file mode 100644 > index 0000000..7f446e0 > --- /dev/null > +++ b/drivers/mtd/lftl.c > @@ -0,0 +1,6417 @@ > +/* > + * lftl: A FTL for Multibanked flash cards with parallel I/O capability > + * > + * > + * this file heavily reuses code from linux-mtd layer > + * Modified over the files > + * 1. mtdblock.c (authors: David Woodhouse <dwmw2@xxxxxxxxxxxxx> > and Nicolas Pitre <nico@xxxxxxxxxxx>) > + * 2. mtd_blkdevs.c(authors: David Woodhouse <dwmw2@xxxxxxxxxxxxx>) > + * code reuse from from urcu library for lock-free queue (author: > Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>) > + * author of this file: Srimugunthan Dhandapani > <srimugunthan.dhandapani@xxxxxxxxx> > + * > + * follows the same licensing of mtdblock.c and mtd_blkdevs.c > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation; either version 2 of the License, or > + * (at your option) any later version. 
> + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + * > + * You should have received a copy of the GNU General Public License > + * along with this program; if not, write to the Free Software > + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA > + * > + */ > + > +#include <linux/kernel.h> > +#include <linux/slab.h> > +#include <linux/module.h> > +#include <linux/list.h> > +#include <linux/fs.h> > + > +#include <linux/mtd/mtd.h> > +#include <linux/blkdev.h> > +#include <linux/blkpg.h> > +#include <linux/spinlock.h> > +#include <linux/hdreg.h> > +#include <linux/init.h> > +#include <linux/mutex.h> > +#include <linux/kthread.h> > +#include <asm/uaccess.h> > +#include <linux/random.h> > +#include <linux/kthread.h> > +#include "lftl.h" > + > + > + > + > + > + > + > + > + Why these blank lines? > +#define INVALID_VALUE (-1) > + > +#define MAX_FTL_CACHEBUFS 256 > + > +#define STATE_EMPTY 0 > +#define STATE_DIRTY 1 > +#define STATE_FULL 2 > +#define STATE_CLEAN 3 > + > +#define GC_THRESH 100000 > +#define INVALID -1 > +#define RAND_SEL -2 > + > +#define ACCEPTABLE_THRESH 256 > + > +#define INVALID_PAGE_NUMBER (0xFFFFFFFF) > + > + > +#define INVALID_CACHE_NUM 0x7F > +#define INVALID_SECT_NUM 0xFFFFFF > + > +#define NO_BUF_FOR_USE -1 > +#define NUM_FREE_BUFS_THRESH 5 > + > + > + More blank lines here... and there are many more. They make the code ugly, at least to me. 
> +#define BLK_BITMAP_SIZE 4096 > + > + > +#define GC_NUM_TOBE_SCHEDULED 2 > + > +#define DATA_BLK 1 > +#define MAP_BLK 2 > +#define FREE_BLK 0xFFFFFFFF > +#define NUM_GC_LEVELS 3 > +#define GC_LEVEL0 0 > +#define GC_LEVEL1 1 > +#define GC_LEVEL2 2 > + > +#define GC_OP 0 > +#define WR_OP 1 > +#define PFTCH_OP 2 > +#define RD_OP 3 > + > + > +#define INVALID_PAGE_NUMBER_32 0xFFFFFFFF > + > +#define NUM_GC_THREAD 8 > + > +#define MAX_NUM_PLL_BANKS 64 > + > +#define MAX_PAGES_PER_BLK 64 > + > +#define CKPT_RANGE 10 > + > +uint32_t numpllbanks = MAX_NUM_PLL_BANKS; > + > +module_param(numpllbanks, int, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH); > +MODULE_PARM_DESC(numpllbanks, "Number of parallel bank units in the flash"); > + > +uint32_t first_time = 0; > +module_param(first_time, int, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH); > +MODULE_PARM_DESC(numpllbanks, "boolean value, if the module is loaded > firsttime"); > + > + > +static atomic_t num_gcollected; > +static atomic_t gc_on_writes_collisions; > +static atomic_t num_gc_wakeups; > + > +static atomic_t num_l0_gcollected; > +static atomic_t num_l1_gcollected; > +static atomic_t num_l2_gcollected; > +static atomic_t num_erase_gcollected; > +static atomic_t num_cperase_gcollected; > + > +static atomic_t num_gc_threads; > + > + > + > + > + > + > +static unsigned long *page_bitmap; > +static unsigned long *page_incache_bitmap; > +static unsigned long *maptab_bitmap; > +static unsigned long *gc_map; > +static unsigned long *gc_bankbitmap; > + > + > + > +struct extra_info_struct > +{ > + uint64_t sequence_number; > +}; > + > +struct extra_info_struct *extra_info; > + > +struct ftlcache > +{ > + > + uint8_t cache_state; > + unsigned long cache_offset; > + unsigned long sect_idx; > + unsigned long page_idx; > + uint32_t logic_page; > + long unsigned int written_mask; > + > + > + atomic_t writes_in_progress ; > + atomic_t flush_in_progress; > + atomic_t wait_to_flush; > + unsigned long last_touch; > + > +}__attribute__((packed)); > 
+ > + > + > +struct cur_wr_info{ > + uint32_t first_blk; > + uint32_t last_blk; > + uint32_t last_gc_blk; > + uint32_t blk; > + uint8_t state; > + uint8_t last_wrpage; > + int centroid; > +}; > + > +struct scan_thrd_info > +{ > + struct lftlblk_dev *mtdblk; > + int bank; > +}; > + > + > +struct rw_semaphore map_tabl_lock; > + > +static uint64_t *map_table; > + > +static uint64_t *reverse_map_tab; > +static uint64_t *scanseqnumber; > + > +uint64_t buf_lookup_tab[MAX_FTL_CACHEBUFS]; > + > + > + > +static struct kmem_cache *qnode_cache; > + > +static struct lfq_queue_rcu empty_bufsq; > +static struct lfq_queue_rcu full_bufsq; > + > +static struct lfq_queue_rcu spare_bufQ; > + > +static struct lfq_queue_rcu spare_oobbufQ; > + > + > +void *spare_cache_list_ptr[2*MAX_FTL_CACHEBUFS]; > +void *spare_oobbuf_list_ptr[2*MAX_FTL_CACHEBUFS]; > + > + > +int scheduled_for_gc[GC_NUM_TOBE_SCHEDULED]; > + > + > + > + > + > +struct per_bank_info > +{ > + atomic_t perbank_nfree_blks; > + atomic_t perbank_ndirty_pages; > +}; > + > +struct per_blk_info > +{ > + > + atomic_t num_valid_pages; > + DECLARE_BITMAP(valid_pages_map, MAX_PAGES_PER_BLK ); > +}; > + > + > +struct oob_data > +{ > + char blk_type; /* Status of the block: data pages/map pages/unused */ > + uint32_t logic_page_num; > + int32_t seq_number; /* The sequence number of this block */ > + > +}__attribute__((packed)); > + > + > + > + > +struct per_blk_info *blk_info; > + > +struct per_bank_info bank_info[MAX_NUM_PLL_BANKS]; > + > + > + > +struct bank_activity_matrix > +{ > + atomic_t num_reads[MAX_NUM_PLL_BANKS]; > + atomic_t num_writes[MAX_NUM_PLL_BANKS]; > + atomic_t gc_goingon[MAX_NUM_PLL_BANKS]; > + atomic_t num_reads_pref[MAX_NUM_PLL_BANKS]; > +}; > + > + > + > + > + > + > + > +static atomic_t activenumgcthread; > + > + > + > + struct lftlblk_dev; > + > + struct gcthread_arg_data > + { > + int thrdnum; > + struct lftlblk_dev *mtdblk_ptr; > + }; > + > + struct lftlblk_dev { > + struct lftl_blktrans_dev mbd; > + int 
count; > + unsigned int cache_size; > + atomic_t freeblk_count; > + uint32_t num_blks; > + uint32_t num_cur_wr_blks; > + > + unsigned long *free_blk_map; > + > + uint64_t ckptrd_mask; > + > + uint32_t blksize; > + uint8_t blkshift; > + uint8_t pageshift; > + uint32_t num_parallel_banks; > + uint32_t blks_per_bank; > + uint32_t pages_per_blk; > + uint32_t num_total_pages; > + > + struct cur_wr_info cur_writing[MAX_NUM_PLL_BANKS]; > + struct rw_semaphore cur_wr_state[MAX_NUM_PLL_BANKS]; > + > + > + struct rw_semaphore **free_map_lock; > + > + > + > + struct mutex select_buf_lock; > + > + uint8_t *exper_buf; > + uint8_t *FFbuf; > + int exper_buf_sect_idx; > + struct mutex exper_buf_lock; > + struct mutex flush_buf_lock; > + uint8_t *buf[MAX_FTL_CACHEBUFS]; > + struct mutex buf_lock[MAX_FTL_CACHEBUFS]; > + struct ftlcache cached_buf[MAX_FTL_CACHEBUFS]; > + > + > + struct mutex buf_lookup_tab_mutex; > + uint64_t cache_fullmask; > + > + atomic_t cache_assign_count; > + atomic_t seq_num; > + > + > + struct bank_activity_matrix activity_matrix; > + struct task_struct *bufflushd; > + int gc_thresh[NUM_GC_LEVELS]; > + struct task_struct *ftlgc_thrd[NUM_GC_THREAD]; > + int reserved_blks_per_bank; > + > + int first_ckpt_blk; > + > + > + int hwblks_per_bank; > + unsigned long last_wr_time; > + > + unsigned long *gc_active_map; > + > + int init_not_done; > + struct gcthread_arg_data gcthrd_arg[NUM_GC_THREAD]; > + > + }; > + > + static struct mutex mtdblks_lock; > + > +#define lftl_assert(expr) do { \ > + if (unlikely(!(expr))) { > \ > + printk(KERN_CRIT "lftl: assert failed in %s at %u > (pid %d)\n", \ > + __func__, __LINE__, current->pid); > \ > + dump_stack(); > \ > + } > \ > + } while (0) > + > +extern struct mutex mtd_table_mutex; > +extern struct mtd_info *__mtd_next_device(int i); > + > +#define mtd_for_each_device(mtd) \ > + for ((mtd) = __mtd_next_device(0); \ > + (mtd) != NULL; \ > + (mtd) = __mtd_next_device(mtd->index + 1)) > + > + > +static 
LIST_HEAD(blktrans_majors); > +static DEFINE_MUTEX(blktrans_ref_mutex); > + > + > +static struct kmem_cache *lftlbiolist_cachep; > +static mempool_t *biolistpool; > +static struct lfq_queue_rcu rcuqu[VIRGO_NUM_MAX_REQ_Q]; > +static uint32_t last_lpn[VIRGO_NUM_MAX_REQ_Q]; > + > +struct bio_node { > + struct lfq_node_rcu list; > + struct rcu_head rcu; > + struct bio *bio; > +}; > + > +#ifdef MAP_TABLE_ONE_LOCK > + #define map_table_lock(pagenum) { down_read(&(map_tabl_lock)); } > + #define map_table_unlock(pagenum) { up_read(&(map_tabl_lock)); } > +#else > + #define map_table_lock(pagenum) do{ while > (test_and_set_bit((pagenum), maptab_bitmap) != 0){ \ > + schedule(); \ > + } \ > + }while(0) > + > + #define map_table_unlock(pagenum) do{ if > (test_and_clear_bit((pagenum), maptab_bitmap) == 0) \ > + { \ > + printk(KERN_ERR "lftl: mapbitmap cleared wrong"); \ > + BUG(); \ > + }\ > + }while(0) > +#endif > + > +void lftl_blktrans_dev_release(struct kref *kref) > +{ > + struct lftl_blktrans_dev *dev = > + container_of(kref, struct lftl_blktrans_dev, ref); > + > + dev->disk->private_data = NULL; > + blk_cleanup_queue(dev->rq); > + put_disk(dev->disk); > + list_del(&dev->list); > + kfree(dev); > +} > + > +static struct lftl_blktrans_dev *lftl_blktrans_dev_get(struct gendisk *disk) > +{ > + struct lftl_blktrans_dev *dev; > + > + mutex_lock(&blktrans_ref_mutex); > + dev = disk->private_data; > + > + if (!dev) > + goto unlock; > + kref_get(&dev->ref); > +unlock: > + mutex_unlock(&blktrans_ref_mutex); > + return dev; > +} > + > +void lftl_blktrans_dev_put(struct lftl_blktrans_dev *dev) > +{ > + mutex_lock(&blktrans_ref_mutex); > + kref_put(&dev->ref, lftl_blktrans_dev_release); > + mutex_unlock(&blktrans_ref_mutex); > +} > + > + > + > + > + > + > + > +void init_device_queues(struct lftl_blktrans_dev *dev) > +{ > + > + int i; > + > + > + lftlbiolist_cachep = kmem_cache_create("mybioQ", > + sizeof(struct lftl_bio_list), 0, SLAB_PANIC, NULL); > + > + biolistpool = 
mempool_create(BLKDEV_MIN_RQ, mempool_alloc_slab,
> + mempool_free_slab, lftlbiolist_cachep);
> + for(i = 0;i < VIRGO_NUM_MAX_REQ_Q;i++)
> + INIT_LIST_HEAD(&dev->qu[i].qelem_ptr);
> +
> +
> +
> +
> +
> + for(i = 0;i < VIRGO_NUM_MAX_REQ_Q;i++)
> + lfq_init_rcu(&rcuqu[i], call_rcu);
> +
> + for(i = 0;i < VIRGO_NUM_MAX_REQ_Q;i++)
> + spin_lock_init(&dev->mybioq_lock[i]);
> +}
> +
> +void deinit_device_queues(struct lftl_blktrans_dev *dev)
> +{
> + mempool_destroy(biolistpool);
> +
> + kmem_cache_destroy(lftlbiolist_cachep);
> +
> +}
> +
> +
> +static int lftl_make_request(struct request_queue *rq, struct bio *bio)
> +{
> +
> + struct lftl_blktrans_dev *dev;
> + int qnum;
> + gfp_t gfp_mask;
> + struct lftl_bio_list *tmp;
> + unsigned long temp_rand;
> +
> + int i;
> + int found;
> +
> +
> + uint32_t lpn;
> +
> +
> + dev = rq->queuedata;
> +
> + if (dev == NULL)
> + goto fail;
> + if(bio_data_dir(bio) == WRITE)
> + {

This is not the correct coding style. Please read Documentation/CodingStyle (and perhaps Documentation/SubmittingPatches).

Also, it seems to me you haven't passed this code through ./scripts/checkpatch.pl. As a rule, every patch sent should have no errors or warnings reported by ./scripts/checkpatch.pl.

Hope this helps,
Ezequiel
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html