On Sat, Nov 17, 2012 at 2:04 AM, Ezequiel Garcia <elezegarcia@xxxxxxxxx> wrote: > Hi, > > Thanks for the patch! > > Though I haven't read the code in detail, I have a few minor comments to make. Thanks for the comments. This is my first big patch. I have only sent trivial patches before. Sorry for the mistakes I will resend with formatted patch later. I realize the code is not perfect. There are also some very long functions. For this thread, I request people to kindly ignore code format. Some general comments with respect to design, any reference papers/code i should be aware of, or ideas to improve performance(particularly random IO performance) will be very helpful for me. thanks, mugunthan > > See below. > > On Fri, Nov 16, 2012 at 4:34 PM, srimugunthan dhandapani > <srimugunthan.dhandapani@xxxxxxxxx> wrote: >> Hi all, >> >> Due to fundamental limits like size-per-chip and interface speed >> limits all large capacity Flash are made of multiple chips or banks. >> The presence of multiple chips readily offers parallel read or write support. >> Unlike an SSD, for a raw flash card , this parallelism is visible to >> the software layer and there are many opportunities >> for exploiting this parallelism. >> >> The presented LFTL is meant for flash cards with multiple banks and >> larger minimum write sizes. >> LFTL mostly reuses code from mtd_blkdevs.c and mtdblock.c. >> The LFTL was tested on a 512GB raw flash card which has no firmware >> for wearlevelling or garbage collection. >> >> The following are the important points regarding the LFTL: >> >> 1. multiqueued/multithreaded design:(Thanks to Joern engel for a >> mail-discussion) >> The mtd_blkdevs.c dequeues block I/O requests from the block layer >> provided request queue from a single kthread. >> This design of IO requests dequeued from a single queue by a single >> thread is a bottleneck for flash cards that supports hundreds of MB/sec. >> We use a multiqueued and multithreaded design. >> We bypass the block layer by registering a new make_request and >> the LFTL maintains several queues of its own and the block IO requests are >> put in one of these queues. For every queue there is an associated kthread >> that processes requests from that queue. The number of "FTL IO kthreads" >> is #defined as 64 currently. >> >> 2. Block allocation and Garbage collection: >> The block allocation algorithm allocates blocks across the banks and >> so the multiple FTL IO kthreads can write to different banks >> simultaneously. With this design we were able to get upto 300MB/sec( >> although some more bandwidth is possible from the device) >> Further the block allocation tries to exclude any bank that is being >> garbage collected and tries to allocate the block on bank that is >> idle. Similar to IO, the garbage collection is performed from several >> co-running threads. The garbage collection is started on a bank where >> there is no I/O happening. By scheduling the writes and garbage collection >> on the different banks, we can considerably avoid garbage collection latency >> in the write path. >> >> >> 3. Buffering within FTL : >> The LFTL exposes a block size of 4K. The flash card that we used has >> a minimum writesize of 32K. So we use several caching buffers inside >> the FTL to accumulate the 4Ks to a single 32K and then flush to the flash. >> The number of caching buffers is #defined as 256 but they can be >> tunable. The per-buffer size is set to page size of 32K. >> First the write requests are absorbed in these multiple FTL >> cache-buffers before they are flushed to flash device. >> We also flush the buffers that are not modified for a considerable >> amount of time from a separate background thread. >> >> 4. Adaptive garbage collection: >> The amount of garbage collection that is happening at a particular >> instant is controlled by the number of active garbage collecting threads. >> The number of active garbage collecting threads is varied according to the >> current I/O load on the device. The current IO load level is inferred from >> the number of active FTL IO kthreads. During idle times(number of active FTL >> IO kthreads = 0), the number of active garbage collecting threads is >> increased to maximum. When all FTL IO threads are active, we reduce >> the number of active garbage collecting threads to atmost one >> >> >> 5. Checkpointing: >> Similar to Yaffs2, LFTL during module unload checkpoints the important >> data structures and during reload they are used to reconstruct the FTL >> datastructures. The initial device scan is also parallelised and searching for >> checkpoint block during module load also happens from multiple >> kthreads. The checkpoint information is stored in a chained fashion with one >> checkpoint block pointing to the next. The first checkpoint block is >> guaranteed to be in the first or last few blocks of any bank. >> So the parallel running threads only need to find the first checkpoint block >> and we subsequently follow the linked chain. >> >> >> With LFTL we were able to get upto 300MB/sec for sequential workload. >> The device actually can support more than 300MB/sec and the LFTL still >> needs improvement (and more testing). >> >> Despite the code being far from perfect, i am sending the patch to >> get some initial feedback comments. >> Thanks in advance for your inputs for improving the FTL. >> -mugunthan >> >> Signed-off-by: srimugunthan <srimugunthan.dhandapani@xxxxxxxxx> >> --- >> >> diff --git a/drivers/mtd/Kconfig b/drivers/mtd/Kconfig >> index 4be8373..c68b5d2 100644 >> --- a/drivers/mtd/Kconfig >> +++ b/drivers/mtd/Kconfig >> @@ -237,6 +237,15 @@ config NFTL >> hardware, although under the terms of the GPL you're obviously >> permitted to copy, modify and distribute the code as you wish. Just >> not use it. >> + >> +config LFTL >> + tristate "LFTL (FTL for parallel IO flash card) support" >> + >> + ---help--- >> + This provides support for the NAND Flash Translation Layer which is >> + meant for large capacity Raw flash cards with parallel I/O >> + capability >> + >> >> config NFTL_RW >> bool "Write support for NFTL" >> diff --git a/drivers/mtd/Makefile b/drivers/mtd/Makefile >> index 39664c4..8d36339 100644 >> --- a/drivers/mtd/Makefile >> +++ b/drivers/mtd/Makefile >> @@ -19,6 +19,7 @@ obj-$(CONFIG_MTD_BLOCK) += mtdblock.o >> obj-$(CONFIG_MTD_BLOCK_RO) += mtdblock_ro.o >> obj-$(CONFIG_FTL) += ftl.o >> obj-$(CONFIG_NFTL) += nftl.o >> +obj-$(CONFIG_LFTL) += lftl.o >> obj-$(CONFIG_INFTL) += inftl.o >> obj-$(CONFIG_RFD_FTL) += rfd_ftl.o >> obj-$(CONFIG_SSFDC) += ssfdc.o >> diff --git a/drivers/mtd/lftl.c b/drivers/mtd/lftl.c >> new file mode 100644 >> index 0000000..7f446e0 >> --- /dev/null >> +++ b/drivers/mtd/lftl.c >> @@ -0,0 +1,6417 @@ >> +/* >> + * lftl: A FTL for Multibanked flash cards with parallel I/O capability >> + * >> + * >> + * this file heavily reuses code from linux-mtd layer >> + * Modified over the files >> + * 1. mtdblock.c (authors: David Woodhouse <dwmw2@xxxxxxxxxxxxx> >> and Nicolas Pitre <nico@xxxxxxxxxxx>) >> + * 2. mtd_blkdevs.c(authors: David Woodhouse <dwmw2@xxxxxxxxxxxxx>) >> + * code reuse from from urcu library for lock-free queue (author: >> Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>) >> + * author of this file: Srimugunthan Dhandapani >> <srimugunthan.dhandapani@xxxxxxxxx> >> + * >> + * follows the same licensing of mtdblock.c and mtd_blkdevs.c >> + * >> + * This program is free software; you can redistribute it and/or modify >> + * it under the terms of the GNU General Public License as published by >> + * the Free Software Foundation; either version 2 of the License, or >> + * (at your option) any later version. >> + * >> + * This program is distributed in the hope that it will be useful, >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the >> + * GNU General Public License for more details. >> + * >> + * You should have received a copy of the GNU General Public License >> + * along with this program; if not, write to the Free Software >> + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA >> + * >> + */ >> + >> +#include <linux/kernel.h> >> +#include <linux/slab.h> >> +#include <linux/module.h> >> +#include <linux/list.h> >> +#include <linux/fs.h> >> + >> +#include <linux/mtd/mtd.h> >> +#include <linux/blkdev.h> >> +#include <linux/blkpg.h> >> +#include <linux/spinlock.h> >> +#include <linux/hdreg.h> >> +#include <linux/init.h> >> +#include <linux/mutex.h> >> +#include <linux/kthread.h> >> +#include <asm/uaccess.h> >> +#include <linux/random.h> >> +#include <linux/kthread.h> >> +#include "lftl.h" >> + >> + >> + >> + >> + >> + >> + >> + >> + > > Why these blank lines? > > >> +#define INVALID_VALUE (-1) >> + >> +#define MAX_FTL_CACHEBUFS 256 >> + >> +#define STATE_EMPTY 0 >> +#define STATE_DIRTY 1 >> +#define STATE_FULL 2 >> +#define STATE_CLEAN 3 >> + >> +#define GC_THRESH 100000 >> +#define INVALID -1 >> +#define RAND_SEL -2 >> + >> +#define ACCEPTABLE_THRESH 256 >> + >> +#define INVALID_PAGE_NUMBER (0xFFFFFFFF) >> + >> + >> +#define INVALID_CACHE_NUM 0x7F >> +#define INVALID_SECT_NUM 0xFFFFFF >> + >> +#define NO_BUF_FOR_USE -1 >> +#define NUM_FREE_BUFS_THRESH 5 >> + >> + >> + > > More blank lines here... and there are many more. > They make the code ugly, at least to me. > > >> +#define BLK_BITMAP_SIZE 4096 >> + >> + >> +#define GC_NUM_TOBE_SCHEDULED 2 >> + >> +#define DATA_BLK 1 >> +#define MAP_BLK 2 >> +#define FREE_BLK 0xFFFFFFFF >> +#define NUM_GC_LEVELS 3 >> +#define GC_LEVEL0 0 >> +#define GC_LEVEL1 1 >> +#define GC_LEVEL2 2 >> + >> +#define GC_OP 0 >> +#define WR_OP 1 >> +#define PFTCH_OP 2 >> +#define RD_OP 3 >> + >> + >> +#define INVALID_PAGE_NUMBER_32 0xFFFFFFFF >> + >> +#define NUM_GC_THREAD 8 >> + >> +#define MAX_NUM_PLL_BANKS 64 >> + >> +#define MAX_PAGES_PER_BLK 64 >> + >> +#define CKPT_RANGE 10 >> + >> +uint32_t numpllbanks = MAX_NUM_PLL_BANKS; >> + >> +module_param(numpllbanks, int, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH); >> +MODULE_PARM_DESC(numpllbanks, "Number of parallel bank units in the flash"); >> + >> +uint32_t first_time = 0; >> +module_param(first_time, int, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH); >> +MODULE_PARM_DESC(numpllbanks, "boolean value, if the module is loaded >> firsttime"); >> + >> + >> +static atomic_t num_gcollected; >> +static atomic_t gc_on_writes_collisions; >> +static atomic_t num_gc_wakeups; >> + >> +static atomic_t num_l0_gcollected; >> +static atomic_t num_l1_gcollected; >> +static atomic_t num_l2_gcollected; >> +static atomic_t num_erase_gcollected; >> +static atomic_t num_cperase_gcollected; >> + >> +static atomic_t num_gc_threads; >> + >> + >> + >> + >> + >> + >> +static unsigned long *page_bitmap; >> +static unsigned long *page_incache_bitmap; >> +static unsigned long *maptab_bitmap; >> +static unsigned long *gc_map; >> +static unsigned long *gc_bankbitmap; >> + >> + >> + >> +struct extra_info_struct >> +{ >> + uint64_t sequence_number; >> +}; >> + >> +struct extra_info_struct *extra_info; >> + >> +struct ftlcache >> +{ >> + >> + uint8_t cache_state; >> + unsigned long cache_offset; >> + unsigned long sect_idx; >> + unsigned long page_idx; >> + uint32_t logic_page; >> + long unsigned int written_mask; >> + >> + >> + atomic_t writes_in_progress ; >> + atomic_t flush_in_progress; >> + atomic_t wait_to_flush; >> + unsigned long last_touch; >> + >> +}__attribute__((packed)); >> + >> + >> + >> +struct cur_wr_info{ >> + uint32_t first_blk; >> + uint32_t last_blk; >> + uint32_t last_gc_blk; >> + uint32_t blk; >> + uint8_t state; >> + uint8_t last_wrpage; >> + int centroid; >> +}; >> + >> +struct scan_thrd_info >> +{ >> + struct lftlblk_dev *mtdblk; >> + int bank; >> +}; >> + >> + >> +struct rw_semaphore map_tabl_lock; >> + >> +static uint64_t *map_table; >> + >> +static uint64_t *reverse_map_tab; >> +static uint64_t *scanseqnumber; >> + >> +uint64_t buf_lookup_tab[MAX_FTL_CACHEBUFS]; >> + >> + >> + >> +static struct kmem_cache *qnode_cache; >> + >> +static struct lfq_queue_rcu empty_bufsq; >> +static struct lfq_queue_rcu full_bufsq; >> + >> +static struct lfq_queue_rcu spare_bufQ; >> + >> +static struct lfq_queue_rcu spare_oobbufQ; >> + >> + >> +void *spare_cache_list_ptr[2*MAX_FTL_CACHEBUFS]; >> +void *spare_oobbuf_list_ptr[2*MAX_FTL_CACHEBUFS]; >> + >> + >> +int scheduled_for_gc[GC_NUM_TOBE_SCHEDULED]; >> + >> + >> + >> + >> + >> +struct per_bank_info >> +{ >> + atomic_t perbank_nfree_blks; >> + atomic_t perbank_ndirty_pages; >> +}; >> + >> +struct per_blk_info >> +{ >> + >> + atomic_t num_valid_pages; >> + DECLARE_BITMAP(valid_pages_map, MAX_PAGES_PER_BLK ); >> +}; >> + >> + >> +struct oob_data >> +{ >> + char blk_type; /* Status of the block: data pages/map pages/unused */ >> + uint32_t logic_page_num; >> + int32_t seq_number; /* The sequence number of this block */ >> + >> +}__attribute__((packed)); >> + >> + >> + >> + >> +struct per_blk_info *blk_info; >> + >> +struct per_bank_info bank_info[MAX_NUM_PLL_BANKS]; >> + >> + >> + >> +struct bank_activity_matrix >> +{ >> + atomic_t num_reads[MAX_NUM_PLL_BANKS]; >> + atomic_t num_writes[MAX_NUM_PLL_BANKS]; >> + atomic_t gc_goingon[MAX_NUM_PLL_BANKS]; >> + atomic_t num_reads_pref[MAX_NUM_PLL_BANKS]; >> +}; >> + >> + >> + >> + >> + >> + >> + >> +static atomic_t activenumgcthread; >> + >> + >> + >> + struct lftlblk_dev; >> + >> + struct gcthread_arg_data >> + { >> + int thrdnum; >> + struct lftlblk_dev *mtdblk_ptr; >> + }; >> + >> + struct lftlblk_dev { >> + struct lftl_blktrans_dev mbd; >> + int count; >> + unsigned int cache_size; >> + atomic_t freeblk_count; >> + uint32_t num_blks; >> + uint32_t num_cur_wr_blks; >> + >> + unsigned long *free_blk_map; >> + >> + uint64_t ckptrd_mask; >> + >> + uint32_t blksize; >> + uint8_t blkshift; >> + uint8_t pageshift; >> + uint32_t num_parallel_banks; >> + uint32_t blks_per_bank; >> + uint32_t pages_per_blk; >> + uint32_t num_total_pages; >> + >> + struct cur_wr_info cur_writing[MAX_NUM_PLL_BANKS]; >> + struct rw_semaphore cur_wr_state[MAX_NUM_PLL_BANKS]; >> + >> + >> + struct rw_semaphore **free_map_lock; >> + >> + >> + >> + struct mutex select_buf_lock; >> + >> + uint8_t *exper_buf; >> + uint8_t *FFbuf; >> + int exper_buf_sect_idx; >> + struct mutex exper_buf_lock; >> + struct mutex flush_buf_lock; >> + uint8_t *buf[MAX_FTL_CACHEBUFS]; >> + struct mutex buf_lock[MAX_FTL_CACHEBUFS]; >> + struct ftlcache cached_buf[MAX_FTL_CACHEBUFS]; >> + >> + >> + struct mutex buf_lookup_tab_mutex; >> + uint64_t cache_fullmask; >> + >> + atomic_t cache_assign_count; >> + atomic_t seq_num; >> + >> + >> + struct bank_activity_matrix activity_matrix; >> + struct task_struct *bufflushd; >> + int gc_thresh[NUM_GC_LEVELS]; >> + struct task_struct *ftlgc_thrd[NUM_GC_THREAD]; >> + int reserved_blks_per_bank; >> + >> + int first_ckpt_blk; >> + >> + >> + int hwblks_per_bank; >> + unsigned long last_wr_time; >> + >> + unsigned long *gc_active_map; >> + >> + int init_not_done; >> + struct gcthread_arg_data gcthrd_arg[NUM_GC_THREAD]; >> + >> + }; >> + >> + static struct mutex mtdblks_lock; >> + >> +#define lftl_assert(expr) do { \ >> + if (unlikely(!(expr))) { >> \ >> + printk(KERN_CRIT "lftl: assert failed in %s at %u >> (pid %d)\n", \ >> + __func__, __LINE__, current->pid); >> \ >> + dump_stack(); >> \ >> + } >> \ >> + } while (0) >> + >> +extern struct mutex mtd_table_mutex; >> +extern struct mtd_info *__mtd_next_device(int i); >> + >> +#define mtd_for_each_device(mtd) \ >> + for ((mtd) = __mtd_next_device(0); \ >> + (mtd) != NULL; \ >> + (mtd) = __mtd_next_device(mtd->index + 1)) >> + >> + >> +static LIST_HEAD(blktrans_majors); >> +static DEFINE_MUTEX(blktrans_ref_mutex); >> + >> + >> +static struct kmem_cache *lftlbiolist_cachep; >> +static mempool_t *biolistpool; >> +static struct lfq_queue_rcu rcuqu[VIRGO_NUM_MAX_REQ_Q]; >> +static uint32_t last_lpn[VIRGO_NUM_MAX_REQ_Q]; >> + >> +struct bio_node { >> + struct lfq_node_rcu list; >> + struct rcu_head rcu; >> + struct bio *bio; >> +}; >> + >> +#ifdef MAP_TABLE_ONE_LOCK >> + #define map_table_lock(pagenum) { down_read(&(map_tabl_lock)); } >> + #define map_table_unlock(pagenum) { up_read(&(map_tabl_lock)); } >> +#else >> + #define map_table_lock(pagenum) do{ while >> (test_and_set_bit((pagenum), maptab_bitmap) != 0){ \ >> + schedule(); \ >> + } \ >> + }while(0) >> + >> + #define map_table_unlock(pagenum) do{ if >> (test_and_clear_bit((pagenum), maptab_bitmap) == 0) \ >> + { \ >> + printk(KERN_ERR "lftl: mapbitmap cleared wrong"); \ >> + BUG(); \ >> + }\ >> + }while(0) >> +#endif >> + >> +void lftl_blktrans_dev_release(struct kref *kref) >> +{ >> + struct lftl_blktrans_dev *dev = >> + container_of(kref, struct lftl_blktrans_dev, ref); >> + >> + dev->disk->private_data = NULL; >> + blk_cleanup_queue(dev->rq); >> + put_disk(dev->disk); >> + list_del(&dev->list); >> + kfree(dev); >> +} >> + >> +static struct lftl_blktrans_dev *lftl_blktrans_dev_get(struct gendisk *disk) >> +{ >> + struct lftl_blktrans_dev *dev; >> + >> + mutex_lock(&blktrans_ref_mutex); >> + dev = disk->private_data; >> + >> + if (!dev) >> + goto unlock; >> + kref_get(&dev->ref); >> +unlock: >> + mutex_unlock(&blktrans_ref_mutex); >> + return dev; >> +} >> + >> +void lftl_blktrans_dev_put(struct lftl_blktrans_dev *dev) >> +{ >> + mutex_lock(&blktrans_ref_mutex); >> + kref_put(&dev->ref, lftl_blktrans_dev_release); >> + mutex_unlock(&blktrans_ref_mutex); >> +} >> + >> + >> + >> + >> + >> + >> + >> +void init_device_queues(struct lftl_blktrans_dev *dev) >> +{ >> + >> + int i; >> + >> + >> + lftlbiolist_cachep = kmem_cache_create("mybioQ", >> + sizeof(struct lftl_bio_list), 0, SLAB_PANIC, NULL); >> + >> + biolistpool = mempool_create(BLKDEV_MIN_RQ, mempool_alloc_slab, >> + mempool_free_slab, lftlbiolist_cachep); >> + for(i = 0;i < VIRGO_NUM_MAX_REQ_Q;i++) >> + INIT_LIST_HEAD(&dev->qu[i].qelem_ptr); >> + >> + >> + >> + >> + >> + for(i = 0;i < VIRGO_NUM_MAX_REQ_Q;i++) >> + lfq_init_rcu(&rcuqu[i], call_rcu); >> + >> + for(i = 0;i < VIRGO_NUM_MAX_REQ_Q;i++) >> + spin_lock_init(&dev->mybioq_lock[i]); >> +} >> + >> +void deinit_device_queues(struct lftl_blktrans_dev *dev) >> +{ >> + mempool_destroy(biolistpool); >> + >> + kmem_cache_destroy(lftlbiolist_cachep); >> + >> +} >> + >> + >> +static int lftl_make_request(struct request_queue *rq, struct bio *bio) >> +{ >> + >> + struct lftl_blktrans_dev *dev; >> + int qnum; >> + gfp_t gfp_mask; >> + struct lftl_bio_list *tmp; >> + unsigned long temp_rand; >> + >> + int i; >> + int found; >> + >> + >> + uint32_t lpn; >> + >> + >> + dev = rq->queuedata; >> + >> + if (dev == NULL) >> + goto fail; >> + if(bio_data_dir(bio) == WRITE) >> + { > > This is not the correct coding style. > Please read Documentation/CodingStyle (and perhaps > Documentation/SubmittingPatches) > > Also, it seems to me you haven't passed this code through > ./scripts/checkpatch.pl. > Regularly, every patch sent should have no errors or warnings reported > by ./scripts/checkpatch.pl. > > Hope this helps, > > Ezequiel -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html