Hello,

I am using 4 drives to build a RAID5 array and create a thin volume on top of it. To get better performance, I use the '-Zn' option of 'lvcreate' so that the thin pool assumes all blocks are already zeroed. The chunk size of both the RAID5 array and the thin pool is 512KB, and stripe_cache_size=4096 on the RAID5. Here is the performance I get when writing to the RAID5 device directly and to the thin volume:

dd if=/dev/zero of=/dev/md5 bs=2M count=1000
1000+0 records in
1000+0 records out
2097152000 bytes (2.1 GB) copied, 6.02630 seconds, 348 MB/s

dd if=/dev/zero of=/dev/mapper/vg1-lv1 bs=2M count=1000
1000+0 records in
1000+0 records out
2097152000 bytes (2.1 GB) copied, 11.58648 seconds, 181 MB/s

To find out what causes the performance to drop so much, I added some traces to the code and got an interesting result. First, the bio size with this dd command is 4KB, so every 128 bios fill up one thin block / RAID chunk in my configuration. With 'pool->pf.zero_new_blocks' set to false, it seems that when a new block is provisioned for a bio, that bio is still put back on the tail of the 'pool->deferred_bios' list rather than being issued immediately. This rearranges the order of the incoming bios. For example, if the bi_sector values of the incoming PAGE_SIZE bios are:

bi_sector: [0, 8, 16, 24, 32, ..., 1024]

then after each of them has been mapped, the order in which they are issued to the lower layer becomes non-sequential:

bi_sector: [8, 16, 24, ..., 136, 144, 152, ..., 1016] + [0, 128, 256, 384, 512, 640, 768, 896, 1024]

As you can see, the bios that triggered provision_block() get rearranged and separated from the other consecutive ones. If the lower-layer device cannot merge them back, this may cause read-modify-write or seek latency overhead.

Based on this observation, I made a rough patch against kernel 3.6 to maintain the sequential order of bios when 'pool->pf.zero_new_blocks' = false:

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index b0a5ed9..76cda40 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1321,6 +1321,7 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
 {
 	struct pool *pool = tc->pool;
 	struct new_mapping *m = get_next_mapping(pool);
+	int r;
 
 	INIT_LIST_HEAD(&m->list);
 	m->quiesced = 1;
@@ -1337,9 +1338,20 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
 	 * zeroing pre-existing data, we can issue the bio immediately.
 	 * Otherwise we use kcopyd to zero the data first.
 	 */
-	if (!pool->pf.zero_new_blocks)
-		process_prepared_mapping(m);
-
+	if (!pool->pf.zero_new_blocks) {
+		r = dm_thin_insert_block(tc->td, m->virt_block, m->data_block, 0);
+		if (r) {
+			DMERR("schedule_zero() failed");
+			cell_error(m->cell);
+		}
+		else {
+			inc_all_io_entry(pool, bio);
+			cell_defer_except(tc, cell);
+			remap_and_issue(tc, bio, data_block);
+		}
+		list_del(&m->list);
+		mempool_free(m, tc->pool->mapping_pool);
+	}
 	else if (io_overwrites_block(pool, bio)) {
 		struct endio_hook *h = dm_get_mapinfo(bio)->ptr;
 		h->overwrite_mapping = m;

With the patch the performance improves as well:

dd if=/dev/zero of=/dev/mapper/vg1-lv1 bs=2M count=1000
1000+0 records in
1000+0 records out
2097152000 bytes (2.1 GB) copied, 6.16819 seconds, 340 MB/s

Since my thin pool is set up with pf->zero_new_blocks = false, I think it is OK to issue a bio immediately once its mapping is known, rather than putting it back on pool->deferred_bios; this way the sequential order is maintained. However, I wonder whether I am missing some cases in this rough patch. Any suggestions would be helpful.
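
P.S. In case it helps to reproduce, the setup looks roughly like the following. This is only a sketch: the member disks, volume sizes and names below are placeholders, not my exact commands.

# 4-drive RAID5 with a 512KB chunk (member disks are placeholders)
mdadm --create /dev/md5 --level=5 --raid-devices=4 --chunk=512 /dev/sd[bcde]
# enlarge the stripe cache as mentioned above
echo 4096 > /sys/block/md5/md/stripe_cache_size

# LVM thin pool on top of the array: 512KB chunk, zeroing disabled (-Zn)
pvcreate /dev/md5
vgcreate vg1 /dev/md5
lvcreate -L 900G -c 512k -Zn -T vg1/pool    # pool size is a placeholder
lvcreate -V 500G -T vg1/pool -n lv1         # virtual size is a placeholder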
Best Regards,
- Wayne.Chou