Hello,

I am using 4 drives to build a RAID5 array and create a thin volume on top of it. To get better performance, I use the '-Zn' option of 'lvcreate' so that the thin pool assumes all blocks are already zeroed. The chunk size of both the RAID5 array and the thin pool is 512KB, and stripe_cache_size=4096 on the RAID5. Here is the performance I get when writing to the RAID5 device directly and to the thin volume:

dd if=/dev/zero of=/dev/md5 bs=2M count=1000
1000+0 records in
1000+0 records out
2097152000 bytes (2.1 GB) copied, 6.02630 seconds, 348 MB/s

dd if=/dev/zero of=/dev/mapper/vg1-lv1 bs=2M count=1000
1000+0 records in
1000+0 records out
2097152000 bytes (2.1 GB) copied, 11.58648 seconds, 181 MB/s

To find out what causes the performance to drop so much, I added some traces to the code and got an interesting result. First, the bio size with this dd command is 4KB, so every 128 bios fill up one thin block / RAID chunk in my configuration. With 'pool->pf.zero_new_blocks' set to false, it seems that when a new block is provisioned for a bio, that bio is still put back on the tail of the 'pool->deferred_bios' list rather than being issued immediately. This rearranges the order of the incoming bios. For example, if the bi_sector values of the incoming PAGE_SIZE bios are:

bi_sector: [0, 8, 16, 24, 32, ..., 1024]

then after each of them has been mapped, the order in which they are issued to the lower layer becomes non-sequential:

bi_sector: [8, 16, 24, ..., 136, 144, 152, ..., 1016] + [0, 128, 256, 384, 512, 640, 768, 896, 1024]

As you can see, the bios that triggered provision_block() get rearranged and separated from the other consecutive ones. If the lower-layer device cannot merge them back, this may cause read-modify-write or seek latency overhead.

Based on this observation, I made a rough patch against kernel 3.6 to maintain the sequential order of bios when 'pool->pf.zero_new_blocks' = false:

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index b0a5ed9..76cda40 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1321,6 +1321,7 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
 {
 	struct pool *pool = tc->pool;
 	struct new_mapping *m = get_next_mapping(pool);
+	int r;
 
 	INIT_LIST_HEAD(&m->list);
 	m->quiesced = 1;
@@ -1337,9 +1338,20 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
 	 * zeroing pre-existing data, we can issue the bio immediately.
 	 * Otherwise we use kcopyd to zero the data first.
 	 */
-	if (!pool->pf.zero_new_blocks)
-		process_prepared_mapping(m);
-
+	if (!pool->pf.zero_new_blocks) {
+		r = dm_thin_insert_block(tc->td, m->virt_block, m->data_block, 0);
+		if (r) {
+			DMERR("schedule_zero() failed");
+			cell_error(m->cell);
+		}
+		else {
+			inc_all_io_entry(pool, bio);
+			cell_defer_except(tc, cell);
+			remap_and_issue(tc, bio, data_block);
+		}
+		list_del(&m->list);
+		mempool_free(m, tc->pool->mapping_pool);
+	}
 	else if (io_overwrites_block(pool, bio)) {
 		struct endio_hook *h = dm_get_mapinfo(bio)->ptr;
 		h->overwrite_mapping = m;

With the patch the performance improves as well:

dd if=/dev/zero of=/dev/mapper/vg1-lv1 bs=2M count=1000
1000+0 records in
1000+0 records out
2097152000 bytes (2.1 GB) copied, 6.16819 seconds, 340 MB/s

Since my thin pool is set up with pf->zero_new_blocks = false, I think it is OK to issue a bio immediately once its mapping is known, rather than putting it back on pool->deferred_bios; this way the sequential order is maintained. However, I wonder whether I am missing some cases in this rough patch. Any suggestions would be helpful.
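
P.S. In case it helps to reproduce, the setup looks roughly like the following. This is only a sketch: the member disks, volume sizes and names below are placeholders, not my exact commands.

# 4-drive RAID5 with a 512KB chunk (member disks are placeholders)
mdadm --create /dev/md5 --level=5 --raid-devices=4 --chunk=512 /dev/sd[bcde]
# enlarge the stripe cache as mentioned above
echo 4096 > /sys/block/md5/md/stripe_cache_size

# LVM thin pool on top of the array: 512KB chunk, zeroing disabled (-Zn)
pvcreate /dev/md5
vgcreate vg1 /dev/md5
lvcreate -L 900G -c 512k -Zn -T vg1/pool    # pool size is a placeholder
lvcreate -V 500G -T vg1/pool -n lv1         # virtual size is a placeholder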
Best Regards,
- Wayne.Chou