RE: Bluestore performance bottleneck

Yeah, makes sense; I missed it. I will remove the extents field and see how much we can save.
But it is still unclear to me why a 4K length/offset write has started touching two shards now that the shards are smaller?

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@xxxxxxxxxx] 
Sent: Wednesday, December 21, 2016 4:21 PM
To: Somnath Roy
Cc: ceph-devel
Subject: RE: Bluestore performance bottleneck

On Thu, 22 Dec 2016, Somnath Roy wrote:
> Sage,
> By reducing the shard size I am able to improve BlueStore + RocksDB performance by 80% for a 60G image. I will do a detailed analysis on bigger images.
> 
> Here is what I changed to reduce the decode_some() overhead. It now loops 5 times instead of the default 33.
> 
> bluestore_extent_map_shard_max_size = 50 
> bluestore_extent_map_shard_target_size = 45
> 
> Fio output :
> -------------
>         Default:
>         ----------
> 
>         rbd_iodepth32: (groupid=0, jobs=5): err= 0: pid=34420: Wed Dec 21 17:30:50 2016
>   write: io=114208MB, bw=63615KB/s, iops=15903, runt=1838384msec
>     slat (usec): min=3, max=3878, avg=15.63, stdev= 8.78
>     clat (usec): min=618, max=176057, avg=20092.56, stdev=21017.34
>      lat (usec): min=642, max=176063, avg=20108.20, stdev=21017.29
>     clat percentiles (usec):
>      |  1.00th=[ 1416],  5.00th=[ 2928], 10.00th=[ 4320], 20.00th=[ 6112],
>      | 30.00th=[ 7648], 40.00th=[ 9152], 50.00th=[10944], 60.00th=[13760],
>      | 70.00th=[18816], 80.00th=[32384], 90.00th=[55552], 95.00th=[68096],
>      | 99.00th=[87552], 99.50th=[94720], 99.90th=[121344], 99.95th=[129536],
>      | 99.99th=[142336]
> 
>              Small shards:
>               ----------------
> 
> rbd_iodepth32: (groupid=0, jobs=5): err= 0: pid=34767: Wed Dec 21 18:50:30 2016
>   write: io=186917MB, bw=110585KB/s, iops=27646, runt=1730819msec
>     slat (usec): min=2, max=1447, avg=14.95, stdev= 7.64
>     clat (usec): min=531, max=541140, avg=11547.88, stdev=8131.72
>      lat (usec): min=544, max=541156, avg=11562.83, stdev=8131.73
>     clat percentiles (msec):
>      |  1.00th=[    3],  5.00th=[    4], 10.00th=[    5], 20.00th=[    6],
>      | 30.00th=[    7], 40.00th=[    9], 50.00th=[   10], 60.00th=[   12],
>      | 70.00th=[   14], 80.00th=[   17], 90.00th=[   22], 95.00th=[   27],
>      | 99.00th=[   37], 99.50th=[   40], 99.90th=[   51], 99.95th=[   75],
>      | 99.99th=[  208]
> 
> 
> 
> *But* here is the overhead I am seeing which I don't quite understand. The per-IO metadata overhead for the onode/shards is ~30% more with the smaller shard size.
> 
> Default:
> ----------
> 
> 2016-12-21 17:22:31.693503 7f18aa7c1700 30 submit_transaction Rocksdb transaction:
> Put( Prefix = M key = 0x000000000000048a'.0000000009.00000000000000019093' Value size = 182)
> Put( Prefix = M key = 0x000000000000048a'._fastinfo' Value size = 186)
> Put( Prefix = O key = 0x7f8000000000000001dc3f23'`!rbd_data.10046b8b4567.00000000000024b1!='0xfffffffffffffffeffffffffffffffff6f0014c000'x' Value size = 669)
> Put( Prefix = O key = 0x7f8000000000000001dc3f23'`!rbd_data.10046b8b4567.00000000000024b1!='0xfffffffffffffffeffffffffffffffff'o' Value size = 462)
> Merge( Prefix = b key = 0x000000069ae00000 Value size = 16)
> Merge( Prefix = b key = 0x0000001330d00000 Value size = 16)
> 
> 
> Smaller shard:
> -----------------
> 
> 2016-12-21 18:52:18.564423 7f3e5d167700 30 submit_transaction Rocksdb transaction:
> Put( Prefix = M key = 0x0000000000000691'.0000000009.00000000000000057248' Value size = 182)
> Put( Prefix = M key = 0x0000000000000691'._fastinfo' Value size = 186)
> Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff6f00195000'x' Value size = 45)
> Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff6f0019a000'x' Value size = 45)
> Put( Prefix = O key = 0x7f8000000000000001427b9fe9217262'd_data.10046b8b4567.000000000000348f!='0xfffffffffffffffeffffffffffffffff'o' Value size = 1505)
> Merge( Prefix = b key = 0x00000006b9780000 Value size = 16)
> Merge( Prefix = b key = 0x0000000c64f00000 Value size = 16)
> 
> 
> *And* a lot of the time I am seeing two shards written compared to the default. This will be a problem for ZS, though maybe not for RocksDB.
> 
> Initially I thought the blobs were spanning, but that does not seem to be the case. See the log snippet below; it seems the onode itself is bigger now.
> 
> 2016-12-21 18:40:27.734044 7f3e43934700 20 bluestore(/var/lib/ceph/osd/ceph-0)   onode #1:d1bc6a86:::rbd_data.10046b8b4567.00000000000036a2:head# is 1505 (1503 bytes onode + 2 bytes spanning blobs + 0 bytes inline extents)
> 
> Any idea what's going on ?

The onode has a list of the shards.  Since there are more, the onode is bigger.  I wasn't really expecting the shard count to be that high.  The structure is:

  struct shard_info {
    uint32_t offset = 0;  ///< logical offset for start of shard
    uint32_t bytes = 0;   ///< encoded bytes
    uint32_t extents = 0; ///< number of extents in this shard
    DENC(shard_info, v, p) {
      denc_varint(v.offset, p);
      denc_varint(v.bytes, p);
      denc_varint(v.extents, p);
    }
    void dump(Formatter *f) const;
  };
  vector<shard_info> extent_map_shards; ///< extent map shards (if any)

The offset is the important piece.  The byte and extent counts aren't that important... they're mostly there so that a future reshard operation can be more clever (merging or splitting adjacent shards instead of resharding everything).  Well, the bytes field is currently used, but extents is not at all.  We could just drop that field now and add it (or something else) back in later if/when we need it...
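
For a rough sense of the per-shard cost, each shard_info encodes those three varints, so the onode grows by a handful of bytes per shard. A minimal, self-contained sketch (hypothetical shard counts and sizes, plain LEB128 standing in for denc_varint, not actual BlueStore code) of how that adds up:

  #include <cstdint>
  #include <cstdio>

  // Approximate varint length (LEB128), standing in for denc_varint.
  static size_t varint_len(uint64_t v) {
    size_t n = 1;
    while (v >= 0x80) { v >>= 7; ++n; }
    return n;
  }

  int main() {
    // Hypothetical: a 4 MB object with ~1200-byte shards vs ~45-byte shards.
    const struct { const char *name; unsigned shards; uint32_t bytes; } cases[] = {
      { "default-size shards",   4, 1200 },
      { "small shards",        100,   45 },
    };
    for (const auto &c : cases) {
      size_t total = 0;
      for (unsigned i = 0; i < c.shards; ++i) {
        uint32_t offset = i * ((4u << 20) / c.shards);  // shard_info::offset
        total += varint_len(offset)   // offset
               + varint_len(c.bytes)  // bytes
               + varint_len(8);       // extents (made-up count)
      }
      std::printf("%s: %u shards -> ~%zu bytes of shard_info in the onode\n",
                  c.name, c.shards, total);
    }
    return 0;
  }

With shard counts in the tens or hundreds, the shard_info vector alone runs to several hundred bytes, which is roughly in line with the larger 'o' value in your transaction dump.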

sage


>
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Somnath Roy
> Sent: Friday, December 16, 2016 7:23 PM
> To: Sage Weil (sweil@xxxxxxxxxx)
> Cc: 'ceph-devel'
> Subject: RE: Bluestore performance bottleneck
> 
> Sage,
> Some update on this. Without decode_some() inside fault_range() I am able to drive BlueStore + RocksDB to close to ~38K IOPS, compared to ~20K IOPS with decode_some(). I had to disable the data write because I am skipping the decode, but on this device the data write is not a bottleneck; I have seen that enabling or disabling the data write gives similar results. So, on an NVMe device, if we can optimize decode_some() for performance, BlueStore performance should bump up by ~2X.
> I put some prints around decode_some() and it seems it takes ~60-121 microseconds to finish, depending on the number of bytes to decode.
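> 
> For reference, a minimal sketch of that style of measurement (a hypothetical std::chrono helper with a stand-in workload, not the actual prints or BlueStore code):
> 
>   #include <chrono>
>   #include <cstdio>
>   #include <vector>
> 
>   // Time an arbitrary piece of work in microseconds (illustrative only).
>   template <typename Fn>
>   static double time_us(Fn &&fn) {
>     auto t0 = std::chrono::steady_clock::now();
>     fn();
>     auto t1 = std::chrono::steady_clock::now();
>     return std::chrono::duration<double, std::micro>(t1 - t0).count();
>   }
> 
>   int main() {
>     // Stand-in workload; in the real experiment this would wrap the decode call.
>     std::vector<int> v(100000, 1);
>     long sum = 0;
>     double us = time_us([&] { for (int x : v) sum += x; });
>     std::printf("work took %.1f us (sum=%ld)\n", us, sum);
>     return 0;
>   }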
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Somnath Roy
> Sent: Thursday, December 15, 2016 7:30 PM
> To: Sage Weil (sweil@xxxxxxxxxx)
> Cc: ceph-devel
> Subject: Bluestore performance bottleneck
> 
> Sage,
> This morning I was talking about the 2x performance drop for BlueStore without data/db writes for 1G vs 60G volumes, and it turns out decode_some() is the culprit. I am presently drilling down into that function to identify what exactly is causing the issue, but most probably it is the combination of the blob decode and le->blob->get_ref(). I will confirm that soon. If we can fix it, we should be able to considerably bump up end-to-end peak performance with RocksDB/ZS on faster NVMe. On slower devices we will most likely not see any benefit other than saving some CPU cost.
> 
> Thanks & Regards
> Somnath
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



