Doesn't that suggest that there are only 4 shards in the 4K case? That doesn't sound right.

Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

> -----Original Message-----
> From: Somnath Roy
> Sent: Sunday, December 25, 2016 12:30 PM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Mark Nelson <mnelson@xxxxxxxxxx>; Sage Weil <sweil@xxxxxxxxxx>
> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: RE: Bluestore performance bottleneck REVISITED
>
> Yes, if I remember correctly everything fits into inline_bl (no shards) in the 16K min_alloc case, which is why the decode_some() overhead is lower. There are also 4x fewer extents, which reduces the overhead further. I will generate some stats on this and share.
>
> -----Original Message-----
> From: Allen Samuels
> Sent: Sunday, December 25, 2016 12:08 PM
> To: Somnath Roy; Mark Nelson; Sage Weil
> Cc: ceph-devel
> Subject: RE: Bluestore performance bottleneck REVISITED
>
> decode_some() won't care how many shards there are, only how large each shard is. But if the number of shards is small enough for the extent map to be inlined in the onode, that would definitely explain what we're seeing.
>
> Do we know what the shard counts are for 16K min_alloc?
>
> Allen Samuels
> SanDisk | a Western Digital brand
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030 | M: +1 408 780 6416
> allen.samuels@xxxxxxxxxxx
>
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Sunday, December 25, 2016 10:38 AM
> > To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Mark Nelson <mnelson@xxxxxxxxxx>; Sage Weil <sweil@xxxxxxxxxx>
> > Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> > Subject: RE: Bluestore performance bottleneck REVISITED
> >
> > I think one reason 16K is doing better is the much lower decode_some() overhead: with 16K min_alloc there are hardly any shards.
> > Also, what I am seeing (and digging into further) is that the read overhead (onode, shard) during writes now has an impact, and it is even larger while compaction is running. For 16K there will be one read (onode only) vs. 4K min_alloc where there will be two reads (onode and shard).
> >
> > Thanks & Regards
> > Somnath
> >
> > -----Original Message-----
> > From: Allen Samuels
> > Sent: Saturday, December 24, 2016 11:45 PM
> > To: Mark Nelson; Sage Weil
> > Cc: Somnath Roy; ceph-devel
> > Subject: RE: Bluestore performance bottleneck REVISITED
> >
> > (This is the first time I've dug into the details of the RocksDB stats, so it's possible I'm misinterpreting.)
> >
> > Even more interesting.
> >
> > The 4K and 16K runs have the same total amount of compaction traffic: looking at the "Sum" row, the Read(GB) and Write(GB) columns are almost exactly the same for both runs.
> >
> > If this is correct, then the hypothesis that 16K min_alloc is faster because smaller metadata reduces the cost of compaction is simply false.
> >
> > It is true that the static size of the metadata is much larger for the 4K run (8.3 GB vs. 2.5 GB), but that alone doesn't explain the observed results.
> >
> > When you factor in the higher ingest for the 16K case [ingesting the data plus the extra transaction to undo it] (190 GB vs. 91 GB) against the same amount of compaction traffic (~145 GB), we need to start thinking about another explanation for the observed data.
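> >
> > Back-of-envelope, using the Sum rows and DB Stats quoted below and taking compaction write bytes per ingested byte as the metric:
> >
> >   4K : 146.26 GB compaction write / 91.83 GB ingest  ~= 1.59
> >   16K: 145.30 GB compaction write / 190.02 GB ingest ~= 0.76
> >
> > So per byte ingested, the 4K run does roughly twice the compaction work, even though the absolute totals match.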
> >
> > I'm still digging into the numbers, but I notice that the writes per sync are radically different (4.5 writes/sync vs. 31.2 writes/sync); I assume this effectively determines the I/O size. I believe the system is somehow doing a much better job of batching transactions for 16K than for 4K, and that the performance delta we're seeing between them is related to this.
> >
> > Mark --> Can we look at the Linux iostat data for this? In particular, can you look at the average write I/O size and the number of interrupts per front-end I/O operation?
> >
> > Allen Samuels
> > SanDisk | a Western Digital brand
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030 | M: +1 408 780 6416
> > allen.samuels@xxxxxxxxxxx
> >
> > > -----Original Message-----
> > > From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> > > Sent: Friday, December 23, 2016 12:05 PM
> > > To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Sage Weil <sweil@xxxxxxxxxx>
> > > Cc: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> > > Subject: Re: Bluestore performance bottleneck
> > >
> > > On 12/23/2016 01:33 PM, Allen Samuels wrote:
> > > > The two data points you mention (4K / 16K min_alloc) yield interesting numbers. For 4K you're seeing 22.5K IOPS at 1300% CPU, or 1.7K IOPS/core; yet for 16K you're seeing 25K IOPS at 1000% CPU, or 2.5K IOPS/core. Yet we know that in the main I/O path 16K is doing more work (since it's double-writing the data), while still yielding better CPU usage overall. We do know there will be less compaction in the 16K case, which will save SOME CPU, but I wouldn't have thought the savings would be substantial, since the data is all processed sequentially in rather large blocks (i.e., the CPU cost of compaction seems to be larger than expected).
> > > >
> > > > Do we know that you're actually capturing a few compaction cycles with the 16K test? If not, that might explain some of the difference.
> > >
> > > I believe so. Here is a comparison of the 25/50 tests, for example. Interesting that there's so much more data compacted in the 4K min_alloc tests.
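> > >
> > > Back-of-envelope from the WAL counters in the DB Stats below, the average amount of data per WAL sync is:
> > >
> > >   4K : 91.83 GB / 4111K syncs  ~= 23 KB/sync   (4.52 writes/sync)
> > >   16K: 190.02 GB / 1032K syncs ~= 193 KB/sync  (31.20 writes/sync)
> > >
> > > which is consistent with much better transaction batching in the 16K run.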
> > >
> > > 4k min_alloc:
> > >
> > > > 2016-12-22 19:33:49.722025 7fb188f21700  3 rocksdb: ------- DUMPING STATS -------
> > > > 2016-12-22 19:33:49.722029 7fb188f21700  3 rocksdb:
> > > > ** Compaction Stats [default] **
> > > > Level  Files  Size(MB)  Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
> > > > --------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > > >   L0     8/0   1440.37    2.0       0.0     0.0       0.0       63.1      63.1        0.0    0.0       0.0     183.8        352        400     0.880      0        0
> > > >   L1    15/0    881.27    3.4      82.2    61.7      20.5       30.8      10.4        0.0    0.5     128.5      48.2        655         38    17.235   462M      25M
> > > >   L2    95/0   5538.28    2.2      55.4     9.0      46.4       51.8       5.3        0.5    5.8      54.3      50.7       1045        136     7.683  1238M      33M
> > > >   L3     7/0    458.47    0.0       0.5     0.4       0.1        0.5       0.4        0.0    1.1      59.5      59.5          9          7     1.259    12M        1
> > > >  Sum   125/0   8318.40    0.0     138.1    71.2      67.0      146.3      79.3        0.5    2.3      68.6      72.7       2061        581     3.547  1712M      58M
> > > >  Int     0/0      0.00    0.0      66.2    38.5      27.6       65.6      38.0        0.0    1.9      87.5      86.8        774        257     3.013   556M      39M
> > > > Uptime(secs): 1953.4 total, 1953.4 interval
> > > > Flush(GB): cumulative 63.137, interval 35.154
> > > > AddFile(GB): cumulative 0.000, interval 0.000
> > > > AddFile(Total Files): cumulative 0, interval 0
> > > > AddFile(L0 Files): cumulative 0, interval 0
> > > > AddFile(Keys): cumulative 0, interval 0
> > > > Cumulative compaction: 146.26 GB write, 76.67 MB/s write, 138.13 GB read, 72.41 MB/s read, 2060.5 seconds
> > > > Interval compaction: 65.64 GB write, 34.41 MB/s write, 66.16 GB read, 34.68 MB/s read, 774.4 seconds
> > > > Stalls(count): 11 level0_slowdown, 11 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 7 total count
> > > >
> > > > ** DB Stats **
> > > > Uptime(secs): 1953.4 total, 603.3 interval
> > > > Cumulative writes: 18M writes, 251M keys, 18M commit groups, 1.0 writes per commit group, ingest: 91.83 GB, 48.14 MB/s
> > > > Cumulative WAL: 18M writes, 4111K syncs, 4.52 writes per sync, written: 91.83 GB, 48.14 MB/s
> > > > Cumulative stall: 00:01:7.797 H:M:S, 3.5 percent
> > > > Interval writes: 10M writes, 69M keys, 10M commit groups, 1.0 writes per commit group, ingest: 48121.62 MB, 79.77 MB/s
> > > > Interval WAL: 10M writes, 2170K syncs, 4.99 writes per sync, written: 46.99 GB, 79.77 MB/s
> > > > Interval stall: 00:00:20.024 H:M:S, 3.3 percent
> > >
> > > 16k min_alloc:
> > >
> > > > 2016-12-23 10:20:03.926747 7fef2993d700  3 rocksdb: ------- DUMPING STATS -------
> > > > 2016-12-23 10:20:03.926754 7fef2993d700  3 rocksdb:
> > > > ** Compaction Stats [default] **
> > > > Level  Files  Size(MB)  Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
> > > > --------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > > >   L0     3/0    186.38    0.8       0.0     0.0       0.0       49.4      49.4        0.0    0.0       0.0     179.0        283        805     0.351      0        0
> > > >   L1    13/0    336.75    1.4      80.5    49.2      31.3       41.4      10.2        0.0    0.8     139.7      71.9        590        135     4.371   399M      53M
> > > >   L2    33/0   1933.96    0.8      62.4     9.4      53.0       54.5       1.4        0.4    5.8      72.0      62.8        887        145     6.120  1039M      70M
> > > >  Sum    49/0   2457.09    0.0     142.9    58.6      84.3      145.3      61.0        0.4    2.9      83.1      84.5       1760       1085     1.622  1438M     123M
> > > >  Int     0/0      0.00    0.0      61.6    25.1      36.5       61.5      25.0        0.0    2.9      87.6      87.4        720        466     1.545   586M      56M
> > > > Uptime(secs): 1951.3 total, 1951.3 interval
> > > > Flush(GB): cumulative 49.411, interval 21.131
> > > > AddFile(GB): cumulative 0.000, interval 0.000
> > > > AddFile(Total Files): cumulative 0, interval 0
> > > > AddFile(L0 Files): cumulative 0, interval 0
> > > > AddFile(Keys): cumulative 0, interval 0
> > > > Cumulative compaction: 145.30 GB write, 76.25 MB/s write, 142.90 GB read, 74.99 MB/s read, 1760.2 seconds
> > > > Interval compaction: 61.47 GB write, 32.26 MB/s write, 61.59 GB read, 32.32 MB/s read, 720.0 seconds
> > > > Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
> > > >
> > > > ** DB Stats **
> > > > Uptime(secs): 1951.3 total, 604.4 interval
> > > > Cumulative writes: 32M writes, 260M keys, 32M commit groups, 1.0 writes per commit group, ingest: 190.02 GB, 99.72 MB/s
> > > > Cumulative WAL: 32M writes, 1032K syncs, 31.20 writes per sync, written: 190.02 GB, 99.72 MB/s
> > > > Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
> > > > Interval writes: 14M writes, 99M keys, 14M commit groups, 1.0 writes per commit group, ingest: 84136.97 MB, 139.20 MB/s
> > > > Interval WAL: 14M writes, 268K syncs, 52.14 writes per sync, written: 82.17 GB, 139.20 MB/s
> > > > Interval stall: 00:00:0.000 H:M:S, 0.0 percent
> > >
> > > Mark
> > >
> > > > Allen Samuels
> > > > SanDisk | a Western Digital brand
> > > > 2880 Junction Avenue, San Jose, CA 95134
> > > > T: +1 408 801 7030 | M: +1 408 780 6416
> > > > allen.samuels@xxxxxxxxxxx
> > > >
> > > >> -----Original Message-----
> > > >> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> > > >> Sent: Friday, December 23, 2016 9:09 AM
> > > >> To: Sage Weil <sweil@xxxxxxxxxx>
> > > >> Cc: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>; ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> > > >> Subject: Re: Bluestore performance bottleneck
> > > >>
> > > >>>> Try this?
> > > >>>> https://github.com/ceph/ceph/pull/12634
> > > >>>
> > > >>> Looks like this is most likely reducing memory usage and increasing performance quite a bit with the smaller shard target/max values. With 25/50 I'm seeing more like 2.6 GB RSS memory usage and around 13K IOPS typically, with some (likely rocksdb) stalls. I'll run through the tests again.
> > > >>>
> > > >>> Mark
> > > >>
> > > >> OK, I ran through tests with both 4K and 16K min_alloc/max_alloc/blob sizes using master+12629+12634:
> > > >>
> > > >> https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZQzdRU3B1SGZUbDQ
> > > >>
> > > >> Performance is up in all tests and memory consumption is down (especially in the smaller target/max tests). It looks like 100/200 is probably the current optimal configuration on my test setup. 4K min_alloc tests hover around 22.5K IOPS with ~1300% CPU usage, and 16K min_alloc tests hover around 25K IOPS with ~1000% CPU usage.
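> > > >>
> > > >> For reference, a sketch of what such a run might look like in ceph.conf terms. The option names are my assumption from the BlueStore extent-map sharding work, and the shard values are in whatever units the PR uses for target/max:
> > > >>
> > > >>   [osd]
> > > >>   bluestore_min_alloc_size = 16384              # the 16K min_alloc run
> > > >>   bluestore_extent_map_shard_target_size = 100  # the "100" in 100/200
> > > >>   bluestore_extent_map_shard_max_size = 200     # the "200" in 100/200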
> > > >>
> > > >> I think it will be worth spending some time looking at locking in the bitmap allocator, given the perf traces. Beyond that, I'm seeing rocksdb show up quite a bit in the top CPU-consuming functions now, especially CRC32.
> > > >>
> > > >> Mark
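A closing note on the CRC32 observation: a minimal sketch, assuming the stock RocksDB BlockBasedTableOptions API rather than BlueStore's actual option plumbing, of switching the SST block checksum from CRC32c to xxHash to gauge how much of that CPU is checksum cost:

    #include <rocksdb/options.h>
    #include <rocksdb/table.h>

    // Build Options whose table factory writes xxHash block checksums
    // instead of the default CRC32c.
    rocksdb::Options make_xxhash_options() {
      rocksdb::Options options;
      rocksdb::BlockBasedTableOptions table_options;
      // Default is rocksdb::kCRC32c; xxHash is the cheaper alternative.
      table_options.checksum = rocksdb::kxxHash;
      options.table_factory.reset(
          rocksdb::NewBlockBasedTableFactory(table_options));
      return options;
    }

Only newly written SST files pick up the new checksum type, so the effect shows up gradually as compaction rewrites data; WAL records keep their own CRC32c regardless, so this isolates the table read/write portion of the checksum cost.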