Re: pg_stat_t is 500+ bytes


On 08/26/2016 12:36 AM, Mark Nelson wrote:
On 08/25/2016 10:11 PM, Mark Nelson wrote:


On 08/25/2016 10:04 PM, Mark Nelson wrote:
On 08/25/2016 08:28 PM, Somnath Roy wrote:
Sage,
I can see for statistics we are storing 40 bytes for each IO..

Merge( Prefix = T key = 'bluestore_statfs' Value size = 40)

Should we really store it on each IO ?

1. IMO, gathering/storing stats should be configurable

2. If enabled, I think we can keep the counters in memory and
persist them to the db at some configurable interval?
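
A rough sketch of the "counters in memory, persist at an interval" idea, in case it helps the discussion. This is not BlueStore code: the class name BatchedStats, the persist callback, and the millisecond clock passed in by the caller are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical accumulator: sums stat deltas in memory and hands the
// total to a caller-supplied persist callback at most once per interval.
class BatchedStats {
public:
    BatchedStats(uint64_t interval_ms, std::function<void(int64_t)> persist)
        : interval_ms_(interval_ms), persist_(std::move(persist)) {
        assert(persist_);  // a callable must be supplied
    }

    // Record a delta; flush if the interval has elapsed since the last flush.
    void add(int64_t delta, uint64_t now_ms) {
        pending_ += delta;
        if (now_ms - last_flush_ms_ >= interval_ms_)
            flush(now_ms);
    }

    // Persist whatever has accumulated (if anything) and restart the interval.
    void flush(uint64_t now_ms) {
        if (pending_ != 0) {
            persist_(pending_);
            pending_ = 0;
        }
        last_flush_ms_ = now_ms;
    }

private:
    uint64_t interval_ms_;
    std::function<void(int64_t)> persist_;
    int64_t pending_ = 0;
    uint64_t last_flush_ms_ = 0;
};
```

The obvious trade-off is that anything still pending in memory is lost on a crash unless it can be recomputed some other way.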

BTW, as you mentioned, pg_info is huge, 855 bytes.

Put( Prefix = M key = 0x0000000000000746'._info' Value size = 855)

I spent some time on this today.  In pg_stat_t we have:

eversion_t: count 5 (80 bytes)
utime_t: count 13 (92 bytes)
object_stat_collection_t: count 1 (256 bytes)
pg_t: count 1 (20 bytes)
up vector: count 1 (16 bytes for 3x rep)
acting_vector: count 1 (16 bytes for 3x rep)
other junk: 74 bytes

Some of these will probably take varint encoding pretty well, while for
others (like the nsec field in utime_t) I suspect it won't help much.  I'm
going through now to get a coarse grain look.  I'm not thrilled with the
varint encoding overhead, but it does appear to be effective in this
case.  Right now in a test of master, pg_info_t is consistently 847
bytes on my single-osd test setup.  Switching object_stat_sum_t fields
to varint encoding yields an average size of 630.562 with almost all
sizes either 630 or 631.

256 - (847 - 631) = 40 bytes (vs 256 originally!)

There are 34 fields in pg_info_t, so almost all of them are encoding
into a 1 byte value in this test.  It's possible that lowz encoding
might reduce this further.

oops, there are 34 fields in object_stat_collection_t rather.
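
For anyone following along who hasn't looked at varint encoding, here is a minimal LEB128-style sketch of why small counters shrink to one byte each. It is only an illustration and not Ceph's actual encoder; put_varint, get_varint and encoded_size are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append v as a LEB128-style varint: 7 payload bits per byte, with the
// high bit set on every byte except the last one.
inline void put_varint(uint64_t v, std::vector<uint8_t>& out) {
    while (v >= 0x80) {
        out.push_back(static_cast<uint8_t>((v & 0x7f) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));
}

// Decode one varint starting at in[pos] and advance pos past it.
inline uint64_t get_varint(const std::vector<uint8_t>& in, size_t& pos) {
    uint64_t v = 0;
    int shift = 0;
    while (true) {
        assert(pos < in.size() && shift < 64);
        const uint8_t b = in[pos++];
        v |= static_cast<uint64_t>(b & 0x7f) << shift;
        if (!(b & 0x80))
            return v;
        shift += 7;
    }
}

// Total encoded size of a set of counters under this scheme.
inline size_t encoded_size(const std::vector<uint64_t>& counters) {
    std::vector<uint8_t> buf;
    for (uint64_t c : counters)
        put_varint(c, buf);
    return buf.size();
}
```

Values below 128 take a single byte, so 34 small counters encode to 34 bytes instead of 272. The per-byte loop and branch is also where the extra CPU cost comes from.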

Ok, last update until monday since I'm off tomorrow.  I went through and
changed the eversion_t, utime_t, pg_t, and "other junk" to use varint
encoding as well.  This also happened to hit the eversion_t structs in
pg_info_t.  I didn't do the vectors in pg_stat_t or other fields in
pg_stat_t out of laziness, but those would probably shrink pretty nicely.

After the above, the average encode size in my test is now down to
453.244.  I suspect we could pretty reasonably get that down to 300-400
by tweaking things further.  Larger clusters with more OSDs and such
might not shrink as well though.

Note that I didn't bother to preserve backwards compatibility by bumping
the encoding version, which would need to be done (and would be a little
ugly) in any real implementation of this.

Mark

Ok, I had some tests run over the weekend while I was away and it appears that we indeed are going to pay a price for varint encoding. I ran my test suite where different IO tests are executed (read, write, randread, randwrite, mixed, etc) at different IO sizes going from largest to smallest. I tested both my wip-pg_info_t-varint branch and the version of master it was based off of. I knew bluestore small random writes were already CPU constrained, so I tested that as it's the same set of tests I've been using when doing the bufferlist append testing where we saw a dramatic performance improvement.

Though I only ran through a single iteration of the test suite for each branch, it appears that master is faster than wip-pg_info_t-varint for 16k randwrites and smaller. For 4K random writes, wip-pg_info_t-varint appears to be only about 67% the speed of master.

So the takeaway, I guess, is that we can dramatically lower the pg_info_t size with varint encoding, but we're going to chew through some CPU to do it and it's going to hurt if we are already CPU constrained. I suspect rather than using varint as a crutch, we need to think very carefully about the size and necessity of these fields, and perhaps other less CPU intensive ways to pack the data. It may be that reverting from varint encoding in bluestore might be beneficial as well, especially with fast NVMe drives.

Mark




Mark



Thanks & Regards
Somnath


-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx
[mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Haomai Wang
Sent: Thursday, August 25, 2016 5:56 AM
To: Piotr Dałek
Cc: Mark Nelson; Sage Weil; ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: pg_stat_t is 500+ bytes

On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xxxxxxxx> wrote:
I think there are a lot of fields that can't be discarded just by checking
for 0s, like utime_t and epoch_t. A simple way is to compare with the
previous pg_info_t.

Oh, sorry. We could perhaps use a fresh pg_info_t as the reference
instead. But the ideas below could still apply this way.


BTW, I want to mention that pg_info_t encoding takes 6.05% of CPU time in
the pg thread (thread level, not process level).

It looks like we have three optimizations from Mark and Piotr:

1. restructure pg_info_t and group the frequently updated fields together.
2. shrink some fields to fewer bits
3. use SIMD to speed up the difference comparison of low-frequency
fields (optional)

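Intel SSE4.2 string-compare instructions (pcmpistrm/pcmpestrm) can do a
very fast 128-bit (16-byte) comparison, and pg_info_t is above 700 bytes:

#include <nmmintrin.h>  // SSE4.2 intrinsics

// Returns a pointer to the first byte of p that differs from p2.
// Assumes both buffers are 16-byte aligned and differ somewhere;
// there is no length check.
inline const char *skip_same_128_bytes_SIMD(const char* p, const char* p2) {
    for (;; p += 16, p2 += 16) {
        const __m128i w = _mm_load_si128((const __m128i *)p2);
        const __m128i s = _mm_load_si128((const __m128i *)p);
        // Explicit-length form so zero bytes don't end the compare early;
        // EQUAL_EACH compares byte i of w with byte i of s.
        const unsigned r = _mm_cvtsi128_si32(_mm_cmpestrm(w, 16, s, 16,
            _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
            _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));

        if (r != 0) // at least one byte differs
            return p + __builtin_ffs(r) - 1;
    }
}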

On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek
<branch@xxxxxxxxxxxxxxxx> wrote:
On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:


On 08/24/2016 11:09 AM, Sage Weil wrote:
On Wed, 24 Aug 2016, Haomai Wang wrote:
On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xxxxxxxx>
wrote:
On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@xxxxxxxxxx>
wrote:
This is huge.  It takes the pg_info_t str from 306 bytes to 847
bytes, and this _info omap key is rewritten on *every* IO.

We could shrink this down significantly with varint and/or delta
encoding since a huge portion of it is just a bunch of uint64_t
counters.  This will probably cost some CPU time, but OTOH it'll
also shrink our metadata down a fair bit too which will pay off
later.

Anybody want to tackle this?

what about separating "object_stat_collection_t stats" from
pg_stat_t?
pg info should be unchanged most of the time, so we could update only
the object-related stats. This may cut the size roughly in half.

I don't think this will work, since every op changes last_update in
pg_info_t *and* the stats (write op count, total bytes, objects,
etc.).

Or we store only the incremental values and keep the full structure in
memory (this may reduce it to 20-30 bytes). Periodically we store the
full structure (only hundreds of bytes)....

A delta is probably very compressible (only a few fields in the
stats struct change).  The question is how fast can we make it in
CPU time.
Probably a simple delta (which will be almost all 0's) and a
trivial run-length-encoding scheme that just gets rid of the 0's
would do well enough...
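
To make the delta-plus-RLE idea concrete, here is a minimal sketch that treats the stats as a flat array of uint64_t counters. That flat layout is an assumption (the real struct is not a plain array), and DeltaRun, encode_delta and apply_delta are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One entry: how many unchanged fields to skip, then the delta to add
// to the next field.
using DeltaRun = std::pair<uint32_t, uint64_t>;

// Encode cur relative to prev, dropping runs of zero deltas.
inline std::vector<DeltaRun> encode_delta(const std::vector<uint64_t>& prev,
                                          const std::vector<uint64_t>& cur) {
    assert(prev.size() == cur.size());
    std::vector<DeltaRun> out;
    uint32_t zeros = 0;
    for (size_t i = 0; i < cur.size(); ++i) {
        const uint64_t d = cur[i] - prev[i];  // unsigned wraparound covers decreases
        if (d == 0) {
            ++zeros;
            continue;
        }
        out.emplace_back(zeros, d);
        zeros = 0;
    }
    return out;  // trailing unchanged fields are implicit
}

// Apply an encoded delta to the previous state, in place.
inline void apply_delta(std::vector<uint64_t>& state,
                        const std::vector<DeltaRun>& runs) {
    size_t i = 0;
    for (const DeltaRun& r : runs) {
        i += r.first;
        assert(i < state.size());
        state[i++] += r.second;
    }
}
```

If only two or three fields change per op, the output is two or three entries regardless of how many counters the struct holds. The skip counts and deltas could themselves be packed further.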

Do we have any rough idea of how many/often consecutive 0s we end up
with in the current encoding?

Or how high these counters get? We could try transposing the matrix
made of those counters. At least the two most significant bytes in
most of those counters are mostly zeros, and after transposing,
simple RLE would be feasible. In any case, I'm not sure if *all* of
these fields need to be uint64_t.
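
A small sketch of the transposition idea, again assuming the counters can be viewed as a flat array of uint64_t (an assumption, not how the struct is actually laid out). transpose_bytes and rle are hypothetical helpers, and the RLE here is the most naive (run length, value) pair scheme.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Lay out N counters as 8 byte planes: all least-significant bytes
// first, then the next byte of every counter, and so on up to the
// most-significant bytes.
inline std::vector<uint8_t> transpose_bytes(const std::vector<uint64_t>& counters) {
    const size_t n = counters.size();
    std::vector<uint8_t> out(n * 8);
    for (size_t i = 0; i < n; ++i)
        for (size_t b = 0; b < 8; ++b)
            out[b * n + i] = static_cast<uint8_t>(counters[i] >> (8 * b));
    return out;
}

// Trivial RLE: emit (run length 1..255, byte value) pairs.
inline std::vector<uint8_t> rle(const std::vector<uint8_t>& in) {
    std::vector<uint8_t> out;
    for (size_t i = 0; i < in.size();) {
        size_t j = i;
        while (j < in.size() && in[j] == in[i] && j - i < 255)
            ++j;
        out.push_back(static_cast<uint8_t>(j - i));
        out.push_back(in[i]);
        i = j;
    }
    return out;
}
```

With small counters the upper byte planes are all zero and collapse into a handful of runs, while the low plane stays roughly as large as before (or larger with this naive pair format).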

--
Piotr Dałek
branch@xxxxxxxxxxxxxxxx
http://blog.predictor.org.pl
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel"
in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo
info at  http://vger.kernel.org/majordomo-info.html



