On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xxxxxxxx> wrote:
> I think there are a lot of fields that can't be discarded just by
> checking for 0s, like utime_t and epoch_t. A simple way is to compare
> against the previous pg_info_t.

Oh, sorry -- we could use a fresh pg_info_t as the baseline instead. But
the approach below still applies either way.

> BTW, I want to mention that pg_info_t encoding accounts for 6.05% of
> CPU time in the pg thread (thread level, not process level).
>
> It looks like we have three optimizations from Mark and Piotr:
>
> 1. Restructure pg_info_t so that the frequently updated fields sit
>    together.
> 2. Shrink some fields to fewer bits.
> 3. Use SIMD to speed up the difference comparison of the low-frequency
>    fields (optional).
>
> The Intel SSE4.2 pcmpistrm instruction can compare 16 bytes (128 bits)
> per iteration, and pg_info_t is over 700 bytes:
>
> #include <smmintrin.h>  // SSE4.2 intrinsics
>
> // Return a pointer to the first byte at which p and p2 differ.
> // Uses unaligned loads (the buffers need not be 16-byte aligned) and
> // EQUAL_EACH for a byte-wise comparison, advancing both pointers.
> // Note: pcmpistrm treats its operands as implicit-length strings, so
> // an embedded NUL ends the comparison early; pcmpestrm with explicit
> // lengths would be safer for arbitrary binary data.
> inline const char *skip_same_128_bytes_SIMD(const char *p,
>                                             const char *p2) {
>     for (;; p += 16, p2 += 16) {
>         const __m128i w = _mm_loadu_si128((const __m128i *)p2);
>         const __m128i s = _mm_loadu_si128((const __m128i *)p);
>         const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
>             _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
>             _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
>
>         if (r != 0)  // some bytes are not equal
>             return p + __builtin_ffs(r) - 1;
>     }
> }

> On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek <branch@xxxxxxxxxxxxxxxx> wrote:
>> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>>
>>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>>> >On Wed, 24 Aug 2016, Haomai Wang wrote:
>>> >>On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xxxxxxxx> wrote:
>>> >>>On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>> >>>>This is huge. It takes the pg_info_t str from 306 bytes to 847
>>> >>>>bytes, and this _info omap key is rewritten on *every* IO.
>>> >>>>
>>> >>>>We could shrink this down significantly with varint and/or delta
>>> >>>>encoding, since a huge portion of it is just a bunch of uint64_t
>>> >>>>counters.
>>> >>>>This will probably cost some CPU time, but OTOH it will also
>>> >>>>shrink our metadata down a fair bit, which will pay off later.
>>> >>>>
>>> >>>>Anybody want to tackle this?
>>> >>>
>>> >>>What about separating "object_stat_collection_t stats" from
>>> >>>pg_stat_t? pg info should be unchanged most of the time, so we
>>> >>>could update only the object-related stats. This might cut the
>>> >>>bytes written roughly in half.
>>> >
>>> >I don't think this will work, since every op changes last_update in
>>> >pg_info_t *and* the stats (write op count, total bytes, objects,
>>> >etc.).
>>> >
>>> >>Or we could store only the increments and keep the full structure
>>> >>in memory (this may reduce each write to 20-30 bytes). Periodically
>>> >>we would store the full structure (only hundreds of bytes)....
>>> >
>>> >A delta is probably very compressible (only a few fields in the
>>> >stats struct change). The question is how fast we can make it in
>>> >CPU time. Probably a simple delta (which will be almost all 0's) and
>>> >a trivial run-length-encoding scheme that just gets rid of the 0's
>>> >would do well enough...
>>>
>>> Do we have any rough idea of how many consecutive 0s, and how often,
>>> we end up with in the current encoding?
>>
>> Or how high these counters get? We could try transposing the matrix
>> made of those counters. At least the two most significant bytes of
>> most of those counters are mostly zeros, and after transposing, simple
>> RLE would be feasible. In any case, I'm not sure if *all* of these
>> fields need to be uint64_t.
>>
>> --
>> Piotr Dałek
>> branch@xxxxxxxxxxxxxxxx
>> http://blog.predictor.org.pl