Re: pg_stat_t is 500+ bytes

I think a lot of fields, such as utime_t and epoch_t, can't be dropped
just by checking for 0s. A simple way is to compare against the previous
pg_info_t.
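
A minimal sketch of the compare-with-previous idea, assuming a hypothetical
three-field struct (info_sketch is made up for illustration and is not the
real pg_info_t layout). It emits a bitmask of changed fields plus only the
changed values, so nonzero fields such as timestamps and epochs cost nothing
while they stay the same:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a few pg_info_t-like fields; not the real layout.
struct info_sketch {
    uint64_t last_update;  // changes on every op
    uint64_t last_epoch;   // epoch-like: nonzero, rarely changes
    uint64_t mtime_sec;    // utime-like: nonzero, rarely changes
};

struct changed_fields {
    uint8_t mask = 0;              // bit i set => field i differs from prev
    std::vector<uint64_t> values;  // new values of changed fields, in field order
};

// Compare cur against prev field by field and keep only what changed.
inline changed_fields diff_against_previous(const info_sketch& prev,
                                            const info_sketch& cur) {
    const uint64_t p[] = {prev.last_update, prev.last_epoch, prev.mtime_sec};
    const uint64_t c[] = {cur.last_update, cur.last_epoch, cur.mtime_sec};
    changed_fields out;
    for (unsigned i = 0; i < 3; ++i) {
        if (p[i] != c[i]) {
            out.mask |= static_cast<uint8_t>(1u << i);
            out.values.push_back(c[i]);
        }
    }
    return out;
}

// Rebuild the current struct from prev plus the recorded changes.
inline info_sketch apply_changed(const info_sketch& prev,
                                 const changed_fields& d) {
    uint64_t f[] = {prev.last_update, prev.last_epoch, prev.mtime_sec};
    size_t next = 0;
    for (unsigned i = 0; i < 3; ++i) {
        if (d.mask & (1u << i)) {
            assert(next < d.values.size());
            f[i] = d.values[next++];
        }
    }
    return info_sketch{f[0], f[1], f[2]};
}
```

With only last_update changing, the result is a one-bit mask and a single
value instead of the whole struct.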

BTW, I want to mention that pg_info_t encoding takes 6.05% of CPU time in
the pg thread (thread level, not process level).

It looks like we have three optimizations from Mark and Piotr:

1. Restructure pg_info_t so the frequently updated fields sit together.
2. Shrink some fields to narrower types.
3. Use SIMD to speed up the difference comparison of the infrequently
   updated fields (optional).

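Intel's SSE4.2 string-compare instructions (pcmpistrm/pcmpestrm) can do a
very good 128-bit (16-byte) comparison, and pg_info_t is above 700 bytes.
The explicit-length form is used here because pcmpistrm treats a zero byte
as a terminator:

#include <nmmintrin.h>  // SSE4.2 string-compare intrinsics

// Returns a pointer to the first byte of p that differs from p2.
// The loop is unbounded: the caller must guarantee a difference exists.
inline const char *skip_same_128_bytes_SIMD(const char* p, const char* p2) {
    for (;; p += 16, p2 += 16) {
        // Unaligned loads, since the buffers may not be 16-byte aligned.
        const __m128i w = _mm_loadu_si128((const __m128i *)p2);
        const __m128i s = _mm_loadu_si128((const __m128i *)p);
        // Byte-by-byte equality, negated: bit k is set if byte k differs.
        const unsigned r = _mm_cvtsi128_si32(_mm_cmpestrm(w, 16, s, 16,
            _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
            _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));

        if (r != 0) // at least one byte differs
            return p + __builtin_ffs(r) - 1;
    }
}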
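
A bounded variant of this comparison might look like the sketch below. It is
only an illustration under a few assumptions: it swaps in the plain SSE2
byte-compare and movemask intrinsics instead of the SSE4.2 string
instructions (pcmpistrm treats a zero byte as a terminator, which is awkward
for binary structs), it takes an explicit length, and it falls back to a
scalar loop when SSE2 is not available. The name first_diff_offset is made up.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Return the offset of the first byte where a and b differ, or len if the
// two buffers are identical over len bytes.
inline size_t first_diff_offset(const unsigned char* a,
                                const unsigned char* b, size_t len) {
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= len; i += 16) {
        // Unaligned loads: the buffers need not be 16-byte aligned.
        const __m128i va =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // Bit k of eq is set when byte k of the two chunks is equal.
        const unsigned eq =
            static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)));
        if (eq != 0xFFFFu)
            return i + static_cast<size_t>(__builtin_ctz(~eq & 0xFFFFu));
    }
#endif
    // Scalar tail (and full fallback when SSE2 is unavailable).
    for (; i < len; ++i)
        if (a[i] != b[i])
            return i;
    return len;
}
```

Calling it on the previous and current encoded buffers gives the first
changed byte, which could then bound the region that needs re-encoding.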

On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek <branch@xxxxxxxxxxxxxxxx> wrote:
> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>
>>
>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>> >On Wed, 24 Aug 2016, Haomai Wang wrote:
>> >>On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xxxxxxxx> wrote:
>> >>>On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> >>>>This is huge.  It takes the pg_info_t str from 306 bytes to 847 bytes, and
>> >>>>this _info omap key is rewritten on *every* IO.
>> >>>>
>> >>>>We could shrink this down significantly with varint and/or delta encoding
>> >>>>since a huge portion of it is just a bunch of uint64_t counters.  This
>> >>>>will probably cost some CPU time, but OTOH it'll also shrink our metadata
>> >>>>down a fair bit too which will pay off later.
>> >>>>
>> >>>>Anybody want to tackle this?
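
For reference, the varint idea can be sketched as below. This is a generic
little-endian base-128 varint, assumed here for illustration only; the names
put_varint and get_varint are made up and are not Ceph's encoding helpers.
Small counter values take one or two bytes instead of a fixed eight:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append v as a base-128 varint: 7 payload bits per byte, least significant
// group first, high bit set on every byte except the last.
inline void put_varint(std::vector<uint8_t>& out, uint64_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<uint8_t>((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));
}

// Decode one varint starting at in[pos] and advance pos past it.
inline uint64_t get_varint(const std::vector<uint8_t>& in, size_t& pos) {
    uint64_t v = 0;
    for (unsigned shift = 0;; shift += 7) {
        assert(pos < in.size() && shift < 64);  // truncated or overlong input
        const uint8_t b = in[pos++];
        v |= static_cast<uint64_t>(b & 0x7F) << shift;
        if ((b & 0x80) == 0)
            return v;
    }
}
```

The cost is a data-dependent loop per field, which is the CPU trade-off
mentioned in the quoted message.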
>> >>>
>> >>>What about separating "object_stat_collection_t stats" from pg_stat_t?
>> >>>The pg info should be unchanged most of the time, so we would only update
>> >>>the object-related stats. This may cut the size roughly in half.
>> >
>> >I don't think this will work, since every op changes last_update in
>> >pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
>> >
>> >>Or we store only incremental values and keep the full structure in
>> >>memory (this may reduce the write to 20-30 bytes). Periodically we store
>> >>the full structure (only hundreds of bytes)....
>> >
>> >A delta is probably very compressible (only a few fields in the stats
>> >struct change).  The question is how fast can we make it in CPU time.
>> >Probably a simple delta (which will be almost all 0's) and a trivial
>> >run-length-encoding scheme that just gets rid of the 0's would do well
>> >enough...
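
A rough sketch of the simple delta plus zero-suppressing RLE described above,
under some assumptions: the stats are treated as a flat array of uint64_t
counters, the delta is a per-counter subtraction, and a zero run is written
as a 0x00 marker followed by a one-byte run length. The format and the
function names are hypothetical, not an existing Ceph encoding:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode cur relative to prev: per-counter difference, serialized as
// little-endian bytes, with each run of zero bytes written as {0x00, length}.
inline std::vector<uint8_t> encode_delta_rle(const std::vector<uint64_t>& prev,
                                             const std::vector<uint64_t>& cur) {
    assert(prev.size() == cur.size());
    std::vector<uint8_t> raw(cur.size() * 8);
    for (size_t i = 0; i < cur.size(); ++i) {
        const uint64_t d = cur[i] - prev[i];  // wraps mod 2^64, decode is exact
        for (size_t k = 0; k < 8; ++k)
            raw[i * 8 + k] = static_cast<uint8_t>(d >> (8 * k));
    }
    std::vector<uint8_t> out;
    for (size_t i = 0; i < raw.size();) {
        if (raw[i] != 0) { out.push_back(raw[i++]); continue; }
        size_t run = 0;
        while (i < raw.size() && raw[i] == 0 && run < 255) { ++i; ++run; }
        out.push_back(0);                          // zero-run marker
        out.push_back(static_cast<uint8_t>(run));  // run length, 1..255
    }
    return out;
}

// Inverse of encode_delta_rle: expand zero runs, then add deltas to prev.
inline std::vector<uint64_t> decode_delta_rle(const std::vector<uint64_t>& prev,
                                              const std::vector<uint8_t>& enc) {
    std::vector<uint8_t> raw;
    for (size_t i = 0; i < enc.size(); ++i) {
        if (enc[i] != 0) { raw.push_back(enc[i]); continue; }
        assert(i + 1 < enc.size());
        const size_t run = enc[++i];
        raw.insert(raw.end(), run, uint8_t{0});
    }
    assert(raw.size() == prev.size() * 8);
    std::vector<uint64_t> cur(prev.size());
    for (size_t i = 0; i < prev.size(); ++i) {
        uint64_t d = 0;
        for (size_t k = 0; k < 8; ++k)
            d |= static_cast<uint64_t>(raw[i * 8 + k]) << (8 * k);
        cur[i] = prev[i] + d;
    }
    return cur;
}
```

With four counters of which two change by small amounts, the 32 raw bytes
collapse to a handful of bytes, and the decoder needs the previous full
structure in memory, as discussed above.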
>>
>> Do we have any rough idea of how many/often consecutive 0s we end up
>> with in the current encoding?
>
> Or how high these counters get? We could try transposing the matrix made of
> those counters. At least the two most significant bytes in most of those
> counters are mostly zeros, and after transposing, simple RLE would be
> feasible. In any case, I'm not sure if *all* of these fields need to be
> uint64_t.
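
The transposition suggested above could be sketched as follows, assuming the
counters are viewed as an N x 8 byte matrix (one row per uint64_t,
little-endian byte order). The helper names are hypothetical. After
transposing, the mostly-zero high bytes of all counters sit next to each
other, which is what would make a simple RLE effective:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Byte-plane transpose: plane k of the output holds byte k of every counter,
// so the high-order (mostly zero) bytes end up contiguous at the end.
inline std::vector<uint8_t> transpose_bytes(const std::vector<uint64_t>& counters) {
    const size_t n = counters.size();
    std::vector<uint8_t> out(n * 8);
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < 8; ++k)
            out[k * n + i] = static_cast<uint8_t>(counters[i] >> (8 * k));
    return out;
}

// Inverse of transpose_bytes.
inline std::vector<uint64_t> untranspose_bytes(const std::vector<uint8_t>& planes) {
    assert(planes.size() % 8 == 0);
    const size_t n = planes.size() / 8;
    std::vector<uint64_t> counters(n, 0);
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < 8; ++k)
            counters[i] |= static_cast<uint64_t>(planes[k * n + i]) << (8 * k);
    return counters;
}

// Length of the zero run at the end of the buffer, i.e. what a trivial
// "drop trailing zeros" or RLE step could remove after transposing.
inline size_t trailing_zero_bytes(const std::vector<uint8_t>& v) {
    size_t n = 0;
    while (n < v.size() && v[v.size() - 1 - n] == 0)
        ++n;
    return n;
}
```

Measuring trailing_zero_bytes on real pg_stat_t counters would also answer
the earlier question about how many consecutive 0s there are.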
>
> --
> Piotr Dałek
> branch@xxxxxxxxxxxxxxxx
> http://blog.predictor.org.pl