RE: pg_stat_t is 500+ bytes

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Sage/Sam,
I was going through the pg_info_t structures and it seems the following fields are duplicates among the different inner structures.

eversion_t last_scrub;                                ->   duplicate within  pg_stat_t, pg_history_t
eversion_t last_deep_scrub;                    -> duplicate within  pg_stat_t, pg_history_t
utime_t last_scrub_stamp;                        -> duplicate within  pg_stat_t, pg_history_t
utime_t last_deep_scrub_stamp;           -> duplicate within  pg_stat_t, pg_history_t
utime_t last_clean_scrub_stamp;            -> duplicate within  pg_stat_t, pg_history_t

pg_stat_t -> created is duplicate of pg_history_t -> epoch_created ?

pg_stat_t->last_became_active is duplicate of pg_stat_t->last_active ?
pg_stat_t->last_became_peered is duplicate of pg_stat_t->last_peered ?

Following quick optimization we could do as well probably..
The following fields of object_stat_sum_t  can have value of 0 and 1 , so, why not replaced with structure bit field ?

  int32_t num_flush_mode_high;  // 1 when in high flush mode, otherwise 0
  int32_t num_flush_mode_low;   // 1 when in low flush mode, otherwise 0
  int32_t num_evict_mode_some;  // 1 when in evict some mode, otherwise 0
  int32_t num_evict_mode_full;  // 1 when in evict full mode, otherwise 0

Same with the following field of pg_info_t , we can replace with bitfield.

bool last_backfill_bitwise

These will replace 93 bytes with 5 bits.

Am I missing anything ?

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy
Sent: Thursday, August 25, 2016 6:29 PM
To: 'Haomai Wang'; Piotr Dałek
Cc: Mark Nelson; Sage Weil; ceph-devel@xxxxxxxxxxxxxxx
Subject: RE: pg_stat_t is 500+ bytes

Sage,
I can see for statistics we are storing 40 bytes for each IO..

Merge( Prefix = T key = 'bluestore_statfs' Value size = 40)

Should we really store it on each IO ?

1. IMO, gathering/storing stats should be configurable

2. If enabled, I think we can have counters maintained in memory and persist that to db between some configurable interval ?

BTW, as you mentioned, pg_info is huge, 855 bytes.

Put( Prefix = M key = 0x0000000000000746'._info' Value size = 855)


Thanks & Regards
Somnath


-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Haomai Wang
Sent: Thursday, August 25, 2016 5:56 AM
To: Piotr Dałek
Cc: Mark Nelson; Sage Weil; ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: pg_stat_t is 500+ bytes

On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xxxxxxxx> wrote:
> I think there exists a lot of fields can't be discarded via judging
> 0s, like utime_t, epoch_t. A simple way is to compare with previous
> pg_info_t.

Oh, sorry. We may could use a fresh pg_info_t to indicate. But below also could apply this way.

>
> BTW, I want to mentiond pg_info_t encoding occurs 6.05% cpu time in pg
> thread(thread level not process level).
>
> looks we have three optimization from mark and Piotr:
>
> 1. reconstruct pg_info_t and make high frequent fields together.
> 2. change some fields to smaller bits
> 3. uses SIMD to optimize low frequency fields difference
> comparison(optional)
>
> intel SSE4.2 pcmpistrm instructive could do very good 128bytes
> comparison, pg_info_t is above 700bytes:
>
> inline const char *skip_same_128_bytes_SIMD(const char* p, const char* p2) {
>     const __m128i w = _mm_load_si128((const __m128i *)p2);
>
>     for (;; p += 16) {
>         const __m128i s = _mm_load_si128((const __m128i *)p);
>         const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
>             _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
>             _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
>
>         if (r != 0) // some of characters isn't equal
>             return p + __builtin_ffs(r) - 1;
>     }
> }
>
> On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek <branch@xxxxxxxxxxxxxxxx> wrote:
>> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>>
>>>
>>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>>> >On Wed, 24 Aug 2016, Haomai Wang wrote:
>>> >>On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xxxxxxxx> wrote:
>>> >>>On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>> >>>>This is huge.  It takes the pg_info_t str from 306 bytes to 847
>>> >>>>bytes, and this _info omap key is rewritten on *every* IO.
>>> >>>>
>>> >>>>We could shrink this down significant with varint and/or delta
>>> >>>>encoding since a huge portion of it is just a bunch of uint64_t
>>> >>>>counters.  This will probably cost some CPU time, but OTOH it'll
>>> >>>>also shrink our metadata down a fair bit too which will pay off later.
>>> >>>>
>>> >>>>Anybody want to tackle this?
>>> >>>
>>> >>>what about separating "object_stat_collection_t stats" from pg_stat_t?
>>> >>>pg info should be unchanged for most of times, we could only
>>> >>>update object related stats. This may help to reduce half bytes.
>>> >
>>> >I don't think this will work, since every op changes last_update in
>>> >pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
>>> >
>>> >>Or we only store increment values and keep the full in memory(may
>>> >>reduce to 20-30bytes). In period time we store the full
>>> >>structure(only hundreds of bytes)....
>>> >
>>> >A delta is probably very compressible (only a few fields in the
>>> >stats struct change).  The question is how fast can we make it in CPU time.
>>> >Probably a simple delta (which will be almost all 0's) and a
>>> >trivial run-length-encoding scheme that just gets rid of the 0's
>>> >would do well enough...
>>>
>>> Do we have any rough idea of how many/often consecutive 0s we end up
>>> with in the current encoding?
>>
>> Or how high these counters get? We could try transposing the matrix
>> made of those counters. At least the two most significant bytes in
>> most of those counters are mostly zeros, and after transposing,
>> simple RLE would be feasible. In any case, I'm not sure if *all* of
>> these fields need to be uint64_t.
>>
>> --
>> Piotr Dałek
>> branch@xxxxxxxxxxxxxxxx
>> http://blog.predictor.org.pl
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at  http://vger.kernel.org/majordomo-info.html
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
��.n��������+%������w��{.n����z��u���ܨ}���Ơz�j:+v�����w����ޙ��&�)ߡ�a����z�ޗ���ݢj��w�f




[Index of Archives]     [CEPH Users]     [Ceph Large]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]
  Powered by Linux