On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xxxxxxxx> wrote:
> I think there are a lot of fields that can't be discarded just by
> checking for 0s, like utime_t and epoch_t. A simple way is to compare
> against the previous pg_info_t.

Oh, sorry -- we could use a fresh pg_info_t as the baseline instead. But
the approach below still applies either way.

> BTW, I want to mention that pg_info_t encoding accounts for 6.05% of
> CPU time in the pg thread (thread level, not process level).
>
> It looks like we have three optimizations from Mark and Piotr:
>
> 1. Restructure pg_info_t so that the frequently updated fields sit
>    together.
> 2. Shrink some fields to fewer bits.
> 3. Use SIMD to speed up the difference comparison of the low-frequency
>    fields (optional).
>
> The Intel SSE4.2 pcmpistrm instruction can compare 16 bytes (128 bits)
> per iteration, and pg_info_t is over 700 bytes:
>
> #include <smmintrin.h>  // SSE4.2 intrinsics
>
> // Return a pointer to the first byte at which p and p2 differ.
> // Uses unaligned loads (the buffers need not be 16-byte aligned) and
> // EQUAL_EACH for a byte-wise comparison, advancing both pointers.
> // Note: pcmpistrm treats its operands as implicit-length strings, so
> // an embedded NUL ends the comparison early; pcmpestrm with explicit
> // lengths would be safer for arbitrary binary data.
> inline const char *skip_same_128_bytes_SIMD(const char *p,
>                                             const char *p2) {
>     for (;; p += 16, p2 += 16) {
>         const __m128i w = _mm_loadu_si128((const __m128i *)p2);
>         const __m128i s = _mm_loadu_si128((const __m128i *)p);
>         const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
>             _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
>             _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
>
>         if (r != 0)  // some bytes are not equal
>             return p + __builtin_ffs(r) - 1;
>     }
> }

> On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek <branch@xxxxxxxxxxxxxxxx> wrote:
>> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>>
>>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>>> >On Wed, 24 Aug 2016, Haomai Wang wrote:
>>> >>On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xxxxxxxx> wrote:
>>> >>>On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>> >>>>This is huge. It takes the pg_info_t str from 306 bytes to 847
>>> >>>>bytes, and this _info omap key is rewritten on *every* IO.
>>> >>>>
>>> >>>>We could shrink this down significantly with varint and/or delta
>>> >>>>encoding, since a huge portion of it is just a bunch of uint64_t
>>> >>>>counters.
>>> >>>>This will probably cost some CPU time, but OTOH it will also
>>> >>>>shrink our metadata down a fair bit, which will pay off later.
>>> >>>>
>>> >>>>Anybody want to tackle this?
>>> >>>
>>> >>>What about separating "object_stat_collection_t stats" from
>>> >>>pg_stat_t? pg info should be unchanged most of the time, so we
>>> >>>could update only the object-related stats. This might cut the
>>> >>>bytes written roughly in half.
>>> >
>>> >I don't think this will work, since every op changes last_update in
>>> >pg_info_t *and* the stats (write op count, total bytes, objects,
>>> >etc.).
>>> >
>>> >>Or we could store only the increments and keep the full structure
>>> >>in memory (this may reduce each write to 20-30 bytes). Periodically
>>> >>we would store the full structure (only hundreds of bytes)....
>>> >
>>> >A delta is probably very compressible (only a few fields in the
>>> >stats struct change). The question is how fast we can make it in
>>> >CPU time. Probably a simple delta (which will be almost all 0's) and
>>> >a trivial run-length-encoding scheme that just gets rid of the 0's
>>> >would do well enough...
>>>
>>> Do we have any rough idea of how many consecutive 0s, and how often,
>>> we end up with in the current encoding?
>>
>> Or how high these counters get? We could try transposing the matrix
>> made of those counters. At least the two most significant bytes of
>> most of those counters are mostly zeros, and after transposing, simple
>> RLE would be feasible. In any case, I'm not sure if *all* of these
>> fields need to be uint64_t.
>>
>> --
>> Piotr Dałek
>> branch@xxxxxxxxxxxxxxxx
>> http://blog.predictor.org.pl