On Wed, 4 Jul 2018, Aleksei Gutikov wrote:
> On 07/03/2018 05:55 AM, Sage Weil wrote:
> > On Fri, 29 Jun 2018, Aleksei Gutikov wrote:
> > > Throughput is 100% the same, just sliced into bigger chunks (rados
> > > objects).  And that throughput is not high, less than a single
> > > object per second.  Yet memory stays occupied even after writing
> > > has stopped.
> > >
> > > At this point I am sure this is a side effect of sharing a
> > > buffer::raw object among different buffer::ptr objects.
> > >
> > > Please have a look at this dump of the ObjectContext::attr_cache of
> > > one of the contexts in PrimaryLogPG::object_contexts, made after
> > > uploading a single 4M object into S3.  Notice the "_user.rgw.idtag"
> > > and "_user.rgw.tail_tag" xattrs: both are only 44 bytes long, yet
> > > each holds a reference to the same 4194304-byte buffer::raw object
> > > (hence nref=2).
> >
> > That is the smoking gun!  What version is this?
>
> This particular dump is from 12.2.2, but the issue was also
> reproducible on 12.2.5 and master.

I think this will fix it:

	https://github.com/ceph/ceph/pull/22858

Can you test?  (The patch should be a clean cherry-pick to mimic or
luminous.)

sage

> > Thanks!
> > sage
> >
> > > "_": buffer::list(len=302, buffer::ptr(0~302 0x559318e74d80 in raw 0x559318e74d80 len 488 nref 1)),
> > > "_user.rgw.acl": buffer::list(len=147, buffer::ptr(448~147 0x55931677c4c0 in raw 0x55931677c300 len 1126 nref 9)),
> > > "_user.rgw.content_type": buffer::list(len=25, buffer::ptr(616~25 0x55931677c568 in raw 0x55931677c300 len 1126 nref 9)),
> > > "_user.rgw.etag": buffer::list(len=33, buffer::ptr(654~33 0x55931677c58e in raw 0x55931677c300 len 1126 nref 9)),
> > > "_user.rgw.idtag": buffer::list(len=44, buffer::ptr(14~44 0x55931958e00e in raw 0x55931958e000 len 4194304 nref 2)),
> > > "_user.rgw.manifest": buffer::list(len=300, buffer::ptr(136~300 0x55931677c388 in raw 0x55931677c300 len 1126 nref 9)),
> > > "_user.rgw.pg_ver": buffer::list(len=8, buffer::ptr(0~8 0x559319124000 in raw 0x559319124000 len 4008 nref 1)),
> > > "_user.rgw.source_zone": buffer::list(len=4, buffer::ptr(1122~4 0x55931677c762 in raw 0x55931677c300 len 1126 nref 9)),
> > > "_user.rgw.tail_tag": buffer::list(len=44, buffer::ptr(75~44 0x55931958e04b in raw 0x55931958e000 len 4194304 nref 2)),
> > > "_user.rgw.x-amz-content-sha256": buffer::list(len=65, buffer::ptr(716~65 0x55931677c5cc in raw 0x55931677c300 len 1126 nref 9)),
> > > "_user.rgw.x-amz-date": buffer::list(len=17, buffer::ptr(800~17 0x55931677c620 in raw 0x55931677c300 len 1126 nref 9)),
> > > "_user.rgw.x-amz-meta-s3cmd-attrs": buffer::list(len=173, buffer::ptr(848~173 0x55931677c650 in raw 0x55931677c300 len 1126 nref 9)),
> > > "_user.rgw.x-amz-storage-class": buffer::list(len=9, buffer::ptr(1049~9 0x55931677c719 in raw 0x55931677c300 len 1126 nref 9)),
> > > "snapset": buffer::list(len=35, buffer::ptr(0~35 0x559319127000 in raw 0x559319127000 len 4008 nref 1))
> > >
> > > Theoretically, with 300 PGs per OSD, EC 8+3,
> > > osd_pg_object_context_cache_count=64 and rgw_obj_stripe_size=4M,
> > > this cache can consume up to 300/11 * 64 * 4M = 6.9G, purely as a
> > > side effect of the shared buffer::raw.  We do not see memory usage
> > > that high only because rgw does not set xattrs on all of the rados
> > > objects that make up a big S3 object.  But with a synthetic test
> > > where every S3 object is 4M in size it can easily be reached.
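To make the mechanism concrete, here is a minimal standalone sketch of
the pinning effect, written against the bufferlist API (assuming it is
compiled inside a ceph source tree where the global bufferlist typedef
is available; the append stands in for the buffer received from the
network, and the 14~44 offsets are the ones from the idtag entry above):

    #include "include/buffer.h"  // bufferlist (global typedef in the ceph tree)
    #include <string>

    int main() {
      bufferlist big;
      big.append(std::string(4194304, 'x'));  // stands in for the 4M buffer read off the wire

      bufferlist idtag;
      idtag.substr_of(big, 14, 44);  // shares big's buffer::raw; nref becomes 2

      big.clear();      // the 4M raw stays allocated: the 44-byte ptr still pins it

      idtag.rebuild();  // copies the 44 bytes into a fresh, right-sized raw
                        // and drops the last reference to the 4M allocation
      return 0;
    }

The rebuild() at the end is essentially what any fix has to do somewhere:
copy out the few bytes that are actually needed so that the large raw
buffer can be freed.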
> > > Thanks,
> > > Aleksei
> > >
> > > On 06/29/2018 03:30 AM, Gregory Farnum wrote:
> > > > Can you talk more about how you identified this as an issue and
> > > > came up with the potential solutions you've proposed?
> > > >
> > > > Naively, if I'm told that larger objects make the OSD take up
> > > > more memory, it sounds to me like the OSD is probably providing
> > > > more throughput, and that if you want it to use less memory you
> > > > just ought to change the amount of outstanding IO it lets into
> > > > the system.
> > > > -Greg
> > > >
> > > > On Thu, Jun 28, 2018 at 1:29 AM, Aleksei Gutikov
> > > > <aleksey.gutikov@xxxxxxxxxx> wrote:
> > > > >
> > > > > NOTE: rgw_max_chunk_size must be equal to rgw_obj_stripe_size,
> > > > > so I mean both when I refer to either one.
> > > > >
> > > > > For example, when I changed rgw_obj_stripe_size from 4M to 16M,
> > > > > OSD memory usage increased approximately 2.5 times.  The issue
> > > > > was reproduced with erasure-coded pools.
> > > > >
> > > > > The OSD command dump_mempools shows that only the anon pool
> > > > > bytes increased.
> > > > >
> > > > > Further investigation shows that the whole buffer::raw object
> > > > > received from the network (created in alloc_aligned_buffer() in
> > > > > AsyncConnection.cc:623) is preserved: the whole 4M or 16M
> > > > > buffer::raw objects stay alive with nref>0 in
> > > > > ObjectContext::attr_cache inside PrimaryLogPG::object_contexts.
> > > > >
> > > > > The issue was reproduced on both the luminous and master
> > > > > branches.
> > > > >
> > > > > I see at least two types of improvement:
> > > > >
> > > > > 1) memcpy relatively small parts of a buffer::raw when creating
> > > > > a new buffer::ptr.  Just as an example, with the following
> > > > > compile-time configuration parameters:
> > > > >   BUFFER_MIN_SIZE_COPY_FROM = 64k
> > > > >   BUFFER_MAX_SIZE_TO_COPY = 16k
> > > > >   BUFFER_MIN_RATIO_TO_COPY = 128
> > > > > this would copy up to 512 bytes from a 64k raw object, copy up
> > > > > to 16k from a 4M object, and never copy from a 63k raw object
> > > > > (see the first sketch below).
> > > > > Pros: will improve all issues of this type (pinning of
> > > > > buffer::raw objects).
> > > > > Cons: unknown impact, memory fragmentation for example.
> > > > >
> > > > > 2) Improvements related specifically to
> > > > > PrimaryLogPG::object_contexts:
> > > > >
> > > > > 2.1) Set osd_pg_object_context_cache_count to 1 or 0.
> > > > >      Cons: the cache will effectively stop working.
> > > > >
> > > > > 2.2) Recreate the bufferlists of attr_cache entries when
> > > > >      inserting into the cache, so the attrs are copied and the
> > > > >      huge buffer can be freed later (see the second sketch
> > > > >      below).
> > > > >      Pros: minimal impact on any other subsystem.
> > > > >      Cons: improves only this particular case.
> > > > >
> > > > > 2.3) Limit object_contexts by total used memory in addition to
> > > > >      osd_pg_object_context_cache_count.
> > > > >      Cons: the cache will probably not work, because each entry
> > > > >      occupies a lot of memory, so all entries would be skipped.
> > > > >
> > > > > 2.4) Remove object_contexts completely and create contexts on
> > > > >      the fly every time.
> > > > >      Cons: object_contexts does not look like a spare part that
> > > > >      can be safely removed.
> > > > >
> > > > > We tested osd_pg_object_context_cache_count=1 as a hotfix, and
> > > > > it improved OSD memory usage significantly, with no dependency
> > > > > on rgw_obj_stripe_size.
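To make the thresholds in (1) concrete, here is a sketch of the
copy-vs-share decision.  The constants are the ones proposed above; the
function itself is hypothetical and does not exist in the tree:

    static constexpr unsigned BUFFER_MIN_SIZE_COPY_FROM = 64 * 1024; // only copy out of raws >= 64k
    static constexpr unsigned BUFFER_MAX_SIZE_TO_COPY   = 16 * 1024; // never copy slices larger than 16k
    static constexpr unsigned BUFFER_MIN_RATIO_TO_COPY  = 128;       // raw must be >= 128x the slice

    // Decide whether a new buffer::ptr should memcpy its bytes out of a
    // raw buffer instead of taking a reference that pins it.
    bool should_copy_instead_of_share(unsigned raw_len, unsigned slice_len) {
      return slice_len > 0 &&
             raw_len >= BUFFER_MIN_SIZE_COPY_FROM &&
             slice_len <= BUFFER_MAX_SIZE_TO_COPY &&
             raw_len / slice_len >= BUFFER_MIN_RATIO_TO_COPY;
    }

This reproduces the examples above: 512 bytes is the largest slice
copied out of a 64k raw (65536/512 = 128), up to 16k is copied out of a
4M raw (ratio 256), and nothing is ever copied out of a 63k raw.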
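And a sketch of what (2.2) could look like at cache-insert time.
attr_cache is the map<string, bufferlist> shown in the dump earlier;
the helper name is made up for illustration:

    #include "include/buffer.h"
    #include <map>
    #include <string>

    // Hypothetical helper, to be called as attrs are inserted into
    // ObjectContext::attr_cache.
    void flatten_attr_cache(std::map<std::string, bufferlist>& attr_cache) {
      for (auto& p : attr_cache) {
        if (p.second.length())
          p.second.rebuild();  // reallocate each value into a buffer sized
                               // to the value itself, releasing the shared
                               // multi-megabyte buffer::raw
      }
    }

rebuild() is the same memcpy as in (1), just applied once at
cache-insert time rather than at every buffer::ptr creation.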
> > > > >
> > > > > Could somebody please clarify a little what the purpose of
> > > > > PrimaryLogPG::object_contexts is, and maybe suggest an approach
> > > > > to fixing this issue?
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Aleksei Gutikov
> > > > > Software Engineer | synesis.ru | Minsk. BY
> > >
> > > --
> > > Best regards,
> > > Aleksei Gutikov
> > > Software Engineer | synesis.ru | Minsk. BY
>
> --
> Best regards,
> Aleksei Gutikov
> Software Engineer | synesis.ru | Minsk. BY