Hi yuxiang: glad to see you've pushed this forward! I will take a review.

On 14 July 2017 at 01:48, yuxiang fang <abcdeffyx@xxxxxxxxx> wrote:
> Hope this mail goes through using "plain text mode".
>
> We once faced a problem when we created radosgw object storage on an EC
> pool (k=4, m=2). Using COSBench for upload tests, we got 700-800 ops for
> 4K objects, but 2212.18 ops for 4K objects from a 3-replica pool. EC
> performance is lower than the 3-replica pool, but they eventually reach
> similar throughput once objects get bigger (4MB, 8MB or more). This
> phenomenon is easy to explain: CPU is the bottleneck when uploading
> small objects, but disks become the bottleneck when uploading bigger
> objects.
>
> Our customers always care about cost, so EC is a good choice to lower
> the cost of capacity, but it also brings the trouble mentioned above.
> So I wanted to find a way to improve the performance of EC for
> radosgw-based object storage, and found a way to balance capacity and
> performance.
>
> My opinion is that we should support storing the head and tail objects
> of a radosgw object separately, i.e. store head objects in a 3-replica
> pool and tail objects in an EC pool. For small objects we then get the
> performance of the 3-replica pool, and we also get 67% usable capacity
> from EC (3-replica only gives 33%).
>
> Consider a scenario: we want to upload big (MB or GB) objects and
> prefer multipart. radosgw stripes every part into several tail rados
> objects with no head object, and all of them land in the EC pool. So we
> get throughput similar to the 3-replica pool, since these are big
> objects, and we also keep the capacity benefit.
>
> The Pareto principle (also known as the 80–20 rule) also shows up in
> some workloads, i.e. 20% of files/objects occupy 80% of the capacity.
> This is not just a subjective guess: my company's shared disk (like
> Dropbox, storing department e-docs, software, and so on) obeys the rule
> and is even closer to 85/15 (15% of the files occupy 85% of the
> capacity).
>
> As in the mail I sent several days ago (rejected by the Mail Delivery
> Subsystem): if we introduce a tail_data_pool in the placement rule to
> store tail objects, we can create a replicated pool for data_pool and
> an EC pool for tail_data_pool to balance performance and capacity.
>
> I have opened a PR and request comments:
> https://github.com/ceph/ceph/pull/16325
>
>
> thanks
> ivan from eisoo
>
>
> On Thu, Jul 6, 2017 at 7:00 PM, Jiaying Ren <mikulely@xxxxxxxxx> wrote:
>> Thanks all for your insight! After more investigation, I'd like to
>> share some output; your comments are appreciated as always. ;-)
>>
>> * proposal
>>
>> ** introduce tail_data_pool
>>
>> Each storage class is presented as an individual placement rule. Each
>> placement rule has several pools:
>>
>> + index_pool (for the bucket index)
>> + data_pool (for the head)
>> + tail_data_pool (for the tail)
>>
>> Finally, different storage classes use the same index_pool and
>> data_pool, but different tail_data_pools. Using different storage
>> classes means using different tail_data_pools.
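(Reviewer note: just to make the split concrete, below is a rough sketch of
how the two kinds of pools could be created on the RADOS side. The profile
name, pool names and PG counts are example values of mine, not anything
taken from the PR.)

#+BEGIN_EXAMPLE
# EC profile matching the k=4, m=2 layout discussed above (example name)
ceph osd erasure-code-profile set rgw-k4m2 k=4 m=2

# EC pool intended for rgw_obj tails (example name and PG count)
ceph osd pool create us-east-1.rgw.buckets.ecdata 128 128 erasure rgw-k4m2

# 3-replica pool intended for rgw_obj heads (size 3 is the usual default)
ceph osd pool create us-east-1.rgw.buckets.3replica 128 128 replicated
ceph osd pool set us-east-1.rgw.buckets.3replica size 3

# On Luminous and later, tag both pools for rgw use
ceph osd pool application enable us-east-1.rgw.buckets.ecdata rgw
ceph osd pool application enable us-east-1.rgw.buckets.3replica rgw
#+END_EXAMPLE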
>>
>> Here's a placement rule/storage class config sample output:
>>
>> #+BEGIN_EXAMPLE
>>   {
>>     "key": "STANDARD",
>>     "val": {
>>       "index_pool": "us-east-1.rgw.buckets.index",
>>       "data_pool": "us-east-1.rgw.buckets.data",
>>       "tail_data_pool": "us-east-1.rgw.buckets.3replica",  <- introduced for rgw_obj raw data
>>       "data_extra_pool": "us-east-1.rgw.buckets.non-ec",
>>       "index_type": 0,
>>       "compression": "",
>>       "inline_head": 1
>>     }
>>   },
>> #+END_EXAMPLE
>>
>> Multipart rgw_objs will be stored in the tail_data_pool. Furthermore,
>> for those rgw_objs that only have a head and no tail, we can refactor
>> the manifest to support disabling the inlining of the first chunk of
>> rgw_obj data into the head, which finally matches the semantics of the
>> AWS S3 storage classes:
>>
>> #+BEGIN_EXAMPLE
>>   {
>>     "key": "STANDARD",
>>     "val": {
>>       "index_pool": "us-east-1.rgw.buckets.index",
>>       "data_pool": "us-east-1.rgw.buckets.data",
>>       "tail_data_pool": "us-east-1.rgw.buckets.3replica",
>>       "data_extra_pool": "us-east-1.rgw.buckets.non-ec",
>>       "index_type": 0,
>>       "compression": "",
>>       "inline_head": 1  <- introduced to inline the first data chunk of rgw_obj into the head
>>     }
>>   },
>> #+END_EXAMPLE
>>
>> ** expose each storage class as an individual placement rule
>>
>> As a draft, placement list will list all storage classes:
>>
>> #+BEGIN_EXAMPLE
>> ./bin/radosgw-admin -c ceph.conf zone placement list
>> [
>>   {
>>     "key": "STANDARD",
>>     "val": {
>>       "index_pool": "us-east-1.rgw.buckets.index",
>>       "data_pool": "us-east-1.rgw.buckets.data",
>>       "tail_data_pool": "us-east-1.rgw.buckets.3replica",
>>       "data_extra_pool": "us-east-1.rgw.buckets.non-ec",
>>       "index_type": 0,
>>       "compression": "",
>>       "inline_head": 1
>>     }
>>   },
>>   {
>>     "key": "RRS",
>>     "val": {
>>       "index_pool": "us-east-1.rgw.buckets.index",
>>       "data_pool": "us-east-1.rgw.buckets.data",
>>       "tail_data_pool": "us-east-1.rgw.buckets.2replica",
>>       "data_extra_pool": "us-east-1.rgw.buckets.non-ec",
>>       "index_type": 0,
>>       "compression": "",
>>       "inline_head": 1
>>     }
>>   }
>> ]
>> #+END_EXAMPLE
>>
>> Another option would be to expose several storage classes inside the
>> same placement rule:
>>
>> #+BEGIN_EXAMPLE
>> ./bin/radosgw-admin -c ceph.conf zone placement list
>> [
>>   {
>>     "key": "default-placement",
>>     "val": {
>>       "index_pool": "us-east-1.rgw.buckets.index",
>>       "storage_class": {
>>         "STANDARD": {
>>           "data_pool": "us-east-1.rgw.3replica",
>>           "data_extra_pool": "us-east-1.rgw.buckets.non-ec",
>>           "inline_head": 1
>>         },
>>         "RRS": {
>>           "data_pool": "us-east-1.rgw.2replica",
>>           "data_extra_pool": "us-east-1.rgw.buckets.non-ec",
>>           "inline_head": 1
>>         }
>>       },
>>       "index_type": 0,
>>       "compression": ""
>>     }
>>   }
>> ]
>> #+END_EXAMPLE
>>
>> This approach restricts the meaning of a storage class to "a different
>> data pool". But we may want to support things like Multi-Regional
>> Storage (https://cloud.google.com/storage/docs/storage-classes#multi-regional)
>> in the future, so I'd prefer to expose storage classes at the
>> placement rule level.
>>
>> * issues
>>
>> If we introduce the tail_data_pool, we need the corresponding
>> modifications below. I'm not sure about this; feedback is appreciated.
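(Reviewer note: for reference, this is roughly how a placement target is
wired up with today's radosgw-admin flags. The --tail-data-pool and
--inline-head options in the comment are purely hypothetical, shown only to
illustrate how the proposed fields might be exposed; they do not exist in
radosgw-admin today.)

#+BEGIN_EXAMPLE
# Existing flags: declare the placement target, then set its pools
radosgw-admin zonegroup placement add --rgw-zonegroup=us-east-1 \
    --placement-id=STANDARD
radosgw-admin zone placement add --rgw-zone=us-east-1 \
    --placement-id=STANDARD \
    --index-pool=us-east-1.rgw.buckets.index \
    --data-pool=us-east-1.rgw.buckets.data \
    --data-extra-pool=us-east-1.rgw.buckets.non-ec

# Hypothetical flags for the proposed fields (NOT implemented today):
#   --tail-data-pool=us-east-1.rgw.buckets.3replica
#   --inline-head=1

# Commit the period if running with a realm/multisite setup
radosgw-admin period update --commit
#+END_EXAMPLE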
>>
>> ** use rgw_pool instead of placement rule in the RGWObjManifest
>>
>> In RGWObjManifest, we've defined two placement rules:
>>
>> + head_placement_rule
>>   (https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.h#L406)
>> + tail_placement.placement_rule
>>   (https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.h#L119)
>>
>> We then use the placement rule to find the data_pool of that placement
>> rule. If we introduce the tail_data_pool, there's no need to keep
>> tail_placement.placement_rule (although it is the same as
>> head_placement_rule).
>>
>> Inside RGWObjManifest, `class rgw_obj_select` also defines a
>> `placement_rule`
>> (https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.h#L127),
>> which is ultimately used to look up the data_pool of that placement
>> rule.
>>
>> So I propose to replace the placement rule in RGWObjManifest with an
>> rgw_pool, so that we get the chance to use both the tail_data_pool and
>> the data_pool of the same placement rule.
>>
>> On 23 June 2017 at 13:43, 方钰翔 <abcdeffyx@xxxxxxxxx> wrote:
>>> I think storing the head object and the tail objects in different
>>> pools is also necessary.
>>>
>>> If we introduce a tail_data_pool in the placement rule to store tail
>>> objects, we can create a replicated pool for data_pool and an EC pool
>>> for tail_data_pool to balance performance and capacity.
>>>
>>> 2017-06-22 17:44 GMT+08:00 Jiaying Ren <mikulely@xxxxxxxxx>:
>>>>
>>>> On 21 June 2017 at 23:50, Daniel Gryniewicz <dang@xxxxxxxxxx> wrote:
>>>>
>>>> >> My original thinking was that when we reassign an object to a new
>>>> >> placement, we only touch its tail which is incompatible with that.
>>>> >> However, thinking about it some more I don't see why we need to have
>>>> >> this limitation, so it's probably possible to keep the data in the
>>>> >> head in one case, and modify the object and have the data in the tail
>>>> >> (object's head will need to be rewritten anyway because we modify the
>>>> >> manifest).
>>>> >> I think that the decision whether we keep data in the head could be a
>>>> >> property of the zone.
>>>>
>>>> Yes, I guess we also need to check the zone placement rule config when
>>>> pulling the realm in a multisite env, to make sure the sync peer has
>>>> the same storage class support; multisite sync should also respect the
>>>> object storage class.
>>>>
>>>> >> In any case, once an object is created changing
>>>> >> this property will only affect newly created objects, and old objects
>>>> >> could still be read correctly. Having data in the head is an
>>>> >> optimization that supposedly reduces small objects latency, and I
>>>> >> still think it's useful in a mixed pools situation. The thought is
>>>> >> that the bulk of the data will be at the tail anyway. However, we
>>>> >> recently changed the default head size from 512k to 4M, so this might
>>>> >> not be true any more. Anyhow, I favour having this as a configurable
>>>> >> (which should be simple to add).
>>>> >>
>>>> >> Yehuda
>>>> >>
>>>> >
>>>> > I would be strongly against keeping data in the head when the head is
>>>> > in a lower-level storage class. That means that the entire object is
>>>> > violating the constraints of the storage class.
>>>>
>>>> Agreed. The default behavior of storage classes requires us to keep
>>>> the data in the head in the same pool as the tail. Even if we make
>>>> this a configurable option, we should disable this kind of inlining by
>>>> default to match the default behavior of storage classes.
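(Reviewer note: from the client's point of view this is driven by the
standard S3 storage class selection. With the AWS CLI it would look
something like the sketch below, assuming RGW maps REDUCED_REDUNDANCY to
the corresponding placement/pools once the feature lands; the endpoint,
bucket and key names are made up.)

#+BEGIN_EXAMPLE
# Upload an object with an explicit storage class (plain S3 API; how RGW
# maps the class to pools is exactly what is being designed here)
aws --endpoint-url http://rgw.example.com:7480 \
    s3 cp ./report.pdf s3://mybucket/report.pdf \
    --storage-class REDUCED_REDUNDANCY

# The (non-STANDARD) storage class is reported back on HEAD
aws --endpoint-url http://rgw.example.com:7480 \
    s3api head-object --bucket mybucket --key report.pdf
#+END_EXAMPLE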
>>>> >
>>>> > Of course, having the head in a lower storage class (data or not) is
>>>> > probably a violation. Maybe we'd have to require that all heads go in
>>>> > the highest storage class.
>>>> >
>>>> > Daniel
>>>>
>>>> On 21 June 2017 at 23:50, Daniel Gryniewicz <dang@xxxxxxxxxx> wrote:
>>>> > On 06/21/2017 11:14 AM, Yehuda Sadeh-Weinraub wrote:
>>>> >>
>>>> >> On Wed, Jun 21, 2017 at 7:46 AM, Daniel Gryniewicz <dang@xxxxxxxxxx> wrote:
>>>> >>>
>>>> >>> On 06/21/2017 10:04 AM, Matt Benjamin wrote:
>>>> >>>>
>>>> >>>> Hi,
>>>> >>>>
>>>> >>>> Looks very coherent.
>>>> >>>>
>>>> >>>> My main question is about...
>>>> >>>>
>>>> >>>> ----- Original Message -----
>>>> >>>>> From: "Jiaying Ren" <mikulely@xxxxxxxxx>
>>>> >>>>> To: "Yehuda Sadeh-Weinraub" <ysadehwe@xxxxxxxxxx>
>>>> >>>>> Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
>>>> >>>>> Sent: Wednesday, June 21, 2017 7:39:24 AM
>>>> >>>>> Subject: RGW: Implement S3 storage class feature
>>>> >>>>>
>>>> >>>>> * Todo List
>>>> >>>>>
>>>> >>>>> + the head of an rgw-object should only contain the metadata of the
>>>> >>>>>   rgw-object; the first chunk of rgw-object data should be stored in
>>>> >>>>>   the same pool as the tail of the rgw-object
>>>> >>>>
>>>> >>>> Is this always desirable?
>>>> >>>>
>>>> >>>
>>>> >>> Well, unless the head pool happens to have the correct storage class,
>>>> >>> it's necessary. And I'd guess that verification of this is
>>>> >>> complicated, although maybe not.
>>>> >>>
>>>> >>> Maybe we can use the head pool if it has >= the correct storage class?
>>>> >>>
>>>> >> My original thinking was that when we reassign an object to a new
>>>> >> placement, we only touch its tail which is incompatible with that.
>>>> >> However, thinking about it some more I don't see why we need to have
>>>> >> this limitation, so it's probably possible to keep the data in the
>>>> >> head in one case, and modify the object and have the data in the tail
>>>> >> (object's head will need to be rewritten anyway because we modify the
>>>> >> manifest).
>>>> >> I think that the decision whether we keep data in the head could be a
>>>> >> property of the zone. In any case, once an object is created changing
>>>> >> this property will only affect newly created objects, and old objects
>>>> >> could still be read correctly. Having data in the head is an
>>>> >> optimization that supposedly reduces small objects latency, and I
>>>> >> still think it's useful in a mixed pools situation. The thought is
>>>> >> that the bulk of the data will be at the tail anyway. However, we
>>>> >> recently changed the default head size from 512k to 4M, so this might
>>>> >> not be true any more. Anyhow, I favour having this as a configurable
>>>> >> (which should be simple to add).
>>>> >>
>>>> >> Yehuda
>>>> >>
>>>> >
>>>> > I would be strongly against keeping data in the head when the head is
>>>> > in a lower-level storage class. That means that the entire object is
>>>> > violating the constraints of the storage class.
>>>> >
>>>> > Of course, having the head in a lower storage class (data or not) is
>>>> > probably a violation. Maybe we'd have to require that all heads go in
>>>> > the highest storage class.
>>>> >
>>>> > Daniel
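(Reviewer note: while testing the PR, an easy way to check where the head
and the tail chunks of a given object actually land is to dump its
manifest; radosgw-admin prints the placement/pool information per object.
The bucket and object names below are just examples.)

#+BEGIN_EXAMPLE
# Dump the object's metadata and manifest; the output includes the head
# and tail placement info, so the head-vs-tail pool split is visible
radosgw-admin object stat --bucket=mybucket --object=report.pdf
#+END_EXAMPLE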
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html