Hopefully this mail gets through now that I am using "plain text mode".

We ran into a problem when we created radosgw object storage on an EC pool (k=4, m=2). Using cosbench for an upload test, we got 700-800 ops with 4K objects on the EC pool, but 2212.18 ops with 4K objects on a 3-replica pool. For small objects, performance on EC is lower than on the 3-replica pool, but the two eventually reach similar throughput as the object size grows (4MB, 8MB or bigger). This is easy to explain: CPU is the bottleneck when uploading small objects, while the disks are the bottleneck when uploading bigger objects.

Our customers are always concerned about cost, so EC is a good choice for lowering the cost of capacity, but it also brings the trouble mentioned above. So I looked for a way to improve EC performance for radosgw-based object storage, and found a way to balance capacity and performance.

My opinion is that we should support storing the head and tail objects of a radosgw object separately, that is, storing head objects in a 3-replica pool and tail objects in an EC pool. For small objects we then get the performance of the 3-replica pool, and we still benefit from EC's 67% capacity utilization for the tail data (3 replicas only give 33%).

Consider a scenario: when we want to upload big (MB or GB) objects we prefer multipart. radosgw stripes every part into several tail rados objects with no head object, and all of them land in the EC pool. So we get throughput similar to the 3-replica pool because they are big objects, and we also benefit from the capacity utilization.

The Pareto principle (also known as the 80–20 rule) also holds for some workloads: 20% of the files/objects occupy 80% of the capacity. This is not just a subjective guess; my company's shared disk (like Dropbox, storing department e-docs, software, and so on) obeys the rule, and is even 85-15 (15% of files occupy 80% of capacity).

As I wrote in the reply I sent several days ago (which was rejected by the Mail Delivery Subsystem), I suggest introducing a tail_data_pool in the placement rule to store tail objects.
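The head/tail split described above can be sketched in a few lines of Python. This is only an illustration, not radosgw code: the helper pool_for, the rule dictionary and the pool names are hypothetical, and falling back to data_pool when no tail_data_pool is configured is an assumption made for the example.

```python
# Illustrative sketch only, not radosgw code.
# The pool names and the helper below are hypothetical.

PLACEMENT_RULE = {
    # 3-replica pool that would hold head objects
    "data_pool": "example.rgw.buckets.replicated",
    # EC (k=4, m=2) pool that would hold tail objects
    "tail_data_pool": "example.rgw.buckets.ec",
}


def pool_for(rule, part):
    """Return the pool that would hold the 'head' or 'tail' rados objects."""
    if part == "head":
        return rule["data_pool"]
    if part == "tail":
        # Assumption: without a tail_data_pool, tails stay in data_pool.
        return rule.get("tail_data_pool") or rule["data_pool"]
    raise ValueError("part must be 'head' or 'tail', got %r" % (part,))


if __name__ == "__main__":
    for part in ("head", "tail"):
        print(part, "->", pool_for(PLACEMENT_RULE, part))
```

With a rule like this, a small object that fits entirely in the head would only touch the replicated pool, while multipart uploads, which have no head data, would go to the EC pool.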
We can then create a replicated pool for data_pool and an EC pool for tail_data_pool to balance performance and capacity.

I have opened a PR and am requesting comments:
https://github.com/ceph/ceph/pull/16325

Thanks,
ivan from eisoo

On Thu, Jul 6, 2017 at 7:00 PM, Jiaying Ren <mikulely@xxxxxxxxx> wrote:
> Thanks all for your insight!
> After more investigation, I'd like to
> share some output, your comments are appreciated as always. ;-)
>
> * proposal
>
> ** introduce tail_data_pool
>
> Each storage class is presented as an individual placement rule. Each
> placement rule has several pools:
>
> + index_pool(for bucket index)
> + data_pool(for head)
> + tail_data_pool(for tail)
>
> Finally, different storage classes use the same index_pool and
> data_pool, but different tail_data_pool. Using different storage
> classes means using different tail_data_pools.
>
> Here's a placement rule/storage class config sample output:
>
> #+BEGIN_EXAMPLE
> {
>     "key": "STANDARD",
>     "val": {
>         "index_pool": "us-east-1.rgw.buckets.index",
>         "data_pool": "us-east-1.rgw.buckets.data",
>         "tail_data_pool": "us-east-1.rgw.buckets.3replica",  <- introduced for rgw_obj raw data
>         "data_extra_pool": "us-east-1.rgw.buckets.non-ec",
>         "index_type": 0,
>         "compression": "",
>         "inline_head": 1
>     }
> },
> #+END_EXAMPLE
>
> Multipart rgw_obj will be stored at tail_data_pool.
> Furthermore, for
> those rgw_obj that only have a head and no tail, we can refactor Manifest to
> support disabling the inlining of the first data chunk of rgw_obj into the head,
> which can finally match the semantics of AWS S3 storage class:
>
> #+BEGIN_EXAMPLE
> {
>     "key": "STANDARD",
>     "val": {
>         "index_pool": "us-east-1.rgw.buckets.index",
>         "data_pool": "us-east-1.rgw.buckets.data",
>         "tail_data_pool": "us-east-1.rgw.buckets.3replica",
>         "data_extra_pool": "us-east-1.rgw.buckets.non-ec",
>         "index_type": 0,
>         "compression": "",
>         "inline_head": 1  <- introduced for inline first data chunk of rgw_obj into head
>     }
> },
> #+END_EXAMPLE
>
> ** expose different storage class as individual placement rule
>
> As a draft, placement list will list all storage classes:
>
> #+BEGIN_EXAMPLE
> ./bin/radosgw-admin -c ceph.conf zone placement list
> [
>     {
>         "key": "STANDARD",
>         "val": {
>             "index_pool": "us-east-1.rgw.buckets.index",
>             "data_pool": "us-east-1.rgw.buckets.data",
>             "tail_data_pool": "us-east-1.rgw.buckets.3replica",
>             "data_extra_pool": "us-east-1.rgw.buckets.non-ec",
>             "index_type": 0,
>             "compression": "",
>             "inline_head": 1
>         }
>     },
>
>     {
>         "key": "RRS",
>         "val": {
>             "index_pool": "us-east-1.rgw.buckets.index",
>             "data_pool": "us-east-1.rgw.buckets.data",
>             "tail_data_pool": "us-east-1.rgw.buckets.2replica",
>             "data_extra_pool": "us-east-1.rgw.buckets.non-ec",
>             "index_type": 0,
>             "compression": "",
>             "inline_head": 1
>         }
>     }
> ]
> #+END_EXAMPLE
>
> Another option would be to expose several storage classes in the same
> placement rule:
>
> #+BEGIN_EXAMPLE
> ./bin/radosgw-admin -c ceph.conf zone placement list
> [
>     {
>         "key": "default-placement",
>         "val": {
>             "index_pool": "us-east-1.rgw.buckets.index",
>             "storage_class": {
>                 "STANDARD" : {
>                     "data_pool": "us-east-1.rgw.3replica",
>                     "data_extra_pool": "us-east-1.rgw.buckets.non-ec",
>                     "inline_head": 1
>                 },
>                 "RRS" : {
>                     "data_pool": "us-east-1.rgw.2replica",
>                     "data_extra_pool": "us-east-1.rgw.buckets.non-ec",
>                     "inline_head": 1
>                 }
>             },
>             "index_type": 0,
>             "compression": ""
>         }
>     }
> ]
> #+END_EXAMPLE
>
> This approach restricts the meaning of storage class to a different data
> pool. But we may support things like Multi-Regional Storage (
> https://cloud.google.com/storage/docs/storage-classes#multi-regional )
> in the future. So I'd prefer to expose storage class at the placement rule
> level.
>
> * issues
>
> If we introduce the tail_data_pool, we need corresponding
> modifications. I'm not sure about this, feedback is appreciated.
>
> ** use rgw_pool instead of placement rule in the RGWManifest
>
> In the RGWObjManifest, we've defined two placement rules:
>
> + head_placement_rule
> (https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.h#L406)
> + tail_placement.placement_rule
> (https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.h#L119)
>
> Then we use the placement rule to find the data_pool of the placement
> rule. If we introduce the tail_data_pool, there's no need to keep
> tail_placement.placement_rule (although it is the same as
> head_placement_rule).
>
> Inside the RGWObjManifest, `class rgw_obj_select` also defines a
> `placement_rule`
> (https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.h#L127),
> which is finally used to find the data_pool of that
> placement rule.
>
> So I propose that, instead of using the placement rule in the
> RGWManifest, we replace it with rgw_pool, so that we have the chance to use
> tail_data_pool and data_pool in the same placement rule.
>
> On 23 June 2017 at 13:43, 方钰翔 <abcdeffyx@xxxxxxxxx> wrote:
>> I think storing the head object and tail objects in different pools is also
>> necessary.
>>
>> If we introduce a tail_data_pool in the placement rule to store tail objects, we
>> can create a replicated pool for data_pool and an EC pool for tail_data_pool to
>> balance performance and capacity.
>>
>> 2017-06-22 17:44 GMT+08:00 Jiaying Ren <mikulely@xxxxxxxxx>:
>>>
>>> On 21 June 2017 at 23:50, Daniel Gryniewicz <dang@xxxxxxxxxx> wrote:
>>>
>>>
>>> >> My original thinking was that when we reassign an object to a new
>>> >> placement, we only touch its tail which is incompatible with that.
>>> >> However, thinking about it some more I don't see why we need to have
>>> >> this limitation, so it's probably possible to keep the data in the
>>> >> head in one case, and modify the object and have the data in the tail
>>> >> (object's head will need to be rewritten anyway because we modify the
>>> >> manifest).
>>> >> I think that the decision whether we keep data in the head could be a
>>> >> property of the zone.
>>>
>>> Yes, I guess we also need to check the zone placement rule config when
>>> pulling the realm in the multisite env, to make sure the sync peer has
>>> the same storage class support; multisite sync should also respect
>>> object storage class.
>>>
>>> >> In any case, once an object is created changing
>>> >> this property will only affect newly created objects, and old objects
>>> >> could still be read correctly. Having data in the head is an
>>> >> optimization that supposedly reduces small objects latency, and I
>>> >> still think it's useful in a mixed pools situation. The thought is
>>> >> that the bulk of the data will be at the tail anyway. However, we
>>> >> recently changed the default head size from 512k to 4M, so this might
>>> >> not be true any more. Anyhow, I favour having this as a configurable
>>> >> (which should be simple to add).
>>> >>
>>> >> Yehuda
>>> >>
>>> >
>>> >
>>> > I would be strongly against keeping data in the head when the head is in a
>>> > lower-level storage class. That means that the entire object is violating
>>> > the constraints of the storage class.
>>>
>>> Agreed. The default behavior of storage class requires us to keep the
>>> data in the head in the same pool as the tail.
>>> Even if we make this a
>>> configurable option, we should disable this kind of inlining by
>>> default to match the default behavior of storage class.
>>>
>>> >
>>> > Of course, having the head in a lower storage class (data or not) is
>>> > probably a violation. Maybe we'd have to require that all heads go in the
>>> > highest storage class.
>>> >
>>> > Daniel
>>>
>>> On 21 June 2017 at 23:50, Daniel Gryniewicz <dang@xxxxxxxxxx> wrote:
>>> > On 06/21/2017 11:14 AM, Yehuda Sadeh-Weinraub wrote:
>>> >>
>>> >> On Wed, Jun 21, 2017 at 7:46 AM, Daniel Gryniewicz <dang@xxxxxxxxxx> wrote:
>>> >>>
>>> >>> On 06/21/2017 10:04 AM, Matt Benjamin wrote:
>>> >>>>
>>> >>>>
>>> >>>> Hi,
>>> >>>>
>>> >>>> Looks very coherent.
>>> >>>>
>>> >>>> My main question is about...
>>> >>>>
>>> >>>> ----- Original Message -----
>>> >>>>>
>>> >>>>>
>>> >>>>> From: "Jiaying Ren" <mikulely@xxxxxxxxx>
>>> >>>>> To: "Yehuda Sadeh-Weinraub" <ysadehwe@xxxxxxxxxx>
>>> >>>>> Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
>>> >>>>> Sent: Wednesday, June 21, 2017 7:39:24 AM
>>> >>>>> Subject: RGW: Implement S3 storage class feature
>>> >>>>>
>>> >>>>
>>> >>>>>
>>> >>>>> * Todo List
>>> >>>>>
>>> >>>>> + the head of rgw-object should only contain the metadata of
>>> >>>>> rgw-object, the first chunk of rgw-object data should be stored in
>>> >>>>> the same pool as the tail of rgw-object
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> Is this always desirable?
>>> >>>>
>>> >>>
>>> >>> Well, unless the head pool happens to have the correct storage class, it's
>>> >>> necessary. And I'd guess that verification of this is complicated, although
>>> >>> maybe not.
>>> >>>
>>> >>> Maybe we can use the head pool if it has >= the correct storage class?
>>> >>>
>>> >> My original thinking was that when we reassign an object to a new
>>> >> placement, we only touch its tail which is incompatible with that.
>>> >> However, thinking about it some more I don't see why we need to have
>>> >> this limitation, so it's probably possible to keep the data in the
>>> >> head in one case, and modify the object and have the data in the tail
>>> >> (object's head will need to be rewritten anyway because we modify the
>>> >> manifest).
>>> >> I think that the decision whether we keep data in the head could be a
>>> >> property of the zone. In any case, once an object is created changing
>>> >> this property will only affect newly created objects, and old objects
>>> >> could still be read correctly. Having data in the head is an
>>> >> optimization that supposedly reduces small objects latency, and I
>>> >> still think it's useful in a mixed pools situation. The thought is
>>> >> that the bulk of the data will be at the tail anyway. However, we
>>> >> recently changed the default head size from 512k to 4M, so this might
>>> >> not be true any more. Anyhow, I favour having this as a configurable
>>> >> (which should be simple to add).
>>> >>
>>> >> Yehuda
>>> >>
>>> >
>>> >
>>> > I would be strongly against keeping data in the head when the head is in a
>>> > lower-level storage class. That means that the entire object is violating
>>> > the constraints of the storage class.
>>> >
>>> > Of course, having the head in a lower storage class (data or not) is
>>> > probably a violation. Maybe we'd have to require that all heads go in the
>>> > highest storage class.
>>> >
>>> > Daniel
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
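For reference, the 67% and 33% capacity figures quoted at the top of this mail follow from simple ratios. A minimal Python sketch of that arithmetic is below; the function names are made up for illustration, and it ignores any per-object overhead such as EC stripe padding or metadata.

```python
# Illustrative arithmetic only; function names are hypothetical.

def ec_utilization(k, m):
    """Usable fraction of raw capacity with k data chunks and m coding chunks."""
    return k / (k + m)


def replica_utilization(copies):
    """Usable fraction of raw capacity when every object is stored `copies` times."""
    return 1 / copies


if __name__ == "__main__":
    print("EC k=4, m=2: %.0f%%" % (100 * ec_utilization(4, 2)))
    print("3 replicas:  %.0f%%" % (100 * replica_utilization(3)))
```

Run as a script, this prints 67% for the EC pool and 33% for the 3-replica pool.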