Re: RGW - Multisite setup -> question about Bucket - Sharding, limitations and synchronization

> On Jul 30, 2019, at 7:49 AM, Mainor Daly <ceph@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> 
> Hello,
> 
> (everything in context of S3)
> 
> 
> I'm currently trying to better understand bucket sharding in combination with a multisite RGW setup, and its possible limitations.
> 
> At the moment I understand that a bucket has a bucket index, which is a list of objects within the bucket.
> 
> There are also indexless buckets, but those are not usable for cases like a multisite RGW bucket, where you need a [delayed] consistent relation/state between bucket n [zone a] and bucket n [zone b].
> 
> Those bucket indexes are stored in "shards", and the shards get distributed over the whole zone cluster for scaling purposes.
> Red Hat recommends a maximum of 102,400 objects per shard and recommends this formula to determine the right shard count for a bucket:
> 
> number of objects expected in a bucket / 100,000
> The maximum number of supported shards (or tested limit) is 7877 shards.

Back in 2017 this maximum number of shards changed to 65521. This change is in luminous, mimic, and nautilus.
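
As a quick sanity check of that formula: assuming, purely for illustration, a bucket expected to hold 50 million objects, you would aim for roughly 500 index shards. A minimal shell sketch (the object count and bucket name are placeholders, and resharding an existing bucket in a multisite configuration has extra caveats, so check the docs for your release before running the reshard step):

  # expected objects / 100,000, rounded up, gives the suggested shard count
  expected=50000000
  echo $(( (expected + 99999) / 100000 ))    # prints 500
  # queue an existing bucket for resharding to that count
  radosgw-admin reshard add --bucket=mybucket --num-shards=500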

> That results in a total limit of 787,700,000 objects, as long as you want to stay in known and tested waters.
> 
> Now some of the things I did not 100% understand:
> 
> = QUESTION 1 =
> 
> Does each bucket have its own shards? E.g.
> 
> If bucket 1 reached its shard limit at 7877 shards, can I then create other buckets which start with their own fresh sets of shards?
> Or is it the other way around, which would mean all buckets save their index in the same shards, and if I reach the shard limit I need to create a second cluster?

Correct, each bucket has its own bucket index. And each bucket index can be sharded.
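
If you want to see that per-bucket sharding from the command line, a rough sketch (the bucket name is a placeholder and the exact metadata fields vary a bit between releases):

  # look up the bucket entry to get its bucket_id / index marker
  radosgw-admin metadata get bucket:mybucket
  # the bucket instance metadata includes the shard count (num_shards)
  radosgw-admin metadata get bucket.instance:mybucket:<bucket_id>
  # the matching index shard objects then show up in the index pool as .dir.<marker>...
  rados ls -p a.rgw.buckets.index | grep <marker>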

> = QUESTION 2 =
> How are these shards distributed over the cluster? I expect they are just objects in the rgw.buckets.index pool, is that correct?
> So, these ones:
> rados ls -p a.rgw.buckets.index 
> .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.274451.1
> .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.87683.1
> .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.64716.1
> .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.78046.2

They are just objects and distributed via the CRUSH algorithm.
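
If you're curious which placement group and OSDs a particular index shard object lands on, you can query the mapping directly, e.g. with one of the objects from your listing:

  # show the PG and acting OSD set for one bucket index shard object
  ceph osd map a.rgw.buckets.index .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.274451.1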

> = QUESTION 3 = 
> 
> 
> Do these bucket index shards have any relation to the RGW sync shards in an RGW multisite setup?
> E.g. if I have a ton of bucket index shards or buckets, does it have any impact on the sync shards?

They’re separate.
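
The 64 and 128 shard counts you quote below come from separate RGW options for the metadata and data changes logs, not from bucket index sharding. A sketch of the relevant settings (the values shown are the defaults as I understand them; verify against your release):

  # ceph.conf, [client.rgw.<name>] section: sync log sharding, independent of bucket index shards
  rgw_md_log_max_shards = 64       # metadata changes log shards
  rgw_data_log_num_shards = 128    # data changes log shards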

> radosgw-admin sync status
>           realm f0019e09-c830-4fe8-a992-435e6f463b7c (mumu_1)
>       zonegroup 307a1bb5-4d93-4a01-af21-0d8467b9bdfe (EU_1)
>            zone 5a9c4d16-27a6-4721-aeda-b1a539b3d73a (b)
>   metadata sync syncing
>                 full sync: 0/64 shards                    <= these ones I mean
>                 incremental sync: 64/64 shards
>                 metadata is caught up with master
>       data sync source: 3638e3a4-8dde-42ee-812a-f98e266548a4 (a)
>                         syncing
>                         full sync: 0/128 shards           <= and these ones
>                         incremental sync: 128/128 shards  <= and these ones
>                         data is caught up with source
> 
> 
> = QUESTION 4 = 
> (switching to sync shard related topics)
> 
> 
> What is the exact function and purpose of the sync shards? Do they impose any limit? E.g. maybe a maximum number of object entries that wait for synchronization to zone b.

They contain logs of items that need to be synced between zones. RGWs will look at them and sync objects. These logs are sharded so different RGWs can take on different shards and work on syncing in parallel.
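
If you want to look at those logs and the per-shard progress yourself, a rough sketch (the --source-zone value matches the "a" zone in your output; subcommand availability can vary between releases):

  # raw entries in the metadata and data changes logs
  radosgw-admin mdlog list
  radosgw-admin datalog list
  # per-shard view of how far this zone has caught up with its source
  radosgw-admin data sync status --source-zone=a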

> = QUESTION 5 = 
> Are those sync shards processed in parallel or sequentially? And where are those shards stored?

They’re sharded to allow parallelism. At any given moment, each shard is claimed by (locked by) one RGW, and each RGW may be claiming multiple shards. Collectively, all RGWs are claiming all shards. Each RGW is syncing multiple shards in parallel, and all RGWs are doing this in parallel. So in some sense there are two levels of parallelism.
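
As for where they are stored: to my understanding the log shards are themselves just objects in the zone's log pool, so you can list them the same way you listed the index shards (the pool name below assumes the same naming scheme as your a.rgw.buckets.index pool):

  # data changes log shards (data_log.0 ... data_log.127) and metadata log shards
  rados ls -p a.rgw.log | grep -E 'data_log|meta.log'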

> = QUESTION 6 = 
> As far as I have experienced, the sync process pretty much works like this:
> 
> 1.) The client sends an object or an operation to rados gateway A (RGW A)
> 2.) RGW A logs this operation into one of its sync shards and executes the operation against its local storage pool
> 3.) RGW B checks via GET requests at a regular interval whether any new entries appear in the RGW A log
> 4.) If a new entry exists, RGW B executes the operation against its local pool or pulls the new object from RGW A
> 
> Did I understand that correctly? (For my rough description of this functionality, I want to apologize to the developers, who for sure invested much time and effort into the design and building of that sync process.)

That’s about right.

> And if I understand it correctly, what would the exact strategy look like in a multisite setup to resync e.g. a single bucket where one zone got corrupted and must be brought back into a synchronous state?

Be aware that there are full syncs and incremental syncs. Full syncs just copy every object. Incremental syncs use logs to sync selectively. Perhaps Casey will weigh in and discuss the state transitions.
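
For a single corrupted bucket, one approach that gets discussed is to check that bucket's own sync state and then push it back through a full sync by toggling sync on the bucket. Treat the following as an assumption to verify on your release rather than a recipe (bucket name is a placeholder):

  # per-bucket sync state as seen from this zone
  radosgw-admin bucket sync status --bucket=mybucket
  # toggling sync should schedule the bucket for a fresh full sync
  radosgw-admin bucket sync disable --bucket=mybucket
  radosgw-admin bucket sync enable --bucket=mybucket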

> Hope that's the correct place to ask such questions.
> 
> Best Regards,
> Daly


--
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



