On Thu, Sep 7, 2017 at 10:04 PM, David Turner <drakonstein@xxxxxxxxx> wrote:

> One realm is called public, with a zonegroup called public-zg and a zone for each datacenter. The second realm is called internal, with a zonegroup called internal-zg and a zone for each datacenter. They each have their own RGWs and load balancers. The needs of our public-facing RGWs and load balancers were different enough from the internal-use ones that we split them up completely. We also have a local realm that does not use multisite, and a fourth realm called QA that mimics the public realm as closely as possible for staging configuration changes for the RGW daemons. All four realms have their own buckets, users, etc., and that is all working fine. For all of the radosgw-admin commands I am using the proper identifiers to make sure that each datacenter and realm are running commands on exactly what I expect them to (--rgw-realm=public --rgw-zonegroup=public-zg --rgw-zone=public-dc1 --source-zone=public-dc2).
>
> The data sync issue was in the internal realm, but running a data sync init and kickstarting the RGW daemons in each datacenter fixed the data discrepancies (I'm thinking it had something to do with a power failure a few months back that I just noticed recently). The metadata sync issue is in the public realm. I have no idea what is causing it to not sync properly, since running a `metadata sync init` catches it back up to the primary zone, but then it doesn't receive any new users created after that.

Sounds like an issue with the metadata log in the primary master zone. I'm not sure what could go wrong there, but maybe the master zone doesn't know that it is a master zone, or it's set to not log metadata. Or maybe there's a problem when the secondary is trying to fetch the metadata log. Maybe some kind of mismatch in the number of shards (though that's not likely). Try to see if the master logs any changes: you can use the 'radosgw-admin mdlog list' command.

Yehuda
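For example, Yehuda's suggestion might look roughly like this in practice, using the identifiers David gave above (public-dc1 here stands in for whichever zone is the metadata master; verify the exact flags, and the log_meta field name, against radosgw-admin --help on 10.2.x, since this is a sketch from memory):

    # does the master zone record metadata changes at all?
    # (some versions also want a --period=<id>)
    radosgw-admin mdlog list --rgw-realm=public --rgw-zone=public-dc1

    # is the master zone configured to log metadata? look for "log_meta": "true"
    # on the master zone's entry in the zonegroup map
    radosgw-admin zonegroup get --rgw-realm=public --rgw-zonegroup=public-zg

If mdlog list stays empty right after creating a test user on the master, that would point at the logging side rather than at the secondary's fetch.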
> On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
>>
>> On Thu, Sep 7, 2017 at 7:44 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
>> > Ok, I've been testing, investigating, researching, etc. for the last week, and I don't have any problems with data syncing. The clients on one side are creating multipart objects while the multisite sync is creating them as whole objects, and one of the datacenters is slower at cleaning up the shadow files. That's the big discrepancy between object counts in the pools between datacenters. I created a tool that goes through each bucket in a realm, does a recursive listing of all objects in it for both datacenters, and compares the two lists for any differences. The data is definitely in sync between the two datacenters, down to the modified time and byte of each file in S3.
>> >
>> > The metadata is still not syncing for the other realm, though. If I run `metadata sync init` then the second datacenter will catch up with all of the new users, but until I do that, newly created users on the primary side don't exist on the secondary side. `metadata sync status`, `sync status`, `metadata sync run` (only left running for 30 minutes before I ctrl+c it), etc. don't show any problems... but the new users just don't exist on the secondary side until I run `metadata sync init`. I created a new bucket with the new user and the bucket shows up in the second datacenter, but no objects, because the objects don't have a valid owner.
>> >
>> > Thank you all for the help with the data sync issue. You pushed me in good directions. Does anyone have any insight as to what is preventing the metadata from syncing in the other realm? I have two realms being synced using multisite, and only one of them isn't getting the metadata across. As far as I can tell they are configured identically.
>>
>> What do you mean you have two realms? Zones and zonegroups need to exist in the same realm in order for meta and data sync to happen correctly. Maybe I'm misunderstanding.
>>
>> Yehuda
>>
>> > On Thu, Aug 31, 2017 at 12:46 PM David Turner <drakonstein@xxxxxxxxx> wrote:
>> >>
>> >> All of the messages from sync error list are listed below. The number on the left is how many times the error message is found.
>> >>
>> >>   1811  "message": "failed to sync bucket instance: (16) Device or resource busy"
>> >>      7  "message": "failed to sync bucket instance: (5) Input\/output error"
>> >>     65  "message": "failed to sync object"
>> >>
>> >> On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman <owasserm@xxxxxxxxxx> wrote:
>> >>>
>> >>> Hi David,
>> >>>
>> >>> On Mon, Aug 28, 2017 at 8:33 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
>> >>>>
>> >>>> The vast majority of the sync error list is "failed to sync bucket instance: (16) Device or resource busy". I can't find anything on Google about this error message in relation to Ceph. Does anyone have any idea what it means and/or how to fix it?
>> >>>
>> >>> Those are intermediate errors resulting from several radosgw instances trying to acquire the same sync log shard lease. They don't affect the sync progress. Are there any other errors?
>> >>>
>> >>> Orit
>> >>>>
>> >>>> On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>> >>>>>
>> >>>>> Hi David,
>> >>>>>
>> >>>>> The 'data sync init' command won't touch any actual object data, no. Resetting the data sync status will just cause a zone to restart a full sync of the --source-zone's data changes log. This log only lists which buckets/shards have changes in them, which causes radosgw to consider them for bucket sync. So while the command may silence the warnings about data shards being behind, it's unlikely to resolve the issue with missing objects in those buckets.
>> >>>>>
>> >>>>> When data sync is behind for an extended period of time, it's usually because it's stuck retrying previous bucket sync failures. The 'sync error list' may help narrow down where those failures are.
>> >>>>>
>> >>>>> There is also a 'bucket sync init' command to clear the bucket sync status. Following that with a 'bucket sync run' should restart a full sync on the bucket, pulling in any new objects that are present on the source zone. I'm afraid that those commands haven't seen a lot of polish or testing, however.
>> >>>>>
>> >>>>> Casey
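To make Casey's suggestion concrete, the workflow would look roughly like this, run on the zone that is behind. Bucket, realm, and zone names are placeholders, and the exact flags should be checked against radosgw-admin --help for your version; this is only a sketch:

    # see which buckets/objects are failing and why
    radosgw-admin sync error list --rgw-realm=<realm> --rgw-zone=<local-zone>

    # reset the sync status for one affected bucket, then re-run a full sync of it
    radosgw-admin bucket sync init --bucket=<bucket> --source-zone=<remote-zone> --rgw-realm=<realm> --rgw-zone=<local-zone>
    radosgw-admin bucket sync run --bucket=<bucket> --source-zone=<remote-zone> --rgw-realm=<realm> --rgw-zone=<local-zone>

As Casey notes, these commands haven't seen much polish, so trying them on a single small bucket first seems prudent.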
>> >>>>> On 08/24/2017 04:15 PM, David Turner wrote:
>> >>>>>
>> >>>>> Apparently the data shards that are behind go in both directions, but only one zone is aware of the problem. Each cluster has objects in its data pool that the other doesn't have. I'm thinking about initiating a `data sync init` on both sides (one at a time) to get them back on the same page. Does anyone know if that command will overwrite any local data that the zone has that the other doesn't if you run `data sync init` on it?
>> >>>>>
>> >>>>> On Thu, Aug 24, 2017 at 1:51 PM David Turner <drakonstein@xxxxxxxxx> wrote:
>> >>>>>>
>> >>>>>> After restarting the 2 RGW daemons on the second site again, everything caught up on the metadata sync. Is there something about having 2 RGW daemons on each side of the multisite that might be causing an issue with the sync getting stale? I have another realm set up the same way that is having a hard time with its data shards being behind. I haven't told them to resync, but yesterday I noticed 90 shards were behind. It's caught back up to only 17 shards behind, but the oldest change not applied is 2 months old and no order of restarting RGW daemons is helping to resolve this.
>> >>>>>>
>> >>>>>> On Thu, Aug 24, 2017 at 10:59 AM David Turner <drakonstein@xxxxxxxxx> wrote:
>> >>>>>>>
>> >>>>>>> I have an RGW multisite 10.2.7 setup for bi-directional syncing. It has been operational for 5 months and working fine. I recently created a new user on the master zone, used that user to create a bucket, and put a public-acl object in there. The bucket was created on the second site, but the user was not, and the object errors out complaining that the access_key doesn't exist.
>> >>>>>>>
>> >>>>>>> That led me to think that the metadata isn't syncing, while bucket and data both are. I've also confirmed that data is syncing for other buckets as well in both directions. The sync status from the second site was this:
>> >>>>>>>
>> >>>>>>>   metadata sync syncing
>> >>>>>>>     full sync: 0/64 shards
>> >>>>>>>     incremental sync: 64/64 shards
>> >>>>>>>     metadata is caught up with master
>> >>>>>>>   data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
>> >>>>>>>     syncing
>> >>>>>>>     full sync: 0/128 shards
>> >>>>>>>     incremental sync: 128/128 shards
>> >>>>>>>     data is caught up with source
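(For reference, output like the above is what `radosgw-admin sync status` prints when run against the secondary zone, and the metadata-only detail comes from `radosgw-admin metadata sync status`; both commands are the ones David refers to elsewhere in the thread. Roughly, with realm/zone placeholders:)

    radosgw-admin sync status --rgw-realm=<realm> --rgw-zone=<secondary-zone>
    radosgw-admin metadata sync status --rgw-realm=<realm> --rgw-zone=<secondary-zone>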
>> >>>>>>>
>> >>>>>>> Sync status leads me to think that the second site believes it is up to date, even though it is missing a freshly created user. I restarted all of the RGW daemons for the zonegroup, but it didn't trigger anything to fix the missing user on the second site. I did some googling, found the sync init commands mentioned in a few ML posts, used metadata sync init, and now have this as the sync status:
>> >>>>>>>
>> >>>>>>>   metadata sync preparing for full sync
>> >>>>>>>     full sync: 64/64 shards
>> >>>>>>>     full sync: 0 entries to sync
>> >>>>>>>     incremental sync: 0/64 shards
>> >>>>>>>     metadata is behind on 70 shards
>> >>>>>>>     oldest incremental change not applied: 2017-03-01 21:13:43.0.126971s
>> >>>>>>>   data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
>> >>>>>>>     syncing
>> >>>>>>>     full sync: 0/128 shards
>> >>>>>>>     incremental sync: 128/128 shards
>> >>>>>>>     data is caught up with source
>> >>>>>>>
>> >>>>>>> It definitely triggered a fresh sync and told it to forget about what it had previously applied, as the date of the oldest change not applied is the day we initially set up multisite for this zone. The problem is that that was over 12 hours ago and the sync status hasn't caught up on any shards yet.
>> >>>>>>>
>> >>>>>>> Does anyone have any suggestions other than blasting the second site and setting it back up with a fresh start (the only option I can think of at this point)?
>> >>>>>>>
>> >>>>>>> Thank you,
>> >>>>>>> David Turner
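One low-level check that can help in this situation is to confirm directly whether the new user's metadata entry ever reached the secondary zone, independently of what the sync status reports. A rough sketch, with the uid, realm, and zone as placeholders and the usual --rgw-realm/--rgw-zone identifiers added as needed:

    # on the master zone: confirm the user appears in the user metadata section
    radosgw-admin metadata list user --rgw-realm=<realm> --rgw-zone=<master-zone>

    # on the secondary zone: see whether the same entry exists there
    radosgw-admin metadata get user:<uid> --rgw-realm=<realm> --rgw-zone=<secondary-zone>

If the secondary never gets the entry even while `metadata sync status` claims to be caught up, that points back at the metadata log on the master, which is where Yehuda's 'mdlog list' suggestion at the top of the thread comes in.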