On Thu, Sep 7, 2017 at 11:37 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
> I'm pretty sure I'm using the cluster admin user/keyring. Is there any
> output that would be helpful? Period, zonegroup get, etc?

- radosgw-admin period get
- radosgw-admin zone list
- radosgw-admin zonegroup list

For each zone, zonegroup in the result:
- radosgw-admin zone get --rgw-zone=<zone>
- radosgw-admin zonegroup get --rgw-zonegroup=<zonegroup>
- rados lspools

Also, create a user with --debug-rgw=20 --debug-ms=1; we need to look at the log.

Yehuda

> On Thu, Sep 7, 2017 at 4:27 PM Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
>>
>> On Thu, Sep 7, 2017 at 11:02 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
>> > I created a test user named 'ice' and then used it to create a bucket
>> > named ice. The bucket ice can be found in the second dc, but not the user.
>> > `mdlog list` showed ice for the bucket, but not for the user. I performed
>> > the same test in the internal realm and it showed the user and bucket
>> > both in `mdlog list`.
>>
>> Maybe your radosgw-admin command is running with a ceph user that
>> doesn't have permissions to write to the log pool? (Probably not,
>> because you are able to run the sync init commands.)
>> Another very slim explanation would be if you had, for some reason, an
>> overlapping zones configuration that shared some of the config but not
>> all of it, with radosgw running against the correct one and
>> radosgw-admin against the bad one. I don't think it's the second option.
>>
>> Yehuda
>>
>> > On Thu, Sep 7, 2017 at 3:27 PM Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
>> >>
>> >> On Thu, Sep 7, 2017 at 10:04 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
>> >> > One realm is called public with a zonegroup called public-zg with a
>> >> > zone for each datacenter. The second realm is called internal with a
>> >> > zonegroup called internal-zg with a zone for each datacenter. They each
>> >> > have their own rgw's and load balancers. The needs of our public-facing
>> >> > rgw's and load balancers vs our internal-use ones were different enough
>> >> > that we split them up completely. We also have a local realm that does
>> >> > not use multisite and a 4th realm called QA that mimics the public realm
>> >> > as much as possible for staging configuration changes for the rgw
>> >> > daemons. All 4 realms have their own buckets, users, etc, and that is
>> >> > all working fine. For all of the radosgw-admin commands I am using the
>> >> > proper identifiers to make sure that each datacenter and realm are
>> >> > running commands on exactly what I expect them to (--rgw-realm=public
>> >> > --rgw-zonegroup=public-zg --rgw-zone=public-dc1 --source-zone=public-dc2).
>> >> >
>> >> > The data sync issue was in the internal realm, but running a data sync
>> >> > init and kickstarting the rgw daemons in each datacenter fixed the data
>> >> > discrepancies (I'm thinking it had something to do with a power failure
>> >> > a few months back that I just noticed recently). The metadata sync issue
>> >> > is in the public realm. I have no idea what is causing this to not sync
>> >> > properly, since running a `metadata sync init` catches it back up to the
>> >> > primary zone, but then it doesn't receive any new users created after that.
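
For reference, the metadata resync sequence David describes above would look roughly like the following when run on the secondary zone; the realm/zonegroup/zone names are the ones quoted in this thread, so adjust them to your own setup:

    # run on the public-dc2 (secondary) side of the public realm
    radosgw-admin metadata sync status --rgw-realm=public --rgw-zonegroup=public-zg --rgw-zone=public-dc2
    radosgw-admin metadata sync init   --rgw-realm=public --rgw-zonegroup=public-zg --rgw-zone=public-dc2
    # then restart the local radosgw daemons so the reinitialized metadata sync starts running
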
>> >>
>> >> Sounds like an issue with the metadata log in the primary master zone.
>> >> Not sure what could go wrong there, but maybe the master zone doesn't
>> >> know that it is a master zone, or it's set to not log metadata. Or
>> >> maybe there's a problem when the secondary is trying to fetch the
>> >> metadata log. Maybe some kind of # of shards mismatch (though not
>> >> likely). Try to see if the master logs any changes; you can use the
>> >> 'radosgw-admin mdlog list' command.
>> >>
>> >> Yehuda
>> >>
>> >> > On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
>> >> >>
>> >> >> On Thu, Sep 7, 2017 at 7:44 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
>> >> >> > Ok, I've been testing, investigating, researching, etc. for the last
>> >> >> > week, and I don't have any problems with data syncing. The clients on
>> >> >> > one side are creating multipart objects while the multisite sync is
>> >> >> > creating them as whole objects, and one of the datacenters is slower
>> >> >> > at cleaning up the shadow files. That's the big discrepancy between
>> >> >> > object counts in the pools between datacenters. I created a tool that
>> >> >> > goes through each bucket in a realm, does a recursive listing of all
>> >> >> > objects in it for both datacenters, and compares the 2 lists for any
>> >> >> > differences. The data is definitely in sync between the 2 datacenters,
>> >> >> > down to the modified time and byte size of each file in s3.
>> >> >> >
>> >> >> > The metadata is still not syncing for the other realm, though. If I
>> >> >> > run `metadata sync init` then the second datacenter will catch up with
>> >> >> > all of the new users, but until I do that, newly created users on the
>> >> >> > primary side don't exist on the secondary side. `metadata sync status`,
>> >> >> > `sync status`, `metadata sync run` (only left running for 30 minutes
>> >> >> > before I ctrl+c it), etc. don't show any problems... but the new users
>> >> >> > just don't exist on the secondary side until I run `metadata sync init`.
>> >> >> > I created a new bucket with the new user and the bucket shows up in the
>> >> >> > second datacenter, but no objects, because the objects don't have a
>> >> >> > valid owner.
>> >> >> >
>> >> >> > Thank you all for the help with the data sync issue. You pushed me in
>> >> >> > good directions. Does anyone have any insight as to what is preventing
>> >> >> > the metadata from syncing in the other realm? I have 2 realms being
>> >> >> > synced using multisite and it's only 1 of them that isn't getting the
>> >> >> > metadata across. As far as I can tell it is configured identically.
>> >> >>
>> >> >> What do you mean you have two realms? Zones and zonegroups need to
>> >> >> exist in the same realm in order for meta and data sync to happen
>> >> >> correctly. Maybe I'm misunderstanding.
>> >> >>
>> >> >> Yehuda
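
(As an aside, a quick way to sanity-check which realm each zonegroup and zone actually belongs to is something along these lines; the realm names here are the ones used earlier in the thread:

    radosgw-admin realm list
    radosgw-admin period get --rgw-realm=public     # the period_map lists this realm's zonegroups and zones
    radosgw-admin period get --rgw-realm=internal

Comparing the two period maps shows whether any zone or zonegroup is unexpectedly shared between the realms.)
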
>> >> >> >
>> >> >> > On Thu, Aug 31, 2017 at 12:46 PM David Turner <drakonstein@xxxxxxxxx> wrote:
>> >> >> >>
>> >> >> >> All of the messages from sync error list are listed below. The number
>> >> >> >> on the left is how many times the error message is found.
>> >> >> >>
>> >> >> >>    1811    "message": "failed to sync bucket instance: (16) Device or resource busy"
>> >> >> >>       7    "message": "failed to sync bucket instance: (5) Input\/output error"
>> >> >> >>      65    "message": "failed to sync object"
>> >> >> >>
>> >> >> >> On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman <owasserm@xxxxxxxxxx> wrote:
>> >> >> >>>
>> >> >> >>> Hi David,
>> >> >> >>>
>> >> >> >>> On Mon, Aug 28, 2017 at 8:33 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
>> >> >> >>>>
>> >> >> >>>> The vast majority of the sync error list is "failed to sync bucket
>> >> >> >>>> instance: (16) Device or resource busy". I can't find anything on
>> >> >> >>>> Google about this error message in relation to Ceph. Does anyone have
>> >> >> >>>> any idea what this means and/or how to fix it?
>> >> >> >>>
>> >> >> >>> Those are intermediate errors resulting from several radosgw instances
>> >> >> >>> trying to acquire the same sync log shard lease. It doesn't affect the
>> >> >> >>> sync progress. Are there any other errors?
>> >> >> >>>
>> >> >> >>> Orit
>> >> >> >>>>
>> >> >> >>>> On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>> >> >> >>>>>
>> >> >> >>>>> Hi David,
>> >> >> >>>>>
>> >> >> >>>>> The 'data sync init' command won't touch any actual object data, no.
>> >> >> >>>>> Resetting the data sync status will just cause a zone to restart a
>> >> >> >>>>> full sync of the --source-zone's data changes log. This log only lists
>> >> >> >>>>> which buckets/shards have changes in them, which causes radosgw to
>> >> >> >>>>> consider them for bucket sync. So while the command may silence the
>> >> >> >>>>> warnings about data shards being behind, it's unlikely to resolve the
>> >> >> >>>>> issue with missing objects in those buckets.
>> >> >> >>>>>
>> >> >> >>>>> When data sync is behind for an extended period of time, it's usually
>> >> >> >>>>> because it's stuck retrying previous bucket sync failures. The 'sync
>> >> >> >>>>> error list' may help narrow down where those failures are.
>> >> >> >>>>>
>> >> >> >>>>> There is also a 'bucket sync init' command to clear the bucket sync
>> >> >> >>>>> status. Following that with a 'bucket sync run' should restart a full
>> >> >> >>>>> sync on the bucket, pulling in any new objects that are present on the
>> >> >> >>>>> source-zone. I'm afraid that those commands haven't seen a lot of
>> >> >> >>>>> polish or testing, however.
>> >> >> >>>>>
>> >> >> >>>>> Casey
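
A rough sketch of the per-bucket resync Casey describes, using a placeholder bucket name and the same --source-zone flag style David quotes earlier in the thread:

    # clear the bucket's sync status, then restart a full sync of that bucket from the source zone
    radosgw-admin bucket sync init --bucket=<bucket-name> --source-zone=public-dc2
    radosgw-admin bucket sync run  --bucket=<bucket-name> --source-zone=public-dc2
    # depending on the radosgw-admin version, 'bucket sync status --bucket=<bucket-name>' may also
    # be available to check where that bucket's sync stands afterwards
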
>> >> >> >>>>>
>> >> >> >>>>> On 08/24/2017 04:15 PM, David Turner wrote:
>> >> >> >>>>>
>> >> >> >>>>> Apparently the data shards that are behind go in both directions, but
>> >> >> >>>>> only one zone is aware of the problem. Each cluster has objects in its
>> >> >> >>>>> data pool that the other doesn't have. I'm thinking about initiating a
>> >> >> >>>>> `data sync init` on both sides (one at a time) to get them back on the
>> >> >> >>>>> same page. Does anyone know if that command will overwrite any local
>> >> >> >>>>> data that the zone has that the other doesn't if you run `data sync
>> >> >> >>>>> init` on it?
>> >> >> >>>>>
>> >> >> >>>>> On Thu, Aug 24, 2017 at 1:51 PM David Turner <drakonstein@xxxxxxxxx> wrote:
>> >> >> >>>>>>
>> >> >> >>>>>> After restarting the 2 RGW daemons on the second site again,
>> >> >> >>>>>> everything caught up on the metadata sync. Is there something about
>> >> >> >>>>>> having 2 RGW daemons on each side of the multisite that might be
>> >> >> >>>>>> causing an issue with the sync getting stale? I have another realm set
>> >> >> >>>>>> up the same way that is having a hard time with its data shards being
>> >> >> >>>>>> behind. I haven't told them to resync, but yesterday I noticed 90
>> >> >> >>>>>> shards were behind. It's caught back up to only 17 shards behind, but
>> >> >> >>>>>> the oldest change not applied is 2 months old, and no order of
>> >> >> >>>>>> restarting RGW daemons is helping to resolve this.
>> >> >> >>>>>>
>> >> >> >>>>>> On Thu, Aug 24, 2017 at 10:59 AM David Turner <drakonstein@xxxxxxxxx> wrote:
>> >> >> >>>>>>>
>> >> >> >>>>>>> I have an RGW multisite 10.2.7 setup for bi-directional syncing. This
>> >> >> >>>>>>> has been operational for 5 months and working fine. I recently created
>> >> >> >>>>>>> a new user on the master zone, used that user to create a bucket, and
>> >> >> >>>>>>> put a public-acl object in there. The bucket was created on the second
>> >> >> >>>>>>> site, but the user was not, and the object errors out complaining that
>> >> >> >>>>>>> the access_key doesn't exist.
>> >> >> >>>>>>>
>> >> >> >>>>>>> That led me to think that the metadata isn't syncing, while bucket and
>> >> >> >>>>>>> data both are. I've also confirmed that data is syncing for other
>> >> >> >>>>>>> buckets as well in both directions. The sync status from the second
>> >> >> >>>>>>> site was this:
>> >> >> >>>>>>>
>> >> >> >>>>>>>       metadata sync syncing
>> >> >> >>>>>>>             full sync: 0/64 shards
>> >> >> >>>>>>>             incremental sync: 64/64 shards
>> >> >> >>>>>>>             metadata is caught up with master
>> >> >> >>>>>>>       data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
>> >> >> >>>>>>>             syncing
>> >> >> >>>>>>>             full sync: 0/128 shards
>> >> >> >>>>>>>             incremental sync: 128/128 shards
>> >> >> >>>>>>>             data is caught up with source
>> >> >> >>>>>>>
>> >> >> >>>>>>> Sync status leads me to think that the second site believes it is up
>> >> >> >>>>>>> to date, even though it is missing a freshly created user. I restarted
>> >> >> >>>>>>> all of the rgw daemons for the zonegroup, but it didn't trigger
>> >> >> >>>>>>> anything to fix the missing user in the second site. I did some
>> >> >> >>>>>>> googling and found the sync init commands mentioned in a few ML posts,
>> >> >> >>>>>>> used metadata sync init, and now have this as the sync status:
>> >> >> >>>>>>>
>> >> >> >>>>>>>       metadata sync preparing for full sync
>> >> >> >>>>>>>             full sync: 64/64 shards
>> >> >> >>>>>>>             full sync: 0 entries to sync
>> >> >> >>>>>>>             incremental sync: 0/64 shards
>> >> >> >>>>>>>             metadata is behind on 70 shards
>> >> >> >>>>>>>             oldest incremental change not applied: 2017-03-01 21:13:43.0.126971s
>> >> >> >>>>>>>       data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
>> >> >> >>>>>>>             syncing
>> >> >> >>>>>>>             full sync: 0/128 shards
>> >> >> >>>>>>>             incremental sync: 128/128 shards
>> >> >> >>>>>>>             data is caught up with source
>> >> >> >>>>>>>
>> >> >> >>>>>>> It definitely triggered a fresh sync and told it to forget about what
>> >> >> >>>>>>> it had previously applied, as the date of the oldest change not
>> >> >> >>>>>>> applied is the day we initially set up multisite for this zone. The
>> >> >> >>>>>>> problem is that was over 12 hours ago and the sync status hasn't
>> >> >> >>>>>>> caught up on any shards yet.
>> >> >> >>>>>>>
>> >> >> >>>>>>> Does anyone have any suggestions other than blasting the second site
>> >> >> >>>>>>> and setting it back up with a fresh start (the only option I can think
>> >> >> >>>>>>> of at this point)?
>> >> >> >>>>>>>
>> >> >> >>>>>>> Thank you,
>> >> >> >>>>>>> David Turner

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com