Re: [EXTERNAL] Re: Multisite Pubsub - Duplicates Growing Uncontrollably

Alex Hussein-Kershaw <alexhus@xxxxxxxxxxxxx> · Tue, 19 Oct 2021 17:09:02 +0000

Hi Yuval,

Thanks again for the info, and for opening the tracker issue. We'll keep an eye on it and add comments as we progress this issue.

We're using two pubsub zones because we have two clients working in sync (using both S3 and CephFS to provide storage for complex/historical reasons that predate me); splitting into two zones achieves this for us.

I'm interested to hear how the investigation into the two pubsub sites goes. It sounds like we're in the minority using pubsub, and even more so using two sites. Please let me know if I can provide any further useful details.

Meanwhile we also had some sync issues on our cluster that gave me an idea: if there had been a network issue/failed sync, could this cause the same behaviour as the RGW restarts that you described? For example, the sync of an object is repeatedly failing and retrying due to some issue at the network layer, but is getting far enough to cause a pubsub event such that we end up with the duplicates?  

I saw your email to Dave - great to hear that the pull functionality won't be lost should pubsub be deprecated!

Best wishes,
Alex

-----Original Message-----
From: Yuval Lifshitz <ylifshit@xxxxxxxxxx> 
Sent: 18 October 2021 14:51
To: Alex Kershaw <alex.kershaw4@xxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxx>
Subject: [EXTERNAL]  Re: Multisite Pubsub - Duplicates Growing Uncontrollably

Hi Alex,

I also seemed to miss your email :-)

On Mon, Oct 18, 2021 at 11:32 AM Alex Kershaw <alex.kershaw4@xxxxxxxxx>
wrote:

> Hi Yuval,
>
> Apologies - I'm having some trouble with my Microsoft spam filter and
> I'm not sure this email reached you. If it did please excuse the duplicate.
> This is in response to:
> "Multisite Pubsub - Duplicates Growing Uncontrollably":
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/DPPEPYPAWLQIRPRZAEJAWJ72S2W6INNN/
>
> --------------------------------------------------------------------
>
> Hi Yuval,
>
>
>
> Thanks for the reply. Oddly it had not come through to my inbox and 
> I’ve only just spotted it.
>
>
>
> We have 4 total zones, siteA, siteB, siteApubsub and siteBpubsub.
> Interesting that there is an issue, is there a ceph tracker ticket for 
> this so I can keep an eye on it?
>

just opened a tracker: https://tracker.ceph.com/issues/52963
feel free to comment there.

> As you mentioned sounds like this isn’t the cause though.
>
>
>

I never tried pubsub with more than one pubsub zone. I will investigate whether this is the root cause.
BTW, assuming all zones are in the same zonegroup, why do you have 2 pubsub zones?

> I have verified these are the same events yes – for the most duplicated
> event, every single mtime attribute is the same. I don’t see an etag 
> field, but everything is the same between separate events referencing 
> the same object except the timestamp + id field.  The data I see looks like this:
>
>
>
>    {
>       "id": "1633954796.196156.775b109c",
>       "event": "OBJECT_CREATE",
>       "timestamp": "2021-10-11T12:19:56.196156Z",
>       "info": {
>         "attrs": {
>           "mtime": "2020-08-10T16:10:48.749795Z"
>         },
>         "bucket": {
>           "bucket_id": "b72446af-3ff1-4164-b91e-5bf72d72c2a9.8443461.1",
>           "name": "albansstack-scsdata",
>           "tenant": ""
>         },
>         "key": {
>           "instance": "4redKHrSif4Bs6nxWVRMHrWC5G1Quxt",
>           "name": "61/00/2020020801511142F85432289692-Subscriber"
>         }
>       }
>    },
>

agree. if the mtime is the same then it is probably the same object, since the "timestamp" and the event "id" are generated when the event is sent.
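
For anyone hitting the same symptom, the comparison described above (treat two events as the same if everything matches except "id" and "timestamp") can be sketched in a few lines of Python. This is only an illustration working on events already retrieved as JSON: the helper names (`object_identity`, `count_duplicates`, `dedupe`) are made up for this sketch and are not part of Ceph, and the second sample event's "id" and "timestamp" are hypothetical values added to show a duplicate.

```python
import copy
from collections import Counter

def object_identity(event):
    """Key built from every field in the sample except "id" and "timestamp"."""
    info = event["info"]
    return (
        event["event"],
        info["bucket"]["bucket_id"],
        info["key"]["name"],
        info["key"]["instance"],
        info["attrs"]["mtime"],
    )

def count_duplicates(events):
    """Map each identity seen more than once to its number of occurrences."""
    counts = Counter(object_identity(e) for e in events)
    return {key: n for key, n in counts.items() if n > 1}

def dedupe(events):
    """Keep the first event per identity, preserving order."""
    seen = set()
    unique = []
    for e in events:
        key = object_identity(e)
        if key not in seen:
            seen.add(key)
            unique.append(e)
    return unique

# First event is the one quoted in this thread.
first = {
    "id": "1633954796.196156.775b109c",
    "event": "OBJECT_CREATE",
    "timestamp": "2021-10-11T12:19:56.196156Z",
    "info": {
        "attrs": {"mtime": "2020-08-10T16:10:48.749795Z"},
        "bucket": {
            "bucket_id": "b72446af-3ff1-4164-b91e-5bf72d72c2a9.8443461.1",
            "name": "albansstack-scsdata",
            "tenant": "",
        },
        "key": {
            "instance": "4redKHrSif4Bs6nxWVRMHrWC5G1Quxt",
            "name": "61/00/2020020801511142F85432289692-Subscriber",
        },
    },
}

# Hypothetical duplicate: same object, different "id" and "timestamp".
second = copy.deepcopy(first)
second["id"] = "hypothetical-second-id"
second["timestamp"] = "2021-10-11T13:00:00.000000Z"

sample_events = [first, second]
duplicates = count_duplicates(sample_events)
unique_events = dedupe(sample_events)
```

Deduplicating on the consumer side only hides the symptom, of course; it does not explain why the events are regenerated.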

>
>
>
> Your comment on the RGW restarts is interesting, but we’re not 
> restarting these anymore – however I’m still seeing objects that I’m 
> not expecting. I had a look at the RGW logs and don’t see anything 
> implying RGW sync isn’t functioning as normal.
>
>
>
> The biggest surprise to me is that the mtimes of the objects are all old.
> My cluster’s “radosgw-admin sync status” was reporting that the data 
> sync was completed this morning, and I manually acked everything in 
> the pubsub queue. Now I am seeing more pubsub events with mtimes such as:
> "2020-08-10T16:10:48.749795Z" as above – I’m curious as to why this 
> can appear in pubsub – I think the mtime is saying this object hasn’t 
> been updated since 2020-08-10, so why is it on the pubsub queue at all 
> if I had a complete sync this morning and an empty queue? Perhaps I’m 
> misunderstanding something here, any insight you can provide is 
> greatly appreciated 😊
>
>
>
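
The mismatch described above (an old mtime on a freshly generated event) can be measured by comparing the two fields. The sketch below assumes both fields always use the timestamp format shown in the sample event; the function names and the one-day threshold are arbitrary choices for illustration, not anything defined by Ceph.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # format seen in the sample event

def parse_ts(value):
    """Parse a timestamp such as "2021-10-11T12:19:56.196156Z"."""
    return datetime.strptime(value, TS_FORMAT)

def event_lag(event):
    """Time between the object's mtime and the moment the event was generated."""
    return parse_ts(event["timestamp"]) - parse_ts(event["info"]["attrs"]["mtime"])

def is_stale(event, threshold=timedelta(days=1)):
    """True if the event was generated long after the object was last modified."""
    return event_lag(event) > threshold

# Only the two fields needed here, copied from the event quoted in this thread.
sample = {
    "timestamp": "2021-10-11T12:19:56.196156Z",
    "info": {"attrs": {"mtime": "2020-08-10T16:10:48.749795Z"}},
}

lag = event_lag(sample)
stale = is_stale(sample)
```

For the quoted event the lag is well over a year, which is what makes its appearance on a queue that was emptied the same morning surprising.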
> We’re using pubsub based notifications as our design makes use of both 
> getting kicks to an endpoint and using the API to retrieve a queue of 
> all unacknowledged events (it’s important for us that we don’t miss 
> any events – even if our product goes down temporarily).  I think this 
> reasoning is in line with the doc you linked.
>

agree that this is an important feature, as it allows you to overcome outages not only in ceph, but also in your system.
hence the idea is not to deprecate the "pull" functionality - but to replace it with a mechanism unrelated to zone syncing.

> I actually spotted your email regarding pubsub deprecation (which I 
> presume is the reason for asking) just this morning – I think someone 
> from my team was intending to get in touch with you regarding this.
>
>
>

I replied to Dave Piper on an email thread titled "RGW pubsub deprecation",
also covering the drawbacks of the mechanism being based on zone syncing.

> Thanks,
>
> Alex
>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx