Re: Jewel Multisite RGW Memory Issues

Ben Agricola <maz@xxxxxxxx> · Fri, 08 Jul 2016 08:42:15 +0000

So I've narrowed this down a bit further. I *think* this is happening during bucket listing: I started a radosgw process with increased logging and killed it as soon as I saw the RSS jump. The jump was accompanied by a ton of log lines from 'RGWRados::cls_bucket_list' printing out the names of the files in one of the buckets, probably 5000 lines total.
The OP of the request that generated the bucket list was '25RGWListBucket_ObjStore_S3', and it appears to have been made by one of the RGW nodes in the other site.
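
For anyone wanting to reproduce the "kill it as soon as the RSS jumps" step, something like the following could be used to watch a process. This is only a rough sketch: the helper names and the 2x threshold are made up for illustration, and it assumes a Linux box where /proc/<pid>/status exposes a VmRSS line.

```python
import re


def parse_vm_rss_kib(status_text):
    """Return VmRSS in KiB from the text of /proc/<pid>/status, or None if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the current RSS of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vm_rss_kib(status_file.read())


def rss_jumped(samples_kib, factor=2.0):
    """True if the newest sample is at least `factor` times the first (baseline) sample.

    The factor of 2 is an arbitrary illustrative threshold, not a recommendation.
    """
    if len(samples_kib) < 2 or samples_kib[0] <= 0:
        return False
    return samples_kib[-1] >= factor * samples_kib[0]
```

Polling read_rss_kib() every second or so, appending to a list, and stopping the process once rss_jumped() returns True would approximate what was done by hand here.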

Any ideas?

Ben.

On Mon, 27 Jun 2016 at 10:47 Ben Agricola <maz@xxxxxxxx> wrote:
Hi Pritha,

Urgh, not sure what happened to the formatting there - let's try again.

At the time, the 'primary' cluster (i.e. the one with the active data set) was receiving backup files from a small number of machines. Prior to replication being enabled, it was using ~10% RAM on the RadosGW boxes.

Without replication enabled, neither cluster sees any spikes in memory usage under normal operation, with a slight increase when deep scrubbing (I'm monitoring cluster memory usage as a whole so OSD memory increases would account for that). 

Neither cluster was performing a deep scrub at the time. The 'secondary' cluster (i.e. the one I was trying to sync data to, which now has replication disabled again) has now had a RadosGW process running under normal load since June 17 with replication disabled and is using 1084M RSS. This matches with historical graphing for the primary cluster, which has hovered around 1G RSS for RadosGW processes for the last 6 months.

I've just tested this out this morning and enabling replication caused all RadosGW processes to increase in memory usage (and continue increasing) from ~1000M RSS to ~20G RSS in about 2 minutes. As soon as replication is enabled (as in, within seconds) RSS of RadosGW on both clusters starts to increase and does not drop. This appears to happen during metadata sync as well as during normal data syncing.
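
To put a number on that rate, here is the back-of-envelope arithmetic as a small sketch. It assumes "M" and "G" mean MiB and GiB and that "about 2 minutes" is 120 seconds, so the result is only approximate.

```python
def growth_rate_mib_per_s(start_mib, end_mib, seconds):
    """Average RSS growth in MiB per second over the interval."""
    return (end_mib - start_mib) / seconds


# ~1000M RSS growing to ~20G RSS in about 2 minutes (figures from above).
rate = growth_rate_mib_per_s(1000, 20 * 1024, 120)
print(f"{rate:.0f} MiB/s")
```

That works out to roughly 160 MiB/s per RadosGW process, sustained.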

I then killed all RadosGW processes on the 'primary' side, and memory usage of the RadosGW processes on the 'secondary' side continued to increase at the same rate. There are no further messages in the RadosGW log while this is occurring (since there is no client traffic and no further replication traffic). If I kill the active RadosGW processes, they start back up and normal memory usage resumes.

Cheers,

Ben.


----- Original Message -----
> From: "Pritha Srivastava" <prsrivas@...>
> To: ceph-users@...
> Sent: Monday, June 27, 2016 07:32:23
> Subject: Re:  Jewel Multisite RGW Memory Issues

> Do you know if the memory usage is high only during load from clients and is
> steady otherwise?
> What was the nature of the workload at the time of the sync operation?

> Thanks,
> Pritha