Re: Jewel Multisite RGW Memory Issues

Corrected the formatting of the e-mail sent earlier.

----- Original Message -----
> From: "Pritha Srivastava" <prsrivas@xxxxxxxxxx>
> To: ceph-users@xxxxxxxxxxxxxx
> Sent: Monday, June 27, 2016 9:15:36 AM
> Subject: Re:  Jewel Multisite RGW Memory Issues
> 
> 
> I have 2 distinct clusters configured, in 2 different locations, and 1
> zonegroup.
> 
> Cluster 1 has ~11TB of data currently on it, S3 / Swift backups via
> the duplicity backup tool - each file is 25MB and probably 20% are
> multipart uploads from S3 (so 4MB stripes) - in total 3217k objects.
> This cluster has been running for months (without RGW replication)
> with no issue. Each site has 1 RGW instance at the moment.
> 
> I recently set up the second cluster on identical hardware in a
> secondary site. I configured a multi-site setup, with both of these
> sites in an active-active configuration. The second cluster has no
> active data set, so I would expect site 1 to start mirroring to site 2
> - and it does.
> 
> Unfortunately as soon as the RGW syncing starts to run, the resident
> memory usage of radosgw instances on both clusters balloons massively
> until the process is OOMed. This isn't a slow leak - when testing I've
> found that the radosgw processes on either side can consume up to
> 300MB of extra RSS per *second*, completely OOMing a machine with
> 96GB of RAM in approximately 20 minutes.
> 
> If I stop the radosgw processes on one cluster (i.e. breaking
> replication) then the memory usage of the radosgw processes on the
> other cluster stays at around 100-500MB and does not really increase
> over time.
> 
> Obviously this makes multi-site replication completely unusable, so
> I'm wondering if anyone has a fix or workaround. I noticed some pull
> requests have been merged into the master branch for RGW memory leak
> fixes, so I switched to v10.2.0-2453-g94fac96 from autobuild packages.
> This seems to slow the memory increase slightly, but not enough
> to make replication usable yet.
> 
> I've tried valgrinding the radosgw process but it doesn't come up with
> anything obviously leaking (I could be doing it wrong), but an example
> of the memory ballooning is captured by collectd:
> http://i.imgur.com/jePYnwz.png - this memory usage is *all* on the
> radosgw process RSS.
> 
> Anyone else seen this?

Do you know if the memory usage is high only during load from clients and is
steady otherwise?
What was the nature of the workload at the time of the sync operation?
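
One rough way to answer the first question is to sample the resident set size
of the radosgw process at a fixed interval, once while clients are idle and
once under client load, and compare the growth rates. The sketch below is only
an illustration and is not part of Ceph: the function names are made up for
this example, and it assumes a Linux host where /proc/<pid>/status exposes a
VmRSS line.

```python
import re
import time


def parse_vmrss_kb(status_text):
    """Return the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def read_rss_kb(pid):
    """Read the current RSS of a process (Linux only, assumes /proc is mounted)."""
    with open("/proc/%d/status" % pid) as f:
        return parse_vmrss_kb(f.read())


def growth_mb_per_s(samples):
    """Average RSS growth in MB/s between the first and last sample.

    samples is a list of (timestamp_seconds, rss_kb) tuples.
    """
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need increasing timestamps")
    return (r1 - r0) / 1024.0 / (t1 - t0)


def sample_rss(pid, interval_s=1.0, count=10):
    """Collect `count` RSS samples from `pid`, `interval_s` seconds apart."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), read_rss_kb(pid)))
        time.sleep(interval_s)
    return samples
```

For example, growth_mb_per_s(sample_rss(pid)) with the pid of the radosgw
process (e.g. from pidof radosgw) gives a single number per run that can be
compared between the idle and loaded cases.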

Thanks,
Pritha
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
