I have 2 distinct clusters configured in 2 different locations, and 1 zonegroup. Cluster 1 currently has ~11TB of data on it: S3/Swift backups written by the duplicity backup tool. Each file is 25MB, and probably 20% are multipart uploads from S3 (so 4MB stripes), for a total of about 3217k objects. This cluster has been running for months (without RGW replication) with no issues. Each site has 1 RGW instance at the moment.

I recently set up the second cluster on identical hardware in a secondary site and configured a multi-site setup with both sites in an active-active configuration. The second cluster has no active data set, so I would expect site 1 to start mirroring to site 2 - and it does.

Unfortunately, as soon as the RGW syncing starts to run, the resident memory usage of the radosgw instances on both clusters balloons massively until the process is OOMed. This isn't a slow leak - when testing I've found that the radosgw processes on either side can gain up to 300MB of extra RSS per *second*, completely OOMing a machine with 96GB of RAM in approximately 20 minutes. If I stop the radosgw processes on one cluster (i.e. breaking replication), the memory usage of the radosgw processes on the other cluster stays at around 100-500MB and does not really increase over time.

Obviously this makes multi-site replication completely unusable, so I'm wondering if anyone has a fix or workaround. I noticed some pull requests for RGW memory leak fixes have been merged into the master branch, so I switched to v10.2.0-2453-g94fac96 from the autobuild packages; this seems to slow the memory growth slightly, but not enough to make replication usable yet. I've tried running the radosgw process under valgrind, but it doesn't come up with anything obviously leaking (I could be doing it wrong). An example of the memory ballooning as captured by collectd: http://i.imgur.com/jePYnwz.png - this memory usage is *all* radosgw process RSS.

Has anyone else seen this?

Do you know if the memory usage is high only during load from clients and is steady otherwise? What was the nature of the workload at the time of the sync operation?

Thanks,
Pritha
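For anyone trying to reproduce this, the active-active setup described above corresponds to the usual Jewel realm/zonegroup/zone procedure. A rough sketch only - the realm/zonegroup/zone names, endpoints and system-user keys below are placeholders, not the actual configuration from this thread:

    # Master zone (site 1)
    radosgw-admin realm create --rgw-realm=backup --default
    radosgw-admin zonegroup create --rgw-zonegroup=main \
        --endpoints=http://rgw-site1:80 --master --default
    radosgw-admin zone create --rgw-zonegroup=main --rgw-zone=site1 \
        --endpoints=http://rgw-site1:80 --master --default \
        --access-key=SYNC_ACCESS_KEY --secret=SYNC_SECRET_KEY
    radosgw-admin user create --uid=sync-user --display-name="sync user" \
        --access-key=SYNC_ACCESS_KEY --secret=SYNC_SECRET_KEY --system
    radosgw-admin period update --commit

    # Secondary zone (site 2): pull the realm/period, then add the zone
    radosgw-admin realm pull --url=http://rgw-site1:80 \
        --access-key=SYNC_ACCESS_KEY --secret=SYNC_SECRET_KEY
    radosgw-admin period pull --url=http://rgw-site1:80 \
        --access-key=SYNC_ACCESS_KEY --secret=SYNC_SECRET_KEY
    radosgw-admin zone create --rgw-zonegroup=main --rgw-zone=site2 \
        --endpoints=http://rgw-site2:80 \
        --access-key=SYNC_ACCESS_KEY --secret=SYNC_SECRET_KEY
    radosgw-admin period update --commit
    # set rgw_zone in each gateway's ceph.conf section and restart radosgw
    # on both sites so the new period takes effect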
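On the question of whether memory is high only under load: one way to answer it is to sample the sync state and the radosgw RSS side by side while replication runs. A sketch, assuming the default admin socket path and a gateway instance named client.rgw.site1 (both are assumptions, adjust to your deployment):

    # Replication progress (run against either zone)
    radosgw-admin sync status

    # Sample radosgw RSS/VSZ once a second while the sync is running
    while sleep 1; do
        date +%T; ps -o rss=,vsz= -p "$(pidof radosgw)"
    done

    # Internal counters from the RGW admin socket
    ceph daemon /var/run/ceph/ceph-client.rgw.site1.asok perf dump

Correlating the RSS samples with the sync status output (and with whether duplicity clients are writing at the time) should show whether the growth tracks client load or the sync itself.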
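On the valgrind point: memcheck only reports allocations that are unreachable at exit, so memory that grows but is still referenced will not show up as a leak (running with --leak-check=full --show-reachable=yes helps a little). Massif is usually more informative for this kind of ballooning. A possible invocation, assuming the packaged service is stopped first and the gateway is run in the foreground for the test (instance name is again a placeholder):

    valgrind --tool=massif --pages-as-heap=yes \
        radosgw -f --cluster ceph --name client.rgw.site1 \
        --setuser ceph --setgroup ceph

    # After reproducing the growth and stopping the process:
    ms_print massif.out.*

Massif will slow the gateway down considerably, so the growth rate won't match the 300MB/s seen in production, but the allocation call stacks it records should still point at whatever is accumulating.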