Removing Snapshots Killing Cluster Performance

Hi!


We take regular (nightly) snapshots of our RADOS Gateway pools for backup purposes. This allows us - with some manual pokery - to restore clients' documents should they delete them accidentally.
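
For reference, they are created roughly like this (a simplified sketch of our nightly job; the actual job covers all our RGW pools):

$ sudo rados -p .rgw mksnap backup-$(date +%Y%m%d)
$ sudo rados -p .rgw.buckets mksnap backup-$(date +%Y%m%d)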


The cluster is a 4-server setup with 12x4TB spinning disks each, totaling about 175TB. We are running Firefly.


We have now completed our first month of snapshots and want to remove the oldest ones. Unfortunately, doing so practically kills everything else that is using the cluster, because performance drops to almost zero while the OSDs work their disks at 100% utilization (as per iostat). It seems to be the same phenomenon I asked about some time ago when we were deleting whole pools.
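
The removal itself is just a plain rmsnap per pool, e.g. (snapshot name illustrative):

$ sudo rados -p .rgw.buckets rmsnap backup-20141024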


I could not find any way to throttle the background deletion activity (the command returns almost immediately). Here is a graph of the I/O operations waiting (colored by device) while a few snapshots were being deleted. Each of the "blocks" in the graph shows one snapshot being removed. The big one in the middle was a snapshot of the .rgw.buckets pool. It took about 15 minutes, during which basically nothing relying on the cluster was working due to immense slowdowns. This included users getting kicked off their SSH sessions due to timeouts.


https://public.centerdevice.de/8c95f1c2-a7c3-457f-83b6-834688e0d048


While this is a big issue in itself for us, we would at least like to estimate how long the process will take per snapshot / per pool. I assume the time needed is a function of the number of objects that were modified between two snapshots. We tried to get an idea of at least how many objects were added/removed in total by running `rados df` with a snapshot specified as a parameter, but it seems we always get the current values:


$ sudo rados -p .rgw df --snap backup-20141109
selected snap 13 'backup-20141109'
pool name       category                 KB      objects
.rgw            -                     276165      1368545

$ sudo rados -p .rgw df --snap backup-20141124
selected snap 28 'backup-20141124'
pool name       category                 KB      objects
.rgw            -                     276165      1368546

$ sudo rados -p .rgw df
pool name       category                 KB      objects
.rgw            -                     276165      1368547
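
An alternative we have considered, but not yet tried, is diffing full object listings between two snapshots, along these lines (untested sketch; we are not sure whether `ls` actually honors the --snap option):

$ sudo rados -p .rgw --snap backup-20141109 ls | sort > /tmp/objs-a
$ sudo rados -p .rgw --snap backup-20141124 ls | sort > /tmp/objs-b
$ comm -3 /tmp/objs-a /tmp/objs-b | wc -l   # objects differing between the two snaps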


So there are a few questions:


1) Is there any way to control how much such an operation will tax the cluster? We would be happy to have it run longer if that meant not utilizing all disks fully during that time.
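
If there is an injectable tunable for this, we could not find it. Something along these lines is what we were hoping for (osd_snap_trim_sleep is a guess on our part; we do not know whether Firefly applies it to snapshot trimming):

$ sudo ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1'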


2) Is there a way to get a decent approximation of how much work deleting a specific snapshot will entail (in terms of objects, time, whatever)?


3) Would SSD journals help here? Or any other hardware configuration change, for that matter?


4) Any other recommendations? We definitely need to remove the data - not because of a lack of space (at least not at the moment), but because when customers delete data or cancel accounts, we are obliged to remove their data within a reasonable amount of time.


Cheers,

Daniel

