Hi everyone,

With the Luminous release out the door and the Labor Day weekend over, I hope I can kick off a discussion on another issue that has irked me a bit for quite a while. There doesn't seem to be a good documented answer to this: what are Ceph's real limits when it comes to RBD snapshots?

For most people, any RBD image will have perhaps a single-digit number of snapshots. For example, in an OpenStack environment we typically have one snapshot per Glance image, a few snapshots per Cinder volume, and perhaps a few snapshots per ephemeral Nova disk (unless clones are configured to flatten immediately). Ceph generally performs well under those circumstances.

However, things sometimes start getting problematic when RBD snapshots are generated frequently and in an automated fashion. I've seen Ceph operators configure snapshots on a daily or even hourly basis, typically when using snapshots as a backup strategy (one that promises very short RTO and RPO). In combination with thousands or maybe tens of thousands of RBDs, that's a lot of snapshots. And in such scenarios (and only in those), users have been bitten by a few nasty bugs in the past. Here's an example where the OSD snap trim queue went berserk when large numbers of snapshots were deleted:

http://tracker.ceph.com/issues/9487
https://www.spinics.net/lists/ceph-devel/msg20470.html

It seems to me that there still isn't a good recommendation along the lines of "try not to have more than X snapshots per RBD image" or "try not to have more than Y snapshots in the cluster overall". Or is the "correct" recommendation actually "create as many snapshots as you might possibly want; none of that is allowed to cause any instability or performance degradation, and if it does, that's a bug"?

Looking forward to your thoughts. Thanks in advance!

Cheers,
Florian
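P.S. For concreteness, the kind of automation I have in mind looks roughly like the sketch below: snapshot every image in a pool on a timer and keep only the newest N per image. It uses the python-rbd bindings; the pool name, snapshot prefix and retention count are invented for illustration, and an operator would presumably run something like this from cron.

    import datetime
    import rados
    import rbd

    POOL = 'volumes'    # hypothetical pool name
    PREFIX = 'auto-'    # hypothetical snapshot-name prefix
    KEEP = 24           # hypothetical retention: keep the newest 24 per image

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(POOL)
        try:
            for img_name in rbd.RBD().list(ioctx):
                image = rbd.Image(ioctx, img_name)
                try:
                    # create a timestamped snapshot; this format sorts
                    # lexicographically, so name order == age order
                    stamp = datetime.datetime.utcnow().strftime('%Y%m%d-%H%M')
                    image.create_snap(PREFIX + stamp)
                    # rotate: remove everything older than the newest KEEP
                    snaps = sorted(s['name'] for s in image.list_snaps()
                                   if s['name'].startswith(PREFIX))
                    for old in snaps[:-KEEP]:
                        image.remove_snap(old)
                finally:
                    image.close()
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

Run hourly against tens of thousands of images, that's tens of thousands of snapshot creations AND deletions per hour, and the deletions all land on the OSDs' snap trim queues -- exactly the regime where the bug linked above bites.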