On Wed, 19 Feb 2020, Xiaoxi Chen wrote:
> Hi List,
>
> We are using RBD snapshots as timely backups for DBs; 24 hourly
> snapshots + 30 daily snapshots are taken for each RBD. It worked
> perfectly at the beginning, however as the number of volumes grew,
> more and more significant pitfalls appeared. We are at ~700 volumes,
> which means creating 700 snapshots and rotating 700 snapshots every
> hour.
>
> 1. Huge and frequent OSDMap updates
>
> The OSDMap is ~640 KB in size, with a long and scattered
> "removed_snaps" set. The holes in the removed_snaps interval set come
> from two sources:
>
> - In our use case we keep daily snapshots for longer, which leaves a
>   hole in the removed_snaps interval set for each retained daily
>   snapshot.
> - https://github.com/ceph/ceph/blob/v14.2.4/src/osd/osd_types.cc#L1583-L1586
>   adds a new snapid for each snapshot removal; according to the
>   comment, the new snapid is intended to keep the interval_set
>   contiguous. However, I cannot understand how that works; it seems
>   to me that this behavior creates more holes when creates and
>   deletes interleave with each other.
> - After processing 4 or 5 versions of the map, the RocksDB
>   write-ahead log (WAL) is full and the corresponding memtable has to
>   be flushed to disk.

What version are you running?  The removed_snaps code was reworked in
Octopus.  You should only see recently deleted snaps in the OSDMap.
(A rough sketch of how the interleaving and the longer daily retention
leave those holes is appended at the end of this mail.)

> 2. pgstat updates burn out the MGR
>
> Starting from Mimic, each PG by default reports up to 500
> (osd_max_snap_prune_intervals_per_epoch) purged-snapshot intervals to
> the MGR, which significantly inflates the size of pg_stat and causes
> the MGR to use 20GB+ of memory and 260%+ CPU (mostly on messenger
> threads and the MGR_FIN thread) and to become very unresponsive.
> Reducing osd_max_snap_prune_intervals_per_epoch to 10 fixed the issue
> in our environment.
>
> 3. SnapTrim IO overhead
>
> Although there are tuning knobs to control the speed of snaptrim, it
> still needs to keep up with the snapshot creation rate. What is more,
> snaptrim introduces huge amplification in the RocksDB WAL, maybe due
> to 4K alignment in the WAL. We observed 156 GB of WAL written while
> trimming 100 snapshots, whereas the generated L0 was only 4.63 GB,
> which seems related to WAL page-alignment amplification. The PG
> purges snapshots from snaptrim_q one by one; we are wondering whether
> several purged snapshots for a given volume could be compacted and
> trimmed together, which might give better efficiency (we would only
> need to change the SnapSet for a given object once).
>
> 4. Deep-scrub on objects with hundreds of snapshots is super slow,
> and osd_op_w_latency surged 10x in our env as a result; not yet deep
> dived.
>
> 5. How does cache tiering work with snapshots? Does a cache tier help
> with write performance in this case?
>
> There are several outstanding PRs, like
> https://github.com/ceph/ceph/pull/28330, to optimize snaptrim and
> especially to get rid of removed_snaps. We believe this will partly
> help with #1, but we are not sure how much it helps with the others.
> As this is a production environment, upgrading to an Octopus RC is
> not an option at the moment; we will try it out once a stable release
> is available.
>
> -Xiaoxi
>
> _______________________________________________
> Dev mailing list -- dev@xxxxxxx
> To unsubscribe send an email to dev-leave@xxxxxxx
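
On the hole formation in point 1: below is a rough, standalone sketch in
plain Python (not Ceph code; the interval-collapsing helper and the
retention loop are purely illustrative, only the 24-hourly / 30-daily /
~700-volume numbers are taken from your mail). Under those assumptions,
every still-retained daily batch of snapids sits in the middle of
otherwise contiguous deleted snapids, so each longer retention tier
punches another hole into the deleted-snapid space:

#!/usr/bin/env python3
# Rough illustration only, NOT Ceph code.  It mimics the retention policy
# described above (hourly snapshots kept 24h, daily snapshots kept 30 days,
# ~700 volumes snapshotted each hour) with snapids handed out from a single
# monotonic counter, and shows how the still-retained daily snapids break
# the deleted-snapid space into many intervals.

def to_intervals(snapids):
    """Collapse a collection of snapids into a sorted list of [start, end] runs."""
    runs = []
    for s in sorted(snapids):
        if runs and s == runs[-1][1] + 1:
            runs[-1][1] = s          # extend the current contiguous run
        else:
            runs.append([s, s])      # start a new run, i.e. a hole precedes it
    return runs

VOLUMES = 700
HOURS = 24 * 35                      # simulate ~5 weeks of snapshotting

next_snapid = 1
live = []                            # (kind, hour_created, tuple_of_snapids)
removed = set()

for hour in range(HOURS):
    batch = tuple(range(next_snapid, next_snapid + VOLUMES))
    next_snapid += VOLUMES
    kind = "daily" if hour % 24 == 0 else "hourly"
    live.append((kind, hour, batch))

    # Rotation: hourlies are deleted after 24 hours, dailies after 30 days.
    still_live = []
    for k, created, snaps in live:
        age = hour - created
        expired = (k == "hourly" and age >= 24) or \
                  (k == "daily" and age >= 30 * 24)
        if expired:
            removed.update(snaps)    # deleted snapids accumulate here
        else:
            still_live.append((k, created, snaps))
    live = still_live

print("deleted snapids:          ", len(removed))
print("intervals in deleted set: ", len(to_intervals(removed)))

With those numbers it reports on the order of 30 intervals in the
deleted set, roughly one gap per daily batch that is still retained,
which is the "one hole per daily snapshot" effect even before
considering the extra snapid the OSD inserts on removal.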
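
And just to put the point-3 numbers side by side (the 156 GB WAL,
4.63 GB L0 and 100-snapshot figures are all taken from your mail):

# Back-of-the-envelope check on the snaptrim WAL numbers quoted above.
wal_written_gb = 156.0        # WAL written while trimming 100 snapshots
l0_generated_gb = 4.63        # resulting L0 data
snapshots_trimmed = 100

print(f"WAL / L0 amplification:   ~{wal_written_gb / l0_generated_gb:.0f}x")
print(f"WAL per trimmed snapshot: ~{wal_written_gb / snapshots_trimmed:.2f} GB")

That works out to roughly a 34x WAL-to-L0 ratio and ~1.6 GB of WAL per
trimmed snapshot, which is at least consistent with many small,
padded-out transactions; batching several trims per object as you
suggest could plausibly reduce both.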