On Wed, Feb 19, 2020 at 2:47 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
On Wed, 19 Feb 2020, Xiaoxi Chen wrote:
> Hi List,
> We are using RBD snapshots as a timely backup for DBs: 24 hourly
> snapshots + 30 daily snapshots are kept for each RBD image. It worked
> perfectly at the beginning, but as the number of volumes grew, more and
> more significant pitfalls appeared. We are at ~700 volumes, which means
> we create 700 snapshots and rotate out 700 old snapshots every hour.
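>
> Just to put the scale in one place, a quick back-of-envelope (plain
> arithmetic on the numbers above, written as a tiny standalone C++ snippet):
>
> // Back-of-envelope for the snapshot schedule described above.
> #include <cstdio>
>
> int main() {
>   const long volumes = 700;      // RBD images
>   const long hourly_kept = 24;   // hourly snapshots retained per image
>   const long daily_kept = 30;    // daily snapshots retained per image
>
>   printf("steady-state snapshots: %ld\n", volumes * (hourly_kept + daily_kept));
>   printf("creates per hour: %ld, removals per hour: %ld\n", volumes, volumes);
>   return 0;
> }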
>
> 1. Huge and frequent OSDMap update
>
> The OSDMap is ~640K in size, with a long and scattered
> "removed_snaps" field. The holes in the removed_snaps interval set come
> from two sources:
>
> - In our use case we keep daily snapshots for longer, which ends up
> leaving a hole in the removed_snaps interval set for each daily snapshot.
> -
> https://github.com/ceph/ceph/blob/v14.2.4/src/osd/osd_types.cc#L1583-L1586
> adds a new snapid for each snapshot removal; according to the comment, the
> new snapid is intended to keep the interval_set contiguous. However, I
> cannot understand how it works; it seems to me this behavior creates more
> holes when creates and deletes interleave with each other (see the toy
> sketch below this list).
> - After processing 4 or 5 versions of the map, the RocksDB write-ahead log
> (WAL) is full and the corresponding memtable has to be flushed to disk.
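>
> To illustrate the interleaving concern, here is a toy simulation (this is
> not Ceph's interval_set or snapid-allocation code, just a simplified model
> of our rotate-hourly/keep-daily pattern) showing how each retained daily
> snapid leaves a hole in the removed range:
>
> // Toy model: snapids are allocated monotonically; removing a snapshot adds
> // its id to a "removed" set. Because daily snapshots stay live in between,
> // the removed set stays fragmented into many intervals.
> #include <cstdio>
> #include <set>
> #include <vector>
>
> // Count contiguous runs ("intervals") in a sorted set of ids.
> static int count_intervals(const std::set<unsigned>& s) {
>   int runs = 0;
>   unsigned prev = 0;
>   bool first = true;
>   for (unsigned id : s) {
>     if (first || id != prev + 1) ++runs;
>     prev = id;
>     first = false;
>   }
>   return runs;
> }
>
> int main() {
>   std::set<unsigned> removed;           // model of the removed_snaps set
>   std::vector<unsigned> hourly;         // ids of live hourly snapshots
>   unsigned next_snapid = 1;
>
>   // Every hour: allocate a new snapid; keep every 24th snapshot as a
>   // "daily" (never rotated here), rotate out the oldest hourly beyond 24.
>   for (int hour = 0; hour < 24 * 30; ++hour) {
>     unsigned id = next_snapid++;
>     if (hour % 24 != 0)
>       hourly.push_back(id);             // hourly snapshot
>     if (hourly.size() > 24) {
>       removed.insert(hourly.front());   // delete the oldest hourly
>       hourly.erase(hourly.begin());
>     }
>   }
>   // Each retained daily id leaves a hole in the removed range.
>   printf("removed ids: %zu, intervals in removed set: %d\n",
>          removed.size(), count_intervals(removed));
>   return 0;
> }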
What version are you running? The removed_snaps code was reworked in
octopus. You should only see recently deleted snaps in the OSDMap.
We are running Nautilus.
> 2. pg_stat updates burn out the MGR
>
> Starting from Mimic, each PG by default reports up to 500
> (osd_max_snap_prune_intervals_per_epoch) purged-snapshot intervals to the
> MGR, which significantly inflates the size of pg_stat and causes the MGR
> to use 20GB+ of memory and 260%+ CPU (mostly on messenger threads and the
> MGR_FIN thread) and to become very unresponsive. Reducing
> osd_max_snap_prune_intervals_per_epoch to 10 fixed the issue in our env.
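>
> A very rough back-of-envelope for why this hurts the MGR; the PG count and
> per-interval size below are assumptions for illustration, not measured
> values:
>
> // Rough estimate of the purged-snaps payload carried in pg_stat reports.
> #include <cstdio>
>
> int main() {
>   const long pgs = 16384;              // hypothetical PG count, adjust for your cluster
>   const long intervals_per_pg = 500;   // osd_max_snap_prune_intervals_per_epoch default
>   const long bytes_per_interval = 16;  // assumed: start snapid + length, 8 bytes each
>
>   printf("purged-snaps payload per stats round: ~%ld MB\n",
>          (pgs * intervals_per_pg * bytes_per_interval) >> 20);
>   printf("same payload with the option set to 10: ~%ld KB\n",
>          (pgs * 10 * bytes_per_interval) >> 10);
>   return 0;
> }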
>
> 3. SnapTrim IO overhead
>
> Though there are tuning knobs to control the speed of snaptrim, it still
> needs to keep up with the snapshot creation speed. What is more, snaptrim
> introduces huge write amplification in the RocksDB WAL, maybe due to the
> 4K alignment in the WAL. We observed 156GB of WAL written while trimming
> 100 snapshots, yet the generated L0 is only 4.63GB, which seems related to
> WAL page-alignment amplification. The PG purges snapshots from its
> snap_trimq one by one; we are thinking that if several purged snapshots
> for a given volume could be compacted and trimmed together, perhaps we
> could get better efficiency (we would only need to change the SnapSet for
> a given object once).
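>
> Some arithmetic on the numbers above plus the batching idea, as a small
> sketch (the average record size and batch size are made-up illustration
> values; only the 156GB / 4.63GB figures are measured):
>
> // Arithmetic on the observed snaptrim WAL amplification, plus a rough
> // model of why batching trims per object could help.
> #include <cstdio>
>
> int main() {
>   // Observed while trimming 100 snapshots:
>   const double wal_written_gb = 156.0;
>   const double l0_generated_gb = 4.63;
>   printf("observed WAL amplification: ~%.1fx\n",
>          wal_written_gb / l0_generated_gb);
>
>   // If each small trim transaction is padded to a 4K WAL block (our guess),
>   // the padding alone gives a large factor for small records:
>   const double wal_block_bytes = 4096.0;
>   const double avg_record_bytes = 256.0;   // hypothetical average record size
>   printf("padding factor at %.0f-byte records: ~%.0fx\n",
>          avg_record_bytes, wal_block_bytes / avg_record_bytes);
>
>   // Batching idea: trimming k purged snaps of the same object in one
>   // transaction rewrites that object's SnapSet once instead of k times.
>   const int k = 10;                        // hypothetical batch size
>   printf("SnapSet rewrites per object: %d unbatched vs 1 batched\n", k);
>   return 0;
> }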
>
> 4. Deep-scrub on objects with hundreds of snapshots is super slow,
> causing osd_op_w_latency to surge up 10x in our env; we have not yet dug
> into this.
>
> 5. How does cache tiering work with snapshots? Does a cache tier help
> with write performance in this case?
>
>
> There are several outstanding PRs, like
> https://github.com/ceph/ceph/pull/28330, to optimize snaptrim and
> especially to get rid of removed_snaps. We believe they will partly help
> with #1, but we are not sure how much they help the others. As this is a
> production environment, upgrading to an Octopus RC is not feasible at the
> moment; we will try it out once a stable release is available.
>
>
> -Xiaoxi
>
--
Regards
Huang Zhiteng
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx