Re: Pitfalls when using RBD Snapshot as timely backup

On Wed, 19 Feb 2020, Xiaoxi Chen wrote:
> Hi List,
>     We are using RBD snapshots as a scheduled backup for DBs: 24 hourly
> snapshots + 30 daily snapshots are kept for each RBD image.  It worked
> perfectly at the beginning, but as the number of volumes grew, more and
> more significant pitfalls showed up.  We are now at ~700 volumes, which
> means creating 700 snapshots and rotating 700 snapshots every hour.
> 
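> For reference, the rotation per volume is roughly the following (a minimal
> sketch using the python-rbd bindings; the pool name and snapshot naming are
> illustrative, not what we actually run):
> 
>     import datetime
>     import rados
>     import rbd
> 
>     HOURLY_KEEP, DAILY_KEEP = 24, 30
> 
>     cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
>     cluster.connect()
>     ioctx = cluster.open_ioctx('rbd')                # illustrative pool name
>     now = datetime.datetime.utcnow()
>     for name in rbd.RBD().list(ioctx):
>         image = rbd.Image(ioctx, name)
>         try:
>             image.create_snap('hourly-' + now.strftime('%Y%m%d%H'))
>             if now.hour == 0:
>                 image.create_snap('daily-' + now.strftime('%Y%m%d'))
>             snaps = [s['name'] for s in image.list_snaps()]
>             hourly = sorted(s for s in snaps if s.startswith('hourly-'))
>             daily = sorted(s for s in snaps if s.startswith('daily-'))
>             # every removal lands in the OSDMap's removed_snaps set
>             for old in hourly[:-HOURLY_KEEP] + daily[:-DAILY_KEEP]:
>                 image.remove_snap(old)
>         finally:
>             image.close()
>     ioctx.close()
>     cluster.shutdown()
> 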
>    1.  Huge and frequent OSDMap updates
> 
>           The OSDMap is ~640K in size, with a long and scattered
> "removed_snaps" set.  The holes in the removed_snaps interval set come
> from two sources, and the resulting map churn is expensive:
> 
>    - In our use case we keep the daily snapshots for longer, which leaves
>    a hole in the removed_snaps interval set for each retained daily
>    snapshot (see the toy model right after this list).
>    - https://github.com/ceph/ceph/blob/v14.2.4/src/osd/osd_types.cc#L1583-L1586
>    adds a new snapid for each snapshot removal; according to the comment,
>    the new snapid is intended to keep the interval_set contiguous.  However,
>    I cannot understand how that works: it looks to me like this behavior
>    creates more holes when creates and deletes interleave with each other.
>    - After processing 4 or 5 versions of the map, the RocksDB write-ahead
>    log (WAL) fills up and the corresponding memtable has to be flushed to
>    disk.
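> 
> The first kind of hole can be seen with a toy model of the removed_snaps
> set (plain Python, not the actual interval_set code linked above): every
> retained daily snapid splits the removed range.
> 
>     def to_intervals(snapids):
>         # collapse a set of snapids into contiguous (first, last) ranges
>         ivs, start, prev = [], None, None
>         for s in sorted(snapids):
>             if start is None or s != prev + 1:
>                 if start is not None:
>                     ivs.append((start, prev))
>                 start = s
>             prev = s
>         if start is not None:
>             ivs.append((start, prev))
>         return ivs
> 
>     removed = set(range(1, 25))       # day 1: hourly snapids 1..24 rotated away
>     print(to_intervals(removed))      # [(1, 24)]  (daily snapid 25 is kept)
>     removed |= set(range(26, 49))     # day 2: hourly snapids 26..48 rotated away
>     print(to_intervals(removed))      # [(1, 24), (26, 48)]  (one hole per retained daily)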

What version are you running?  The removed_snaps code was reworked in 
octopus.  You should only see recently deleted snaps in the OSDMap.
 
>         2.  pgstat updates burn out the MGR
> 
> Starting from Mimic, each PG by default reports up to 500
> (osd_max_snap_prune_intervals_per_epoch) purged-snapshot intervals to the
> MGR, which significantly inflates the size of pg_stat and causes the MGR
> to use 20GB+ of memory and 260%+ CPU (mostly in the messenger threads and
> the MGR_FIN thread) and become very unresponsive.  Reducing
> osd_max_snap_prune_intervals_per_epoch to 10 fixed the issue in our env.
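> 
> (That amounts to something like "ceph config set osd
> osd_max_snap_prune_intervals_per_epoch 10"; a rough python-rados
> equivalent, assuming the Mimic+ centralized config, is:)
> 
>     import json
>     import rados
> 
>     cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
>     cluster.connect()
>     # push the lower value into the central config for all OSDs
>     ret, outbuf, outs = cluster.mon_command(json.dumps({
>         'prefix': 'config set',
>         'who': 'osd',
>         'name': 'osd_max_snap_prune_intervals_per_epoch',
>         'value': '10',
>     }), b'')
>     assert ret == 0, outs
>     cluster.shutdown()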
> 
>         3.  SnapTrim IO overhead
> 
> There are tuning knobs to control the speed of snaptrim, but it still has
> to keep up with the snapshot creation rate.  What is more, snaptrim
> introduces huge write amplification in the RocksDB WAL, probably due to
> the 4K alignment of WAL records.  We observed 156GB of WAL written while
> trimming 100 snapshots, yet the resulting L0 was only 4.63GB (a ~30x gap),
> which seems to come from that page-alignment amplification.  A PG purges
> snapshots from snaptrim_q one by one; we are wondering whether several
> purged snapshots for a given volume could be batched and trimmed together
> for better efficiency (we would only need to change the snapset of a given
> object once).  A toy model of the idea follows.
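> 
> A toy model of the batching idea (plain Python, not OSD code), just to show
> why it should cut the per-object snapset rewrites:
> 
>     def trim_one_by_one(snapset, purged):
>         # today: one pass per snapid in snaptrim_q, one snapset rewrite each
>         writes = 0
>         for snapid in purged:
>             if snapid in snapset:
>                 snapset.discard(snapid)
>                 writes += 1      # each rewrite becomes a 4K-aligned WAL record
>         return writes
> 
>     def trim_batched(snapset, purged):
>         # proposed: drop the whole batch of purged snapids in one rewrite
>         overlap = snapset & set(purged)
>         snapset -= overlap
>         return 1 if overlap else 0
> 
>     obj_snaps = set(range(1, 101))                         # object with 100 snapshots
>     print(trim_one_by_one(set(obj_snaps), range(1, 101)))  # 100 rewrites
>     print(trim_batched(set(obj_snaps), range(1, 101)))     # 1 rewrite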
> 
> 4. Deep-scrub on objects with hundreds of snapshots is super slow; as a
> result, osd_op_w_latency surged 10x in our env.  We have not dug into this
> yet.
> 
> 5.  How does cache tiering work with snapshots?  Does a cache tier help
> with write performance in this case?
> 
> 
>       There are several outstanding PRs, like
> https://github.com/ceph/ceph/pull/28330, that optimize snaptrim and in
> particular get rid of removed_snaps.  We believe that will partly help
> with #1, but we are not sure how much it helps the others.  Since this is
> a production environment, upgrading to an Octopus RC is not an option at
> the moment; we will try it out once a stable release is out.
> 
> 
> -Xiaoxi
> 
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


