On Wed, Feb 19, 2020 at 2:47 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
On Wed, 19 Feb 2020, Xiaoxi Chen wrote:
> Hi List,
> We are using RBD snapshots as a timely backup for DBs: 24 hourly
> snapshots + 30 daily snapshots are kept for each RBD image. It worked
> perfectly at the beginning, but as the number of volumes grew, more and
> more significant pitfalls appeared. We are at ~700 volumes, which means
> we create 700 snapshots and rotate out 700 old snapshots every hour.
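>
> Just to put the scale in one place, a quick back-of-envelope (plain
> arithmetic on the numbers above, written as a tiny standalone C++ snippet):
>
> // Back-of-envelope for the snapshot schedule described above.
> #include <cstdio>
>
> int main() {
>   const long volumes = 700;      // RBD images
>   const long hourly_kept = 24;   // hourly snapshots retained per image
>   const long daily_kept = 30;    // daily snapshots retained per image
>
>   printf("steady-state snapshots: %ld\n", volumes * (hourly_kept + daily_kept));
>   printf("creates per hour: %ld, removals per hour: %ld\n", volumes, volumes);
>   return 0;
> }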
>
> 1. Huge and frequent OSDMap update
>
> The OSDMap is ~640K in size, with a long and scattered
> "removed_snaps" field. The holes in the removed_snaps interval set come
> from two sources:
>
> - In our use case we keep daily snapshots for longer, which ends up
> leaving a hole in the removed_snaps interval set for each daily snapshot.
> -
> https://github.com/ceph/ceph/blob/v14.2.4/src/osd/osd_types.cc#L1583-L1586
> adds a new snapid for each snapshot removal; according to the comment, the
> new snapid is intended to keep the interval_set contiguous. However, I
> cannot understand how it works; it seems to me this behavior creates more
> holes when creates and deletes interleave with each other (see the toy
> sketch below this list).
> - After processing 4 or 5 versions of the map, the RocksDB write-ahead log
> (WAL) is full and the corresponding memtable has to be flushed to disk.
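>
> To illustrate the interleaving concern, here is a toy simulation (this is
> not Ceph's interval_set or snapid-allocation code, just a simplified model
> of our rotate-hourly/keep-daily pattern) showing how each retained daily
> snapid leaves a hole in the removed range:
>
> // Toy model: snapids are allocated monotonically; removing a snapshot adds
> // its id to a "removed" set. Because daily snapshots stay live in between,
> // the removed set stays fragmented into many intervals.
> #include <cstdio>
> #include <set>
> #include <vector>
>
> // Count contiguous runs ("intervals") in a sorted set of ids.
> static int count_intervals(const std::set<unsigned>& s) {
>   int runs = 0;
>   unsigned prev = 0;
>   bool first = true;
>   for (unsigned id : s) {
>     if (first || id != prev + 1) ++runs;
>     prev = id;
>     first = false;
>   }
>   return runs;
> }
>
> int main() {
>   std::set<unsigned> removed;           // model of the removed_snaps set
>   std::vector<unsigned> hourly;         // ids of live hourly snapshots
>   unsigned next_snapid = 1;
>
>   // Every hour: allocate a new snapid; keep every 24th snapshot as a
>   // "daily" (never rotated here), rotate out the oldest hourly beyond 24.
>   for (int hour = 0; hour < 24 * 30; ++hour) {
>     unsigned id = next_snapid++;
>     if (hour % 24 != 0)
>       hourly.push_back(id);             // hourly snapshot
>     if (hourly.size() > 24) {
>       removed.insert(hourly.front());   // delete the oldest hourly
>       hourly.erase(hourly.begin());
>     }
>   }
>   // Each retained daily id leaves a hole in the removed range.
>   printf("removed ids: %zu, intervals in removed set: %d\n",
>          removed.size(), count_intervals(removed));
>   return 0;
> }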
What version are you running? The removed_snaps code was reworked in
octopus. You should only see recently deleted snaps in the OSDMap.
We are running Nautilus.
> 2. pg_stat updates burn out the MGR
>
> Starting from Mimic, each PG by default reports up to 500
> (osd_max_snap_prune_intervals_per_epoch) purged-snapshot intervals to the
> MGR, which significantly inflates the size of pg_stat and causes the MGR
> to use 20GB+ of memory and 260%+ CPU (mostly on messenger threads and the
> MGR_FIN thread) and to become very unresponsive. Reducing
> osd_max_snap_prune_intervals_per_epoch to 10 fixed the issue in our env.
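>
> A very rough back-of-envelope for why this hurts the MGR; the PG count and
> per-interval size below are assumptions for illustration, not measured
> values:
>
> // Rough estimate of the purged-snaps payload carried in pg_stat reports.
> #include <cstdio>
>
> int main() {
>   const long pgs = 16384;              // hypothetical PG count, adjust for your cluster
>   const long intervals_per_pg = 500;   // osd_max_snap_prune_intervals_per_epoch default
>   const long bytes_per_interval = 16;  // assumed: start snapid + length, 8 bytes each
>
>   printf("purged-snaps payload per stats round: ~%ld MB\n",
>          (pgs * intervals_per_pg * bytes_per_interval) >> 20);
>   printf("same payload with the option set to 10: ~%ld KB\n",
>          (pgs * 10 * bytes_per_interval) >> 10);
>   return 0;
> }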
>
> 3. SnapTrim IO overhead
>
> Though there are tuning knobs to control the speed of snaptrim, it still
> needs to keep up with the snapshot creation speed. What is more, snaptrim
> introduces huge write amplification in the RocksDB WAL, maybe due to the
> 4K alignment in the WAL. We observed 156GB of WAL written while trimming
> 100 snapshots, yet the generated L0 is only 4.63GB, which seems related to
> WAL page-alignment amplification. The PG purges snapshots from its
> snap_trimq one by one; we are thinking that if several purged snapshots
> for a given volume could be compacted and trimmed together, perhaps we
> could get better efficiency (we would only need to change the SnapSet for
> a given object once).
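>
> Some arithmetic on the numbers above plus the batching idea, as a small
> sketch (the average record size and batch size are made-up illustration
> values; only the 156GB / 4.63GB figures are measured):
>
> // Arithmetic on the observed snaptrim WAL amplification, plus a rough
> // model of why batching trims per object could help.
> #include <cstdio>
>
> int main() {
>   // Observed while trimming 100 snapshots:
>   const double wal_written_gb = 156.0;
>   const double l0_generated_gb = 4.63;
>   printf("observed WAL amplification: ~%.1fx\n",
>          wal_written_gb / l0_generated_gb);
>
>   // If each small trim transaction is padded to a 4K WAL block (our guess),
>   // the padding alone gives a large factor for small records:
>   const double wal_block_bytes = 4096.0;
>   const double avg_record_bytes = 256.0;   // hypothetical average record size
>   printf("padding factor at %.0f-byte records: ~%.0fx\n",
>          avg_record_bytes, wal_block_bytes / avg_record_bytes);
>
>   // Batching idea: trimming k purged snaps of the same object in one
>   // transaction rewrites that object's SnapSet once instead of k times.
>   const int k = 10;                        // hypothetical batch size
>   printf("SnapSet rewrites per object: %d unbatched vs 1 batched\n", k);
>   return 0;
> }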
>
> 4. Deep-scrub on objects with hundreds of snapshots is super slow,
> causing osd_op_w_latency to surge up 10x in our env; we have not yet dug
> into this.
>
> 5. How does cache tiering work with snapshots? Does a cache tier help
> with write performance in this case?
>
>
> There are several outstanding PRs, like
> https://github.com/ceph/ceph/pull/28330, to optimize snaptrim and
> especially to get rid of removed_snaps. We believe they will partly help
> with #1, but we are not sure how much they help the others. As this is a
> production environment, upgrading to an Octopus RC is not feasible at the
> moment; we will try it out once a stable release is available.
>
>
> -Xiaoxi
>
--
Regards
Huang Zhiteng
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx