rgw: design proposal for 'bucket stats --reset-stats'

# motivation

historically, rgw has had several bugs that led to inconsistencies
with its 'bucket stats'. currently, the only way to rectify these
inconsistencies is the 'radosgw-admin bucket reshard' command, because
the act of resharding rebuilds these stats from scratch in each new
bucket index shard

but because this relies on bucket resharding, it can't currently be
used in multisite configurations. and even once multisite does support
resharding, the act of resharding still requires radosgw to block
writes during the process. i think we can do better with a targeted
command like 'radosgw-admin bucket stats --reset-stats' to match our
existing 'radosgw-admin user stats --reset-stats'

in https://github.com/ceph/ceph/pull/23586, Orit pursued an earlier
'offline' design which required the shutdown of all radosgws in order
to rebuild a consistent view of the stats. this work was never
completed, and 'radosgw-admin bucket reshard' was used instead as a
workaround

# requirements

* reconciles the 'bucket stats' with a full listing of the bucket
* does not require bucket reshard
* does not require clients to stop i/o
* limits the number of bucket index entries per osd op to
'osd_max_omap_entries_per_request'
* prevents racing reset-stats commands from corrupting the stats

# design

the stats of each bucket index shard object are stored separately by
cls_rgw in 'struct rgw_bucket_dir_header'. within each index shard, we
also track stats per category in member variable
'std::map<RGWObjCategory, rgw_bucket_category_stats> stats'. these
stats are updated by cls_rgw as bucket index transactions complete.
the 'radosgw-admin bucket stats' command reads the stats from each
index shard, and sums them up for display
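as a rough sketch, the per-shard stats and the summation could look like the following. these structs are simplified stand-ins for the real types in src/cls/rgw/cls_rgw_types.h (which carry more fields and encoding), not their actual definitions

```cpp
#include <cstdint>
#include <map>
#include <vector>

// simplified stand-ins for cls_rgw's real types (assumption: the
// actual structs carry more fields, plus encode/decode logic)
enum class RGWObjCategory : uint8_t { None, Main, Shadow, MultiMeta };

struct rgw_bucket_category_stats {
  uint64_t total_size = 0;
  uint64_t num_entries = 0;
};

struct rgw_bucket_dir_header {
  // per-category stats for one bucket index shard
  std::map<RGWObjCategory, rgw_bucket_category_stats> stats;
};

// 'bucket stats' reads every shard's header and sums the
// per-category stats for display
std::map<RGWObjCategory, rgw_bucket_category_stats>
sum_shard_stats(const std::vector<rgw_bucket_dir_header>& shards)
{
  std::map<RGWObjCategory, rgw_bucket_category_stats> totals;
  for (const auto& shard : shards) {
    for (const auto& [category, s] : shard.stats) {
      auto& t = totals[category];
      t.total_size += s.total_size;
      t.num_entries += s.num_entries;
    }
  }
  return totals;
}
```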

i propose a new 'bucket stats --reset-stats' command that makes
consecutive calls to a new cls_rgw_recalc_stats() op on each index
shard, eventually listing all of the bucket's index entries,
accumulating their stats in a temporary map, then committing those
updated stats once the listing reaches the end
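the command-side loop over a single shard might look like the sketch below. the op signature here is an assumption for illustration (the real call would go through a librados exec), and all names are hypothetical

```cpp
#include <functional>
#include <optional>
#include <string>

// hypothetical signature for one round trip of the proposed op: takes
// the resume marker, returns the next marker, or nullopt once the
// shard's listing is complete and the stats were committed
using RecalcOp =
    std::function<std::optional<std::string>(const std::string& marker)>;

// drive cls_rgw_recalc_stats() to completion on one index shard
// (sketch of the radosgw-admin side; error handling omitted).
// returns the number of osd ops it took
int reset_shard_stats(const RecalcOp& op)
{
  std::string marker; // an empty marker starts a fresh recalculation
  int calls = 0;
  for (;;) {
    auto next = op(marker);
    ++calls;
    if (!next) {
      return calls; // the final op overwrote 'stats' with 'recalc_stats'
    }
    marker = *next; // resume the listing where the last op stopped
  }
}
```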

to support other writes to the bucket index during this process, the
temporary map of stats is stored inside 'struct rgw_bucket_dir_header'
as 'std::map<RGWObjCategory, rgw_bucket_category_stats> recalc_stats',
so that bucket index transactions are able to update both the 'stats'
and 'recalc_stats'. these updates to 'recalc_stats' would be
conditional on the current position of the 'recalc_marker' - if the
index entry's key is less than 'recalc_marker', then
cls_rgw_recalc_stats() already missed this entry and we need to
account for it in 'recalc_stats'. otherwise, cls_rgw_recalc_stats()
will see this entry later in its listing and account for it then
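that conditional accounting rule could be sketched as follows. the field names 'recalc_stats'/'recalc_marker' come from the proposal, while the delta arguments and single-category simplification are assumptions for illustration

```cpp
#include <cstdint>
#include <string>

// simplified stand-ins for the proposed header fields (assumption:
// the real rgw_bucket_dir_header keeps per-category maps and more state)
struct category_stats {
  int64_t total_size = 0;
  int64_t num_entries = 0;
};

struct dir_header {
  category_stats stats;         // live stats, always updated
  category_stats recalc_stats;  // accumulator for the in-flight recalc
  std::string recalc_marker;    // empty means nothing listed yet
};

// called as a bucket index transaction completes (sketch): fold the
// entry's delta into 'stats', and into 'recalc_stats' only when the
// recalc listing has already passed this key and will never see it
void account_index_update(dir_header& h, const std::string& key,
                          int64_t size_delta, int64_t entry_delta)
{
  h.stats.total_size += size_delta;
  h.stats.num_entries += entry_delta;

  const bool recalc_in_progress = !h.recalc_marker.empty();
  if (recalc_in_progress && key < h.recalc_marker) {
    // cls_rgw_recalc_stats() already listed past this key, so account
    // for the update here; keys >= recalc_marker will be counted when
    // the listing reaches them
    h.recalc_stats.total_size += size_delta;
    h.recalc_stats.num_entries += entry_delta;
  }
}
```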

the new cls_rgw operation 'cls_rgw_recalc_stats()' implements the
logic for a single osd op. this takes as input the marker position to
resume its listing, and returns this updated marker position as output
(relying on LIBRADOS_OPERATION_RETURNVEC since this is a write
operation). the op itself just lists ~1000 omap keys, accumulates
their stats in 'recalc_stats', then writes the updated 'recalc_stats'
and 'recalc_marker' position to 'struct rgw_bucket_dir_header'. once
cls_rgw_recalc_stats() reaches the end of the listing, it can
overwrite 'stats' with 'recalc_stats', and clear
'recalc_stats'/'recalc_marker'
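one execution of the op could be sketched over an in-memory shard like this. the entry type, chunk handling, and field names are assumptions standing in for the real omap listing and encoded header

```cpp
#include <cstdint>
#include <map>
#include <string>

struct category_stats {
  uint64_t total_size = 0;
  uint64_t num_entries = 0;
};

// simplified model of one bucket index shard: an ordered omap of
// entry sizes plus the proposed header fields (names are assumptions)
struct shard {
  std::map<std::string, uint64_t> omap;  // key -> object size
  category_stats stats;
  category_stats recalc_stats;
  std::string recalc_marker;
};

// one cls_rgw_recalc_stats() execution (sketch): list up to
// 'max_entries' keys after 'marker', accumulate into 'recalc_stats',
// and on reaching the end commit the result into 'stats'.
// returns true and sets 'out_marker' when another call is needed
bool recalc_stats_op(shard& s, const std::string& marker,
                     size_t max_entries, std::string& out_marker)
{
  auto it = s.omap.upper_bound(marker);  // resume after the marker
  size_t listed = 0;
  while (it != s.omap.end() && listed < max_entries) {
    s.recalc_stats.total_size += it->second;
    s.recalc_stats.num_entries += 1;
    s.recalc_marker = it->first;  // remember how far we've listed
    ++it;
    ++listed;
  }
  if (it == s.omap.end()) {
    // end of listing: overwrite the live stats and clear the
    // temporary recalc state
    s.stats = s.recalc_stats;
    s.recalc_stats = {};
    s.recalc_marker.clear();
    return false;
  }
  out_marker = s.recalc_marker;
  return true;
}
```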

to handle racing invocations of the 'bucket stats --reset-stats'
command, cls_rgw_recalc_stats() requests with an empty marker will
always succeed and start with a fresh listing. but when resuming with
a non-empty marker, cls_rgw_recalc_stats() will compare that marker
against the stored 'recalc_marker', and return -ECANCELED if they
don't match to indicate a racing write. the end result is that new
invocations of 'bucket stats --reset-stats' will cancel any previous
invocations
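the race check itself could be sketched as a hypothetical helper like this (the real op would run it before resuming its listing)

```cpp
#include <cerrno>
#include <string>

// proposed header state relevant to the race check (names follow the
// proposal; this is a sketch, not the actual cls_rgw code)
struct recalc_state {
  std::string recalc_marker;
  bool dirty = false;  // stand-in for accumulated recalc_stats
};

// validate the caller's resume marker: an empty marker always wins
// and restarts the recalculation, while a stale non-empty marker
// means another invocation reset or advanced the listing
int validate_recalc_marker(recalc_state& s, const std::string& in_marker)
{
  if (in_marker.empty()) {
    s.recalc_marker.clear();  // cancel any previous invocation
    s.dirty = false;
    return 0;
  }
  if (in_marker != s.recalc_marker) {
    return -ECANCELED;  // racing 'bucket stats --reset-stats' detected
  }
  return 0;
}
```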

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


