This looks like a solid algorithm to accomplish the intended task and
respect the various constraints imposed. Very nice!!

Eric

> On Nov 4, 2021, at 3:31 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>
> # motivation
>
> historically, rgw has had several bugs that led to inconsistencies in
> its 'bucket stats'. currently, the only way to rectify these
> inconsistencies is the 'radosgw-admin bucket reshard' command, because
> the act of resharding rebuilds the stats from scratch in each new
> bucket index shard.
>
> but because this relies on bucket resharding, it can't currently be
> used in multisite configurations. and even once multisite does support
> resharding, the act of resharding still requires radosgw to block
> writes during the process. i think we can do better with a targeted
> command like 'radosgw-admin bucket stats --reset-stats', to match our
> existing 'radosgw-admin user stats --reset-stats'.
>
> in https://github.com/ceph/ceph/pull/23586, Orit pursued an earlier
> 'offline' design that required the shutdown of all radosgws in order
> to rebuild a consistent view of the stats. that work was never
> completed, and 'radosgw-admin bucket reshard' was used instead as a
> workaround.
>
> # requirements
>
> * reconciles the 'bucket stats' with a full listing of the bucket
> * does not require bucket reshard
> * does not require clients to stop i/o
> * limits the number of bucket index entries per osd op to
>   'osd_max_omap_entries_per_request'
> * prevents racing reset-stats commands from corrupting the stats
>
> # design
>
> the stats of each bucket index shard object are stored separately by
> cls_rgw in 'struct rgw_bucket_dir_header'. within each index shard, we
> also track stats per category in the member variable
> 'std::map<RGWObjCategory, rgw_bucket_category_stats> stats'. these
> stats are updated by cls_rgw as bucket index transactions complete.
> the 'radosgw-admin bucket stats' command reads the stats from each
> index shard and sums them for display.
>
> i propose a new 'bucket stats --reset-stats' command that makes
> consecutive calls to a new cls_rgw_recalc_stats() op to incrementally
> list all of the bucket's index entries, accumulate their stats in a
> temporary map, and commit those recalculated stats once the listing
> reaches the end.
>
> to support other writes to the bucket index during this process, the
> temporary map of stats is stored inside 'struct rgw_bucket_dir_header'
> as 'std::map<RGWObjCategory, rgw_bucket_category_stats> recalc_stats',
> so that bucket index transactions are able to update both 'stats' and
> 'recalc_stats'. these updates to 'recalc_stats' are conditional on the
> current position of 'recalc_marker': if the index entry's key is less
> than 'recalc_marker', then cls_rgw_recalc_stats() has already listed
> past this entry, so the transaction must also be accounted for in
> 'recalc_stats'. otherwise, cls_rgw_recalc_stats() will see this entry
> later in its listing and account for it then.
>
> the new cls_rgw operation cls_rgw_recalc_stats() implements the logic
> for a single osd op. it takes as input the marker position at which to
> resume its listing, and returns the updated marker position as output
> (relying on LIBRADOS_OPERATION_RETURNVEC, since this is a write
> operation). the op itself just lists ~1000 omap keys, accumulates
> their stats in 'recalc_stats', then writes the updated 'recalc_stats'
> and 'recalc_marker' position to 'struct rgw_bucket_dir_header'.
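To make the bookkeeping concrete, here is a rough model of the single-op
step and of the concurrent-write accounting, including the end-of-listing
commit described just below, in plain self-contained C++. The type names
mirror cls_rgw, but the fields are simplified stand-ins, the in-memory
map stands in for the shard's omap, and the helper names
(recalc_stats_step, account_transaction) are hypothetical; this is a
sketch of the logic, not the actual cls implementation:

    #include <cstdint>
    #include <map>
    #include <string>

    enum class RGWObjCategory : uint8_t { Main, Shadow, MultiMeta };

    struct rgw_bucket_category_stats {   // simplified stand-in
      uint64_t total_size = 0;
      uint64_t num_entries = 0;
    };

    struct rgw_bucket_dir_header {       // simplified stand-in
      std::map<RGWObjCategory, rgw_bucket_category_stats> stats;
      // temporary accumulator and resume position for the recalc
      std::map<RGWObjCategory, rgw_bucket_category_stats> recalc_stats;
      std::string recalc_marker;         // empty = nothing listed yet
    };

    struct entry_stats {                 // one index entry's contribution
      RGWObjCategory category;
      uint64_t size;
    };

    // one cls_rgw_recalc_stats() step: list up to max_entries keys after
    // the stored marker, accumulate them into recalc_stats, and advance
    // the marker. returns true once the listing is complete and the
    // recalculated stats have been committed.
    bool recalc_stats_step(rgw_bucket_dir_header& header,
                           const std::map<std::string, entry_stats>& index,
                           size_t max_entries)
    {
      auto pos = index.upper_bound(header.recalc_marker);
      for (size_t i = 0; i < max_entries && pos != index.end(); ++i, ++pos) {
        auto& s = header.recalc_stats[pos->second.category];
        s.num_entries++;
        s.total_size += pos->second.size;
        header.recalc_marker = pos->first;
      }
      if (pos != index.end()) {
        return false;  // more to do; the caller resumes with the marker
      }
      // end of listing: overwrite 'stats' and clear the recalc state
      header.stats = std::move(header.recalc_stats);
      header.recalc_stats.clear();
      header.recalc_marker.clear();
      return true;
    }

    // a bucket index transaction completing during the recalc: always
    // update 'stats'; also update 'recalc_stats' if the listing has
    // already passed this entry's key. (a real transaction applies a
    // delta for create/overwrite/delete; a plain insert is shown for
    // brevity. since this marker records the last key listed, keys equal
    // to the marker have also been passed, hence <= rather than <.)
    void account_transaction(rgw_bucket_dir_header& header,
                             const std::string& key, const entry_stats& e)
    {
      auto& s = header.stats[e.category];
      s.num_entries++;
      s.total_size += e.size;
      if (!header.recalc_marker.empty() && key <= header.recalc_marker) {
        auto& r = header.recalc_stats[e.category];
        r.num_entries++;
        r.total_size += e.size;
      }
    }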
> once cls_rgw_recalc_stats() reaches the end of the listing, it can
> overwrite 'stats' with 'recalc_stats', then clear
> 'recalc_stats'/'recalc_marker'.
>
> to handle racing invocations of the 'bucket stats --reset-stats'
> command, cls_rgw_recalc_stats() requests with an empty marker will
> always succeed and start a fresh listing. but when resuming with a
> non-empty marker, cls_rgw_recalc_stats() will compare that marker
> against the stored 'recalc_marker', and return -ECANCELED if they
> don't match, indicating a racing invocation. the end result is that
> new invocations of 'bucket stats --reset-stats' cancel any previous
> invocations.
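And a correspondingly rough sketch of that cancellation protocol, again
with a simplified stand-in for the stored state and a hypothetical helper
name (check_recalc_marker); the marker comparison is the only part the
design actually specifies:

    #include <cerrno>
    #include <string>

    struct shard_recalc_state {
      std::string recalc_marker;  // stored in rgw_bucket_dir_header
    };

    // validate the caller's resume marker before doing any listing work.
    // returns 0 to proceed, or -ECANCELED if another invocation has
    // restarted the recalculation in the meantime.
    int check_recalc_marker(shard_recalc_state& state,
                            const std::string& marker_in)
    {
      if (marker_in.empty()) {
        // fresh invocation: take over, cancelling any recalc in progress
        // (the real op would also clear the accumulated 'recalc_stats')
        state.recalc_marker.clear();
        return 0;
      }
      if (marker_in != state.recalc_marker) {
        return -ECANCELED;  // a racing 'reset-stats' owns the recalc now
      }
      return 0;
    }

On the radosgw-admin side, each shard would then be driven in a loop:
start with an empty marker, feed each returned marker back into the next
call, and stop when the op reports completion. If a second 'reset-stats'
starts in the meantime, its empty marker resets 'recalc_marker', and the
first loop's next resume call fails with -ECANCELED, which matches the
stated goal that new invocations cancel previous ones.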