On Thu, Mar 8, 2018 at 5:03 PM, Matt Benjamin <mbenjami@xxxxxxxxxx> wrote:
> I have two lines of thought:
>
> 1. I intuit that more interaction with concrete workflows (e.g., from
> a mature backup product in a multi-tenant deployment) would be helpful
> to firm up requirements
> 1.1. it appears to me that we may be rushing to a specific design
> without a lot of input from applications
>
> 2. I would like to consider approaches which do not rely on indexed
> storage (i.e., omap); the overhead from indexed updates is currently
> a disproportionate share of RGW workload cost, and I'd love to reduce
> (and avoid increasing) it

Do you have a specific solution in mind? Greg brought up this concern
during CDM. I'd be happy to explore other solutions that make sense
and give us the needed functionality. In any case, I don't think we
should take the current cost and deficiencies of omap for granted and
just try to work around them. We should strive to fix them.

One thing to remember is that the sync module does not work in the IO
critical path (it does not even run on the same zone), does not use
the same pools, and does not necessarily run on the same Ceph cluster
as the zone(s) it tracks. Its effect on the backing rados cluster can
be isolated if it is configured correctly (and if the overhead turns
out to be an actual problem).

Yehuda

>
> Matt
>
> On Thu, Mar 8, 2018 at 7:51 PM, Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
>> This was discussed yesterday in the CDM. The need is for a log of all
>> changes that happen on a specific rgw zone (and potentially in
>> specific bucket[s]), so that this information can be used by external
>> tools. One such example would be a backup tool that would use these
>> logs to determine what objects changed and back up everything new to a
>> separate system (for redundancy).
>>
>> The original thought was to generate logs that are compatible with the
>> S3 bucket logging.
>> The S3 logging works by enabling logs on a specific
>> bucket, and by specifying a bucket to which the logs should be
>> uploaded. The logs are written to an object that is generated
>> from time to time. The logs themselves keep a list of operations
>> that happen on that specific bucket (and look pretty similar to the
>> access logs of a web server). After examining these logs we weren't
>> sure that the specific log format is really something that we should
>> pursue. We can still have a similar basic mechanism (logs that hold
>> an aggregated list of changes, and are uploaded to an object in a
>> bucket), but we can drop the specific log format (we were thinking of
>> JSON-encoding the data).
>>
>> The proposal is as follows:
>>
>> We will create a sync module that would handle all object
>> modification operations. The module will be configured with a list of
>> buckets and/or bucket prefixes for which we'd store info about newly
>> created or modified objects. The configuration will also include an S3
>> endpoint, access keys, and a bucket name (or other path config) into
>> which the logs will be stored. The logs will be stored in a new
>> object that will be created periodically.
>>
>> Implementation details:
>>
>> Whenever a sync module write operation is handled, we will store
>> information about it in a temporary sharded omap index in the backing
>> rados store.
>>
>> - Temporary index
>>
>> The index will keep information about all the recent object changes
>> in the system. One question is whether we need to keep more than one
>> entry for a single object that is overwritten within the time window.
>> One option is to generate a new entry for each write, but this is not
>> going to be of much use (other than for auditing) as the overwritten
>> data is lost at that point.
>> In the simple case where we create one
>> entry in the index per write we can just keep a single index (keyed by
>> monotonically increasing timestamp + object name + object version). If
>> we only keep a single entry per object (+ object version), then we
>> need to keep two indexes: one index by object + version, and a second
>> index by timestamp (+ object + version), so that we can remove the
>> old entry.
>> It should be possible to fetch keys + data out of these indexes for a
>> specific timeframe (e.g., starting at a specific timestamp, and ending
>> at a specific timestamp).
>>
>> - Collection thread
>>
>> A collection thread will run periodically. It will take a lease on a
>> single control object, which guarantees that it is the only one
>> doing this work. It will iterate over the shards, take a lease on
>> each, and read all the info that was stored there up until a specific
>> timestamp. This data will be formatted (JSON) and sent to the backend
>> using the S3 API. If there is too much data we can flush it to the
>> backend periodically using multipart upload. Once the object is
>> created, the temporary index will be trimmed up to the specific
>> timestamp.
>>
>>
>> Any thoughts?
>>
>> Yehuda
>
>
>
> --
>
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel. 734-821-5101
> fax. 734-769-8938
> cel. 734-216-5309
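
To make the single-entry-per-object variant of the temporary index in the
quoted proposal more concrete, here is a rough in-memory sketch in Python.
It is not RGW code and not necessarily how the sync module would be
implemented: the names `TempIndex`, `record_write`, `fetch`, `trim` and
`collect` are hypothetical, and a dict plus a sorted list stand in for one
shard of the sharded omap index. Leases, sharding and the S3 upload are
left out.

```python
import bisect
import json
from typing import Dict, List, Tuple

ObjKey = Tuple[str, str]          # (object name, object version)
TimeKey = Tuple[float, str, str]  # (timestamp, object name, object version)


class TempIndex:
    """Hypothetical in-memory stand-in for one shard of the temporary index."""

    def __init__(self) -> None:
        # Index 1: object + version -> timestamp of its latest write.
        self._by_object: Dict[ObjKey, float] = {}
        # Index 2: entries sorted by timestamp (+ object + version).
        self._by_time: List[TimeKey] = []

    def record_write(self, ts: float, name: str, version: str) -> None:
        """Keep a single entry per object + version, replacing any older one."""
        old_ts = self._by_object.get((name, version))
        if old_ts is not None:
            # The by-object index tells us which by-time entry is stale.
            stale = (old_ts, name, version)
            i = bisect.bisect_left(self._by_time, stale)
            if i < len(self._by_time) and self._by_time[i] == stale:
                del self._by_time[i]
        self._by_object[(name, version)] = ts
        bisect.insort(self._by_time, (ts, name, version))

    def fetch(self, start: float, end: float) -> List[TimeKey]:
        """Return entries with start <= timestamp < end, oldest first."""
        lo = bisect.bisect_left(self._by_time, (start, "", ""))
        hi = bisect.bisect_left(self._by_time, (end, "", ""))
        return self._by_time[lo:hi]

    def trim(self, up_to: float) -> None:
        """Drop every entry with timestamp < up_to from both indexes."""
        hi = bisect.bisect_left(self._by_time, (up_to, "", ""))
        for _ts, name, version in self._by_time[:hi]:
            del self._by_object[(name, version)]
        del self._by_time[:hi]


def collect(index: TempIndex, up_to: float) -> str:
    """One collection pass: read entries older than up_to, encode, trim.

    A real collector would upload the payload and trim only once the
    upload succeeded; here the JSON string is simply returned.
    """
    entries = index.fetch(float("-inf"), up_to)
    payload = json.dumps(
        [{"timestamp": ts, "object": name, "version": version}
         for ts, name, version in entries]
    )
    index.trim(up_to)
    return payload
```

With this layout an overwrite of the same object + version inside the
time window replaces the earlier entry instead of adding a second one,
which is what makes the second (by-object) index necessary.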