Suppose a client sends a write on object foo to osd.0, osd.1, and osd.2. osd.1 and osd.2 shut down before applying the write, but not before osd.0 records the update to foo locally. At this point, osd.0 is informed that osd.1 and osd.2 have gone down. However, read_min_size=1, so it begins accepting reads (*). A client reads foo, so osd.0 returns the new state. osd.0 goes down. osd.1 and osd.2 come back up. The client reads foo again. osd.1 (now the primary) returns its copy of foo (**), but that copy is out of date, resulting in an inconsistent read.

Without read_min_size, (*) can't happen unless min_size=1, in which case (**) won't happen until osd.0 comes back up. On the flip side, with min_size=2, at (*) osd.0 would instead have to find another osd to bring up to date before accepting reads and writes (as there'd be enough copies then). If it dies prior to that point, osd.1 and osd.2 can safely assume that their state is new enough to serve IO.

This is not the only valid choice in the design space, but every alternative involves tradeoffs. For instance, if we performed non-destructive writes and therefore kept prior copies of foo, we could conceivably return a prior version of foo known to have been committed to min_size replicas. In general, though, that would require something like an additional distributed commit on the object before acking to the client or serving reads (to ensure that all replicas have the new read bound), on top of the overhead of non-destructive writes (and I'm almost certainly missing things even so).

-Sam

On Tue, Jan 12, 2021 at 7:10 AM Prasad Krishnan <prasad.krishnan@xxxxxxxxxxxx> wrote:
>
> Hi Sam,
>
> Thank you for responding, and apologies for missing your reply; I noticed it only recently.
>
> On Tue, Jan 5, 2021 at 6:16 AM Sam Just <sjust@xxxxxxxxxx> wrote:
>>
>> Part of the answer is that going "readable" with read_min_size
>> replicas has a side effect of committing any writes those replicas
>> happen to know about whether they were actually committed to
>> write_min_size replicas or not because once we've served a read
>> reflecting those writes, all future reads must also reflect those
>> writes.
>
> I'm wondering why it isn't a problem now? If there's a mechanism that
> prevents uncommitted write transactions from being read now, I can't see
> how read/write min_size separation would break that.
>
> Thanks,
> Prasad Krishnan
>
>>
>> -Sam
>>
>> On Thu, Dec 24, 2020 at 6:14 AM Prasad Krishnan
>> <prasad.krishnan@xxxxxxxxxxxx> wrote:
>> >
>> > Dear Ceph developers,
>> >
>> > Presently Ceph has a single config option named min_size which decides the minimum number of copies that must be available before any client I/O operation (read or write) can be performed on a given RADOS pool.
>> >
>> > Would it make sense to split it into two i.e. read_min_size and write_min_size to allow better data availability?
>> >
>> > For instance, in a pool with replication size of 3 (where 3 copies are stored), if two OSDs go down, we would want to avoid client write operations (to reduce risk of data loss) but allow client read operations from the single copy that is available. This can be done by setting read_min_size to 1, but retaining write_min_size to 2.
>> >
>> > Are there any technical reasons why this cannot work? Any pitfalls that I don't foresee?
>> >
>> > Thanks,
>> > K.Prasad
>> >
>> > -----------------------------------------------------------------------------------------
>> >
>> > This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error, please notify the system manager. This message contains confidential information and is intended only for the individual named. If you are not the named addressee, you should not disseminate, distribute or copy this email. Please notify the sender immediately by email if you have received this email by mistake and delete this email from your system. If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.
>> >
>> > Any views or opinions presented in this email are solely those of the author and do not necessarily represent those of the organization. Any information on shares, debentures or similar instruments, recommended product pricing, valuations and the like are for information purposes only. It is not meant to be an instruction or recommendation, as the case may be, to buy or to sell securities, products, services nor an offer to buy or sell securities, products or services unless specifically stated to be so on behalf of the Flipkart group. Employees of the Flipkart group of companies are expressly required not to make defamatory statements and not to infringe or authorise any infringement of copyright or any other legal right by email communications. Any such communication is contrary to organizational policy and outside the scope of the employment of the individual concerned. The organization will not accept any liability in respect of such communication, and the employee responsible will be personally liable for any damages or other liability arising.
>> >
>> > Our organization accepts no liability for the content of this email, or for the consequences of any actions taken on the basis of the information provided, unless that information is subsequently confirmed in writing. If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.
>> >
>> > -----------------------------------------------------------------------------------------
>> >
>> > _______________________________________________
>> > Dev mailing list -- dev@xxxxxxx
>> > To unsubscribe send an email to dev-leave@xxxxxxx
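
The osd.0/osd.1/osd.2 scenario at the top of this message can be replayed with a small toy model. This is only a sketch under loud assumptions: ToyOSD, read, and run_scenario are hypothetical names, not Ceph code; there is no peering, no log, and no past-interval tracking; and osd.1 and osd.2 are assumed to go readable without waiting for osd.0, which is the behaviour a naive read_min_size=1 would permit. Versions of foo are plain integers.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOSD:
    """Hypothetical stand-in for an OSD: an up/down flag and a version per object."""
    name: str
    up: bool = True
    store: dict = field(default_factory=dict)


def read(osds, obj, read_min_size):
    """Serve a read from the first up OSD (the toy 'primary').

    Returns None when fewer than read_min_size replicas are up,
    i.e. the placement group is treated as not readable.
    """
    live = [o for o in osds if o.up]
    if len(live) < read_min_size:
        return None
    return live[0].store.get(obj)


def run_scenario(read_min_size):
    """Replay the sequence of events and return (first_read, second_read)."""
    osds = [ToyOSD("osd.0"), ToyOSD("osd.1"), ToyOSD("osd.2")]
    for o in osds:
        o.store["foo"] = 1          # version 1 is committed on all replicas

    # The new write (version 2) reaches only osd.0 before osd.1 and osd.2 stop.
    osds[0].store["foo"] = 2
    osds[1].up = False
    osds[2].up = False

    first = read(osds, "foo", read_min_size)    # the point marked (*)

    # osd.0 goes down; osd.1 and osd.2 come back with their old copies.
    osds[0].up = False
    osds[1].up = True
    osds[2].up = True

    second = read(osds, "foo", read_min_size)   # served by osd.1
    return first, second


if __name__ == "__main__":
    print("read_min_size=1:", run_scenario(1))
    print("read_min_size=2:", run_scenario(2))
```

With read_min_size=1 the client sees version 2 and then version 1, which is the inconsistent read. With read_min_size=2 the first read is refused, so version 2 is never observed and serving version 1 afterwards is consistent.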