I'm not a big expert, but the OP said he suspects bitrot is at least part
of the issue, in which case you can have the situation where the drive has
ACK'ed the write but a later scrub discovers checksum errors.

Plus, you don't need to actually lose a drive to get inconsistent PGs with
size=2 min_size=1: flapping OSDs (even just temporarily) while the cluster
is receiving writes can generate this.

On Fri, Nov 3, 2017 at 12:05 PM, Denes Dolhay <denke@xxxxxxxxxxxx> wrote:

> Hi Greg,
>
> Accepting the fact that an OSD with outdated data can never accept a
> write, or IO of any kind, how is it possible that the system goes into
> this state?
>
> - All OSDs are BlueStore (checksums, mtimes, etc.)
> - All OSDs are up and in
> - No HW failures, lost disks, damaged journals or databases, etc.
> - The data became inconsistent
>
> Thanks,
> Denke.
>
> On 11/02/2017 11:51 PM, Gregory Farnum wrote:
>
> On Thu, Nov 2, 2017 at 1:21 AM koukou73gr <koukou73gr@xxxxxxxxx> wrote:
>>
>> The scenario is actually a bit different, see:
>>
>> Let's assume size=2, min_size=1.
>> - We are looking at pg "A", acting [1, 2]
>> - osd 1 goes down
>> - osd 2 accepts a write for pg "A"
>> - osd 2 goes down
>> - osd 1 comes back up, while osd 2 is still down
>> - osd 1 has no way to know that osd 2 accepted a write for pg "A"
>> - osd 1 accepts a new write to pg "A"
>> - osd 2 comes back up.
>>
>> Bang! osd 1 and 2 now have different views of pg "A", but both claim to
>> have current data.
>
> In this case, OSD 1 will not accept IO precisely because it cannot prove
> it has the current data. That is the basic purpose of OSD peering, and it
> holds in all cases.
> -Greg
>
>> -K.
>>
>> On 2017-11-01 20:27, Denes Dolhay wrote:
>> > Hello,
>> >
>> > I have a trick question for Mr. Turner's scenario:
>> > Let's assume size=2, min_size=1.
>> > - We are looking at pg "A", acting [1, 2]
>> > - osd 1 goes down: OK
>> > - osd 1 comes back up, backfill of pg "A" commences from osd 2 to osd 1: OK
>> > - osd 2 goes down (and therefore pg "A"'s backfill to osd 1 is
>> > incomplete and stopped): not OK, but this is the case...
>> > --> In this event, why does osd 1 accept IO to pg "A", knowing full well
>> > that its data is outdated and will cause an inconsistent state?
>> > Wouldn't it be prudent to deny IO to pg "A" until either
>> > - osd 2 comes back (so we have a clean OSD in the acting set), in which
>> > case backfill to osd 1 would of course continue, or
>> > - the data in pg "A" is manually marked as lost, and operation then
>> > continues from osd 1's (outdated) copy?
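
To make the sequence above concrete, here is a minimal sketch in plain
Python (not Ceph code; the OSD class and write() helper are invented for
illustration) of what a naive size=2/min_size=1 system would do in
koukou73gr's scenario:

class OSD:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.version = 0          # last write this OSD knows about

    def __repr__(self):
        return f"{self.name}(up={self.up}, v{self.version})"

def write(pg, data_version, min_size=1):
    """Apply a write to every up OSD in the acting set, if enough are up."""
    up_osds = [o for o in pg if o.up]
    if len(up_osds) < min_size:
        raise RuntimeError("PG inactive: below min_size")
    for o in up_osds:
        o.version = data_version

osd1, osd2 = OSD("osd.1"), OSD("osd.2")
pg_a = [osd1, osd2]               # acting set for pg "A"

osd1.up = False                   # osd 1 goes down
write(pg_a, data_version=1)       # osd 2 alone acks the write (min_size=1)
osd2.up = False                   # osd 2 goes down
osd1.up = True                    # osd 1 comes back, unaware of version 1
write(pg_a, data_version=2)       # a naive system would accept this write
osd2.up = True                    # osd 2 comes back up

print(pg_a)  # [osd.1(up=True, v2), osd.2(up=True, v1)]

Both replicas now claim to be current while holding different data. This is
exactly the state Greg says peering prevents: in real Ceph, osd.1 would not
serve IO for pg "A" until it can reconcile the PG's write history with
osd.2, or the missing copy is explicitly marked lost.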