Hi,

I work with Paweł on this subject. Do I understand correctly that not
marking osd.1 out, but instead bringing it down first (i.e. by killing the
process), would prevent this situation from happening?
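In other words, something along these lines (just a sketch, with osd.1 as
an example; the exact service commands depend on the init system):

  # stop the slow OSD first, so it is marked down while still "in"
  stop ceph-osd id=1        # upstart; with systemd: systemctl stop ceph-osd@1
  # wait for peering to settle and check that nothing is unfound
  ceph -s
  ceph health detail | grep unfound
  # only then remove it from data placement and let rebalancing start
  ceph osd out 1

If I read your explanation correctly, the remaining replicas would then
never learn about a newer object version that only the slow OSD had.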
On 13.10.2016 18:53, Sage Weil wrote:
> On Tue, 11 Oct 2016, Paweł Sadowski wrote:
>> Hi Sage,
>>
>> On 11.10.2016 at 15:45, Sage Weil wrote:
>>
>>> Hi Pawel,
>>>
>>> On Tue, 11 Oct 2016, Paweł Sadowski wrote:
>>>> Hi,
>>>>
>>>> I managed to trigger unfound objects on a pool with size 3 and min_size
>>>> 2 just by removing a 'slow' OSD (out and then stop), which is quite
>>>> frightening. Shouldn't Ceph stop IO if there is only one copy in this
>>>> case (even during recovery/peering/etc.)? I'm able to reproduce this on
>>>> Hammer (0.94.5, 0.94.9) and Jewel (10.2.3). So far I wasn't able to
>>>> trigger this behavior by just stopping such an OSD (still testing).
>>>
>>> This is definitely concerning. I have a couple questions...
>>>
>>> 1. Looking at the log, I see that at one point all of the OSDs mark
>>> themselves down, here:
>>>
>>> 2016-10-11 08:46:23.473335 mon.0 10.99.128.50:6789/0 3275 : cluster [INF]
>>> osd.1 marked itself down
>>>
>>> Do you know why they do that?
>>
>> It's probably due to load triggered by the rebalance after marking the
>> OSD out. In other tests the OSDs stayed up during this period. It didn't
>> happen in my last test (logs posted via ceph-post-file).
>>
>>> 2. Are you throttling the CPU on just a single OSD, or on a whole host? I
>>> also see that the monitors are calling elections. (This shouldn't have
>>> anything to do with the problem, but I'm not sure I understand the test
>>> setup.)
>>
>> I'm throttling the CPU for a single OSD process (in our production we had
>> a disk that was slowing down an OSD). The monitors share a device with one
>> of the OSDs. Sometimes an election happens after marking the OSD out,
>> sometimes not. It is probably caused by load on the underlying device
>> (it didn't happen in the last test).
>>
>>>> Second thing: the throttling mechanism blocks recovery operations/the
>>>> whole OSD [4] when there are a lot of client requests for missing
>>>> objects. I think it shouldn't be like that.
>>>
>>> Yeah, there is a similar problem when a PG is inactive and requests pile
>>> up, eventually preventing ops even on active PGs... and 'ceph tell $pgid
>>> query' admin commands. It's non-trivial to fix, though: we need a way to
>>> inform the client that a PG or individual object is blocked so that they
>>> stop sending requests... and then also a way to inform them that the PG is
>>> unblocked so they can start again.
>>
>> OK. Is it wise to increase this throttling limit and, if so, by how much?
>> Anyway, in the end it'll probably just delay the moment when the OSD gets
>> blocked.
>
> Yeah, I wouldn't increase it (at least not by much)... it'll just delay
> the inevitable.
>
>>>> 1: logs from Jewel
>>>> https://gist.github.com/anonymous/c8618adca8984132c82f16c351222883
>>>
>>> Do you mind reproducing this sequence with debug ms = 1, debug osd = 20,
>>> and capturing all of the OSD logs as well as the cluster ceph.log? You can
>>> send us the tarball with the ceph-post-file utility.
>>
>> Sure. Gzipped logs are ~2.7G, ceph-post-file id:
>> 7e16f69e-c410-49c2-b11c-e727854ce7b3
>
> Thanks! I took a look at this and I see what is going on.
> It's... subtle.
>
> Initially the PG is, say, OSDs [1,2,3]. We are writing a sequence of
> updates to existing objects. Eventually we mark osd.1 out. At that point
> we have:
>
> osd.1:   object A version '23 in flight to disk and replicas
> osd.2,3: object A version '22 (haven't gotten the replicated write yet)
>
> If osds 2 and 3 learn about the new osdmap (osd.1 is out) before osd.1,
> they will toss the replicated write away when they get it (they've already
> moved on and are part of the way through peering in the new [2,3]
> interval). And during peering they learn from osd.1 that it has a newer
> version of the object, '23. They assume this is the best one and include
> the '23 write as part of the official history.
>
> All is well as long as osd.1 stays alive long enough for them to also
> fetch version '23 of the object. If osd.1 is then killed before that
> happens, though, we are in the situation where osds [2,3] know there was a
> newer write but there is no longer a copy available.
>
> Since the '23 write didn't actually complete (not all replicas committed
> it, so the client didn't get an ack) it is equally valid for us to throw
> it out and say that '22 is the "official" version. But if we do that we
> could fall into a similar situation where, say, osd.1 and osd.3 had
> '23 and osd.2 had '22, we chose '22, osd.1 and 3 throw away their
> 'divergent' '23, but then osd.2 fails before we copy '22 around.
>
> Somewhere in there is a better strategy that recognizes that either
> version is okay and either picks the official version that is least likely
> to lead to one of these failure cases *or* adds some complicated logic to
> record that either is okay and adjust the official record
> as needed if one of those failure situations comes up.
>
> For now, the way to address this is to do the mark-unfound-revert thing,
> which takes you back to '22 from a situation where the history says '22
> was modified to get '23 but you don't have a '23 copy (only '22's). This
> is a manual admin intervention, but it's generally safe and is designed to
> handle exactly this case.
>
> Perhaps the simplest improvement would be to automatically recognize
> that the '23 version was never acked to the client, so an automatic revert
> to '22 is safe...
>
> Sam's on vacation, but I suspect he'll have some additional thoughts on
> this when he gets back next week!
>
> sage
>
>>> Thanks!
>>> sage
>>>
>>>> 2: steps to reproduce
>>>> - put some load on the cluster (run FIO with high iodepth)
>>>> - slow down a single OSD (in my case reduce its CPU time using cgroups:
>>>>   cpu.cfs_quota_us 15000)
>>>> - sleep 120
>>>> - ceph osd out 6
>>>> - sleep 15
>>>> - stop ceph-osd id=6
>>>> - unfound objects appear
>>>>
>>>> This is not 100% reproducible but in my test lab (9 OSDs) I'm able to
>>>> trigger this very easily.
>>>>
>>>> 3:
>>>> mon-01:~ # ceph osd pool get rbd size
>>>> size: 3
>>>> mon-01:~ # ceph osd pool get rbd min_size
>>>> min_size: 2
>>>> mon-01:~ # ceph --version
>>>> ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>>>>
>>>> 4:
>>>> perf dump | grep -A 2 'throttle-osd_client_messages'
>>>>     "throttle-osd_client_messages": {
>>>>         "val": 100,
>>>>         "max": 100,
>>>>
>>>> ops_in_flight:
>>>> https://gist.github.com/anonymous/643607fa3f959c91ba7a9794e5d99dea
>>
>> Thanks!
>>
>> --
>> PS
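By the way, for anyone else hitting this: the manual mark-unfound-revert
intervention Sage refers to is, as far as I understand it, roughly the
following (just a sketch; 0.2a is an example pgid):

  # find the PG(s) with unfound objects and inspect them
  ceph health detail | grep unfound
  ceph pg 0.2a list_unfound
  # roll the unfound objects back to their last known version
  ceph pg 0.2a mark_unfound_lost revert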
--
Tomasz Kuzemko
tomasz.kuzemko@xxxxxxxxxxxx