Looking through the logs, I don't see any attempt to prime pg_temp. In the
code the monitor only tries if mon_osd_prime_pg_temp is true, and that is
false by default. I'll keep looking into this; I hope it is something as
simple as that.
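
For reference, this is roughly how I plan to flip the option on my test
cluster to see whether priming changes the boot behavior (the option name
is the one above; the exact ceph.conf and injectargs syntax is from memory,
so treat this as a sketch):

    # persistently, in ceph.conf on the monitor hosts
    [mon]
        mon osd prime pg temp = true

    # or at runtime, without restarting the mons
    ceph tell mon.* injectargs '--mon_osd_prime_pg_temp=true'

If that makes the first OPs after "booting -> active" get serviced sooner,
it at least narrows down where the latency is coming from.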

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1


On Wed, Feb 24, 2016 at 5:12 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Wed, 24 Feb 2016, Robert LeBlanc wrote:
>> I've been looking through the code for the last couple of days, trying
>> to figure out how to reduce the start-up impact of an OSD on busy
>> clusters, or on OSDs that are constrained on RAM. What I see in the
>> logs is that right after "state: booting -> active", OPs (both client
>> ops and repops) are queued at the OSD even though none of its PGs are
>> active yet. This causes at least 20 seconds of latency before the first
>> OPs are serviced (at least on my test cluster), and that is really bad
>> for RBD.
>>
>> So I was looking for a way to delay client I/O to the OSD until at
>> least the PG is active. It would be nice if I could even delay the
>> PGMap until after at least one attempt at recovery has been made, so
>> that on busy clusters there isn't a huge backlog of OPs blocked waiting
>> on the long list of RepOps queued from the newly started PGs that now
>> have client OPs to service.
>>
>> My first thought was: when a client OP comes in and the PG is not
>> active, create a pg_temp and NACK the client (I think a pg_temp would
>> make the client retry anyway). This might be dirty, it could get some
>> client OPs out of order, and in any case it adds latency (bad for RBD).
>> On idle clusters, though, there wouldn't need to be a pg_temp for each
>> PG.
>>
>> My second thought was to create the pg_temp in load_pgs, so that by
>> start_boot the cluster already has the "right" PGMap and the clients
>> just keep using the original OSDs until the pg_temp is removed, after
>> the PG has been recovered. I have a feeling this will not work, because
>> the other OSDs won't see that this OSD is up and ready to start
>> recovery/backfill. If the other OSDs learn that this one is up because
>> it contacts them and shares a map, will they suddenly start sending it
>> repops? I'm really not sure about this and could use some help.
>>
>> If I can get this pg_temp approach to work, then it should be possible
>> to do much of the recovery as fast as possible before any client load
>> even hits the OSD. Since only recovery OPs would be hammering the disk,
>> there would be no need to slow recovery down, or at least not to
>> throttle it so much.
>>
>> Here is a workflow that I can see working:
>> 1. load_pgs - create the pg_temp
>> 2. start_boot
>> 3. walk through each PG
>> 3a. recover the PG; on the second iteration, activate the PG by
>>     removing the pg_temp
>> 4. loop through step 3 twice: the first pass to catch up the PGs as
>>    fast as possible, the second to activate them
>
> So, 1-3ish should sort of already be happening, but in a slightly
> different way: mon/OSDMonitor.cc should be calling maybe_prime_pg_temp()
> and prime_pg_temp(), which identify which PG mappings are about to
> change and install pg_temp entries pointing at their prior mappings.
> That way the same OSDMap epoch that shows the OSD coming up *also* has
> the pg_temp keeping it out of the acting set. Then, when peering
> completes, the mapping is removed if backfill isn't necessary.
>
> First, I would verify that this is actually happening... it sounds like
> maybe it isn't... and if not we should fix that. Maybe reproduce one of
> these transitions (bringing a down OSD back up) and see what happens?
>
> But, assuming it is, I think the main thing that can be done is to
> change the peering code to also do any log-based recovery to the
> up-but-not-acting OSD before removing the pg_temp mapping. That way,
> when the pg_temp mapping is removed, it should be a very fast peering
> transition.
>
> I think the last thing that is probably slowing things down is the
> up_thru update that has to be applied on the mon on any peering
> transition before the PG can go active. This can be slow. It's there so
> that a series of OSDMap changes doesn't get us into a situation where
> the map had mappings that the OSDs never realized, those OSDs then
> failed, and we cannot peer because we're stuck trying to probe OSDs
> that never wrote data. (This is the maybe_went_rw logic in PG.cc and
> osd_types.cc, and we'll need to be very careful making optimizations
> here.)
>
> sage
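
P.S. For anyone else trying to reproduce the transition Sage describes
above (bringing a down OSD back up and checking whether the epoch that
marks it up also carries pg_temp entries), this is roughly what I'm
running on my test cluster. The commands are from memory and osd.3 is
just a placeholder, so treat it as a sketch:

    # confirm the mon actually has the option set
    ceph daemon mon.$(hostname -s) config get mon_osd_prime_pg_temp

    # in one terminal, watch for pg_temp entries appearing in the osdmap
    watch -n 1 "ceph osd dump | grep '^pg_temp'"

    # in another, bring the down OSD back up and note the epoch where it
    # gets marked up (init syntax depends on the release)
    /etc/init.d/ceph start osd.3      # or: systemctl start ceph-osd@3
    ceph osd dump | grep 'osd.3 '

If the epoch that marks the OSD up does not also have pg_temp entries for
its PGs, that would line up with what I'm seeing in the logs.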