I've been looking through the code for the last couple of days, trying to figure out how to reduce the start-up impact of an OSD on busy clusters, or on OSDs that are constrained on RAM.

What I see in the logs is that right after "state: booting -> active", OPs (both client OPs and repops) are queued at the OSD even though none of its PGs are active yet. This causes at least 20 seconds of latency before the first OPs are serviced (at least on my test cluster), and that is really bad for RBD.

So I was looking for a way to delay client I/O to the OSD until at least the PG is active. It would be nice if I could even delay the PG map change until at least one recovery attempt has completed, so that on busy clusters there isn't a huge backlog of OPs blocked behind the long list of repops queued from the newly started PGs that now have client OPs to service.

My first thought was: when a client OP comes in and the PG is not active, create a pg_temp and NACK the client (I think a pg_temp would make the client retry anyway). This might be dirty, could get some client OPs out of order, and in any case adds latency (bad for RBD). On idle clusters, though, this avoids creating a pg_temp for every PG.

My second thought was to create the pg_temp in load_pgs, so that at start_boot the OSD would have the "right" PG map to begin with and clients would just keep using the original OSDs until the pg_temp is removed after the PG has been recovered. I have a feeling this won't work, because the other OSDs won't see that this OSD is up and ready to start recovery/backfill. If the other OSDs learn that this one is up because it contacts them and shares a map, will they suddenly start sending it repops? I'm really not sure about this and could use some help.

If I can get this pg_temp approach to work, then it should be possible to do much of the recovery as fast as possible before any client load even hits the OSD.
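To make the pg_temp idea concrete, here is a toy sketch of the mapping behavior I'm relying on: a pg_temp entry, while present, overrides the CRUSH-computed acting set, so clients keep using the old OSDs until the entry is removed. The names and data structures below are made up for illustration; this is not Ceph's actual code.

```python
# Hypothetical model: a pg_temp entry overrides the CRUSH mapping until removed.

def acting_set(pg, crush_map, pg_temp):
    """Return the OSDs that should serve client I/O for this PG."""
    # If a pg_temp entry exists for this PG it takes precedence, so clients
    # are steered away from the newly booted OSD while it recovers.
    return pg_temp.get(pg, crush_map[pg])

crush_map = {"1.2f": [3, 7, 9]}   # CRUSH says osd.3 (just restarted) is primary
pg_temp   = {"1.2f": [7, 9]}      # temp mapping keeps clients on the old OSDs

print(acting_set("1.2f", crush_map, pg_temp))  # [7, 9] while recovering
del pg_temp["1.2f"]                            # PG recovered: activate it
print(acting_set("1.2f", crush_map, pg_temp))  # [3, 7, 9] again
```

The point is just that removal of the pg_temp entry is the "activation" switch, so the OSD controls when client I/O returns to it.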
Since only recovery OPs would be hammering the disk, there would be no need to slow recovery down, or at least not throttle it so much. Here is a workflow that I can see working:

1. load_pgs - create a pg_temp for each PG
2. start_boot
3. walk through each PG:
   3a. recover the PG; on the second iteration, activate the PG by removing the pg_temp
4. loop through step 3 twice: the first pass to catch up the PGs as fast as possible, the second to activate them

The idea is that the second iteration should hopefully be quick, since the amount of change accumulated during the first pass would be much smaller; that means fewer recovery OPs promoted in the cluster, reducing the overall impact.

Backfills could remain the same, I think, since they already do this pg_temp thing, but I for one wouldn't mind an option to do all the backfills before starting any I/O on that OSD. That way, adding a brand new OSD wouldn't have as much impact on client performance (some impact will still happen due to reads from other OSDs, but the one OSD receiving the backfill wouldn't have to try to service client I/O too).

If you can help me expand my understanding in this area, I would be most grateful. If you have suggestions on how to approach the implementation, I'd appreciate that too.
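The two-pass workflow above can be sketched as a toy simulation; everything here (the PG class, recover(), the dirty-object counts) is invented for illustration and bears no relation to Ceph's actual internals:

```python
# Toy model of the proposed two-pass boot: pass 1 catches PGs up while
# pg_temp keeps client I/O away; pass 2 replays the small remaining delta
# and activates each PG by dropping its pg_temp entry.

class PG:
    def __init__(self, name, dirty):
        self.name = name
        self.dirty = dirty            # objects this OSD is behind on

def recover(pg):
    """Replay the PG's pending delta; returns the number of ops performed."""
    ops, pg.dirty = pg.dirty, 0
    return ops

def boot_osd(pgs, pg_temp, incoming_per_pg):
    # Pass 1: catch each PG up as fast as possible. pg_temp entries stay in
    # place, so clients keep writing to the old acting set; those writes
    # accumulate as a (hopefully much smaller) new delta.
    for pg in pgs:
        recover(pg)
        pg.dirty += incoming_per_pg   # writes that landed during pass 1
    # Pass 2: replay the small delta, then activate the PG by removing its
    # pg_temp entry so clients return to this OSD.
    for pg in pgs:
        recover(pg)
        pg_temp.pop(pg.name, None)

pgs = [PG("1.0", 5000), PG("1.1", 7000)]
pg_temp = {"1.0": [7, 9], "1.1": [2, 9]}
boot_osd(pgs, pg_temp, incoming_per_pg=40)
print(pg_temp)                            # {} -- all PGs activated
print(all(pg.dirty == 0 for pg in pgs))   # True -- fully caught up
```

The claim the sketch encodes is simply that the second pass only has to move the delta that accumulated during the first pass (40 objects per PG here, versus thousands), which is where the reduced client impact would come from.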
Thanks,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1