Re: Trying to reduce OSD start-up impact on busy clusters

vstart.sh sets it to true, so it is being exercised there! I'm going to
do some testing with this turned on once my test cluster gets healthy
again. I've confirmed in the mon logs that, with it set to true, the
mon is setting pg_temp. I just need to see whether it helps with the
load and RAM usage, or whether I need to delay the removal of pg_temp
until later in the process. I'm concerned that there is still too much
contention on the OSD/disk until all the recovery is done, and that
trying to do recovery and normal OPs at the same time will still lead
to slow I/O, only delayed a bit.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Feb 25, 2016 at 12:57 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Thu, 25 Feb 2016, Robert LeBlanc wrote:
>>
>> Looking through the logs, I don't see any attempt to prime the
>> pg_temp. In the code, if mon_osd_prime_pg_temp is true then it tries,
>> but it is false by default. I'll keep looking into this. I hope it is
>> something simple like this.
>
> Oh!  Try setting it to true?
>
> I didn't realize it was off by default; we should probably turn it on in
> Jewel.
>
> sage
>
>
>>
>>
>> On Wed, Feb 24, 2016 at 5:12 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> > On Wed, 24 Feb 2016, Robert LeBlanc wrote:
>> >>
>> >> I've been looking through the code for the last couple of days,
>> >> trying to figure out how to reduce the start-up impact of OSDs on
>> >> busy clusters, or of OSDs that are constrained on RAM. What I see in
>> >> the logs is that right after "state: booting -> active", OPs (both
>> >> client ops and repops) are getting queued at the OSD even though
>> >> none of the PGs are active. This causes at least 20 seconds of
>> >> latency before the first OPs get serviced (at least on my test
>> >> cluster), and that is really bad for RBD.
>> >>
>> >> So I was looking for a way to delay client I/O to the OSD until at
>> >> least the PG is active. It would be nice if I could even delay the
>> >> PGMap update until after at least one recovery attempt, so that on
>> >> busy clusters there isn't a huge backlog of OPs blocked behind the
>> >> long list of RepOps queued to the newly started PGs that now have
>> >> client OPs to service.
>> >>
>> >> My first thought was, when a client OP comes in and the PG is not
>> >> active, to create a pg_temp and NACK the client (I think a pg_temp
>> >> would make the client retry anyway). This might be dirty, could get
>> >> some client OPs out of order, and in any case adds latency (bad for
>> >> RBD). On the other hand, on idle clusters there wouldn't need to be
>> >> a pg_temp for each PG.
>> >>
>> >> My second thought was to create the pg_temp in load_pgs, so that at
>> >> start_boot the cluster would just have the "right" PGMap to begin
>> >> with, and clients would keep using the original OSDs until the
>> >> pg_temp is removed later, after the PG has been recovered. I have a
>> >> feeling this will not work, because the other OSDs won't see that
>> >> this OSD is up and ready to start recovery/backfill. If the other
>> >> OSDs learn that this one is up because it contacts them and shares
>> >> a map, will they suddenly start sending it repops? I'm really not
>> >> sure about this and could use some help.
>> >>
>> >> If I can get this pg_temp approach to work, then it is possible to
>> >> do much of the recovery as fast as possible before any client load
>> >> even hits the OSD. Since only recovery OPs would be hammering the
>> >> disk, there is no need to slow recovery down, or at least not to
>> >> throttle it so much.
>> >>
>> >> Here is a workflow that I can see working:
>> >> 1. load_pgs - create the pg_temp entries
>> >> 2. start_boot
>> >> 3. walk through each PG
>> >> 3a. recover the PG; if this is the second iteration, activate the
>> >> PG by removing the pg_temp
>> >> 4. Loop through step 3 twice: the first pass to catch up the PGs as
>> >> fast as possible, the second to activate them.
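(Restating that workflow as a rough sketch, in illustrative Python
rather than Ceph code; set_pg_temp, clear_pg_temp, and recover are
hypothetical stand-ins for the real machinery:)

```python
# Illustrative sketch of the proposed OSD start-up flow; NOT Ceph code.
# set_pg_temp / clear_pg_temp / recover are hypothetical stand-ins.

def osd_startup(pgs, set_pg_temp, clear_pg_temp, recover):
    # 1. load_pgs: pin each PG to its prior mapping so clients keep
    #    using the old OSDs while this OSD catches up.
    for pg in pgs:
        set_pg_temp(pg)
    # 2. start_boot happens here; the OSD is up but not acting.
    # 3/4. Two passes over the PGs: the first catches them up as fast
    # as possible with no client load on this OSD; the second does a
    # (much shorter) catch-up and activates each PG by dropping its
    # pg_temp entry.
    for pg in pgs:
        recover(pg)
    for pg in pgs:
        recover(pg)
        clear_pg_temp(pg)
```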
>> >
>> > So, 1-3ish should sort of already be happening, but in a slightly
>> > different way: mon/OSDMonitor.cc should be calling the
>> > maybe_prime_pg_temp() and prime_pg_temp(), which should identify which PG
>> > mappings are about to change and install pg_temp mappings with their prior
>> > mappings.  That way the same OSDMap epoch that shows the OSD coming up
>> > *also* has the pg_temp keeping it out of the acting set.  Then, when
>> > peering completes and if backfill isn't necessary the mapping will be
>> > removed.
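(Sanity-checking my reading of that: the priming step amounts to
something like the toy sketch below, where PG ids and acting sets are
just dicts and lists; this is not the real OSDMonitor code:)

```python
def prime_pg_temp(old_acting, new_acting):
    """Toy sketch of pg_temp priming, NOT the OSDMonitor implementation.

    For each PG whose acting set is about to change in the new epoch,
    install a pg_temp entry holding its prior acting set, so the same
    OSDMap epoch that shows the OSD coming up also keeps it out of the
    acting set until peering completes.
    """
    pg_temp = {}
    for pg, acting in new_acting.items():
        prior = old_acting.get(pg)
        if prior is not None and prior != acting:
            pg_temp[pg] = prior  # clients keep talking to the old set
    return pg_temp
```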
>> >
>> > First, I would verify that this is actually happening.. it sounds like
>> > maybe it isn't... and if not we should fix that.  Maybe reproduce one of
>> > these transitions (bringing a down OSD back up) and see what happens?
>> >
>> > But, assuming it is, I think the main thing that can be done is to change
>> > the peering code to do any log-based recovery to the up-but-not-acting
>> > OSD before removing the pg_temp mapping.  That way, when the pg_temp
>> > mapping is removed, it should be a very fast peering transition.
>> >
>> > I think the last thing that is probably slowing things down is that on any
>> > peering transition there is the up_thru update that has to be applied on
>> > the mon before the PG can go active.  This can be slow.  It's there so
>> > that a series of OSDMap changes doesn't get us into a situation where
>> > there are mappings that the map had but that the OSDs never realized, the
>> > OSDs then failed, and we cannot peer because we're stuck trying to probe
>> > OSDs that never wrote data.  (This is the maybe_went_rw logic in PG.cc and
>> > osd_types.cc, and we'll need to be very careful making optimizations here.)
>> >
>> > sage
>>
>>


