Re: snap_trimming + backfilling is inefficient with many purged_snaps

-- Dan van der Ster || Data & Storage Services || CERN IT Department --

September 18 2014 9:12 PM, "Dan van der Ster" <daniel.vanderster@xxxxxxx> wrote: 
> Hi,
> 
> September 18 2014 9:03 PM, "Florian Haas" <florian@xxxxxxxxxxx> wrote:
> 
>> On Thu, Sep 18, 2014 at 8:56 PM, Dan van der Ster <daniel.vanderster@xxxxxxx> wrote:
>> 
>>> Hi Florian,
>>> 
>>> On Sep 18, 2014 7:03 PM, Florian Haas <florian@xxxxxxxxxxx> wrote: 
>>>> Hi Dan,
>>>> 
>>>> saw the pull request, and can confirm your observations, at least
>>>> partially. Comments inline.
>>>> 
>>>> On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster
>>>> <daniel.vanderster@xxxxxxx> wrote: 
>>>>>>> Do I understand your issue report correctly in that you have found
>>>>>>> setting osd_snap_trim_sleep to be ineffective, because it's being
>>>>>>> applied when iterating from PG to PG, rather than from snap to snap?
>>>>>>> If so, then I'm guessing that that can hardly be intentional…
>>>>> 
>>>>> I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap
>>>>> trimmer is to trim one single snap, the one which is in the snap_trimq but not yet in
>>>>> purged_snaps. So the only time the current sleep implementation could be useful is if we
>>>>> rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a
>>>>> huge problem anyway since you’d at most need to trim O(100) PGs.
>>>> 
>>>> Hmm. I'm actually seeing this in a system where the problematic snaps
>>>> could *only* have been RBD snaps.
>>> 
>>> True, as am I. The current sleep is useful in this case, but since we'd normally only expect
>>> up to ~100 of these PGs per OSD, the trimming of 1 snap across all of those PGs would finish
>>> rather quickly anyway. Latency would surely be increased momentarily, but I wouldn't expect
>>> 90s slow requests like I have with the 30000 snap_trimq single PG.
>>> 
>>> Possibly the sleep is useful in both places.
>>> 
>>>>> We could move the snap trim sleep into the SnapTrimmer state machine, for example in
>>>>> ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD,
>>>>> but of course the trimming PG would remain locked. And it would be locked for even longer
>>>>> now due to the sleep.
>>>>> To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve
>>>>> done in this pull req: https://github.com/ceph/ceph/pull/2516
>>>>> Breaking out of the trimmer like that should allow IOs to the trimming PG to get through.
>>>>> 
>>>>> The second aspect of this issue is why are the purged_snaps being lost to begin with. I’ve
>>>>> managed to reproduce that on my test cluster. All you have to do is create many pool snaps
>>>>> (e.g. of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to
>>>>> move the PGs around. With debug_osd>=10, you will see "adding snap 1 to purged_snaps",
>>>>> which is one signature of this lost purged_snaps issue. To reproduce slow requests the
>>>>> number of snaps purged needs to be O(10000).
>>>> 
>>>> Hmmm, I'm not sure if I confirm that. I see "adding snap X to
>>>> purged_snaps", but only after the snap has been purged. See
>>>> https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
>>>> fact that the OSD tries to trim a snap only to get an ENOENT is
>>>> probably indicative of something being fishy with the snaptrimq and/or
>>>> the purged_snaps list as well.
>>> 
>>> With such a long snap_trimq there in your log, I suspect you're seeing the exact same
>>> behavior as I am. In my case the first snap trimmed is snap 1, of course because that is the
>>> first rm'd snap, and the contents of your pool are surely different. I also see the ENOENT
>>> messages... again confirming those snaps were already trimmed. Anyway, what I've observed is
>>> that a large snap_trimq like that will block the OSD until they are all re-trimmed.
>> 
>> That's... a mess.
>> 
>> So what is your workaround for recovery? My hunch would be to
>> 
>> - stop all access to the cluster;
>> - set nodown and noout so that other OSDs don't mark spinning OSDs
>> down (which would cause all sorts of primary and PG reassignments,
>> useless backfill/recovery when mon osd down out interval expires,
>> etc.);
>> - set osd_snap_trim_sleep to a ridiculously high value like 10 or 30
>> so that at least *between* PGs, the OSD has a chance to respond to
>> heartbeats and do whatever else it needs to do;
>> - let the snap trim play itself out over several hours (days?).
> 
> What I've been doing is I just continue draining my OSDs, two at a time. Each time, 1-2 other OSDs
> become blocked for a couple minutes (out of the ~1 hour it takes to drain) while a single PG
> re-trims, leading to ~100 slow requests. The OSD must still be responding to the peer pings, since
> other OSDs do not mark it down. Luckily this doesn't happen with every single movement of our pool
> 5 PGs, otherwise it would be a disaster like you said.

Two other, riskier workarounds that I haven't tried yet:

1. Lower osd_snap_trim_thread_timeout from 3600s to something like 10 or 20s, so that these long trim operations are simply killed. I have no idea if this is safe.
2. Pay close attention to the slow requests and manually mark the affected OSDs down when they become blocked. With the trimming OSD marked down, the IOs should go elsewhere until the OSD recovers later. But I don't know how the backfilling OSD will behave if it is manually marked down while trimming.
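
To make those two options concrete, here is a small dry-run sketch that only builds and prints the ceph CLI invocations, it does not run them. The OSD id 12 and the 20s timeout are placeholders, and I am assuming (untested) that injectargs would change osd_snap_trim_thread_timeout at runtime; it may well need a ceph.conf change and an OSD restart instead.

```python
import shlex


def workaround_commands(osd_id, trim_thread_timeout_s=20):
    """Build (but do not run) the ceph CLI calls for the two untested workarounds.

    osd_id and trim_thread_timeout_s are placeholder values for illustration.
    """
    osd = f"osd.{osd_id}"
    return [
        # Workaround 1: lower the snap trim thread timeout on one OSD.
        # Assumption: the option can be injected at runtime.
        ["ceph", "tell", osd, "injectargs",
         f"--osd_snap_trim_thread_timeout {trim_thread_timeout_s}"],
        # Workaround 2: manually mark the blocked OSD down.
        ["ceph", "osd", "down", str(osd_id)],
    ]


if __name__ == "__main__":
    for cmd in workaround_commands(12):
        # Print a copy-pasteable shell line instead of executing anything.
        print(shlex.join(cmd))
```

Printing rather than executing is deliberate, since neither workaround has been verified as safe.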

Cheers, Dan
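
P.S. For anyone skimming the thread, the idea behind limiting the number of trims per SnapTrimmer instance can be shown with a toy model. This is plain Python, not Ceph code: trim_pass, drain and the limit of 16 are made-up names and values, and a "pass" simply stands for one period during which the PG stays locked.

```python
from collections import deque


def trim_pass(snap_trimq, purged_snaps, max_trims_per_pass):
    """Trim at most max_trims_per_pass snaps, then return.

    Returning early models breaking out of the trimmer so that the PG
    lock can be dropped and queued client IO gets a turn.
    """
    trimmed = 0
    while snap_trimq and trimmed < max_trims_per_pass:
        snap = snap_trimq.popleft()
        purged_snaps.add(snap)  # record the snap as purged
        trimmed += 1
    return trimmed


def drain(snap_trimq, max_trims_per_pass):
    """Run trim passes until the queue is empty; return (passes, purged set)."""
    purged = set()
    passes = 0
    while snap_trimq:
        trim_pass(snap_trimq, purged, max_trims_per_pass)
        passes += 1  # between passes, other IO to this PG could be served
    return passes, purged


if __name__ == "__main__":
    # 30000 queued snaps, as in the single problematic PG described earlier.
    one_shot, _ = drain(deque(range(1, 30001)), max_trims_per_pass=30000)
    batched, _ = drain(deque(range(1, 30001)), max_trims_per_pass=16)
    print("passes without a limit:", one_shot)
    print("passes with a limit of 16:", batched)
```

Without a limit, the whole queue is trimmed in one long locked pass; with a limit, the same work is split into many short passes, which is where a per-pass sleep could also go.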