Re: [ceph-users] Re: osdmaps not trimmed until ceph-mon's restarted (if cluster has a down osd)

Bryan Stillwell <bstillwell@xxxxxxxxxxx> · Mon, 9 Dec 2019 17:24:27 +0000

On Nov 18, 2019, at 8:12 AM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> 
> On Fri, Nov 15, 2019 at 4:45 PM Joao Eduardo Luis <joao@xxxxxxx> wrote:
>> 
>> On 19/11/14 11:04AM, Gregory Farnum wrote:
>>> On Thu, Nov 14, 2019 at 8:14 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>>> 
>>>> Hi Joao,
>>>> 
>>>> I might have found the reason why several of our clusters (and maybe
>>>> Bryan's too) are getting stuck not trimming osdmaps.
>>>> It seems that when an osd fails, the min_last_epoch_clean gets stuck
>>>> forever (even long after HEALTH_OK), until the ceph-mons are
>>>> restarted.
>>>> 
>>>> I've updated the ticket: https://tracker.ceph.com/issues/41154
>>> 
>>> Wrong ticket, I think you meant https://tracker.ceph.com/issues/37875#note-7
>> 
>> I've seen this behavior a long, long time ago, but stopped being able to
>> reproduce it consistently enough to ensure the patch was working properly.
>> 
>> I think I have a patch here:
>> 
>>  https://github.com/ceph/ceph/pull/19076/commits
>> 
>> If you are feeling adventurous, and want to give it a try, let me know. I'll
>> be happy to forward port it to whatever you are running.
> 
> Thanks Joao, this patch is what I had in mind.
> 
> I'm trying to evaluate how adventurous this would be -- Is there any
> risk that if a huge number of osds are down all at once (but
> transiently), it would trigger the mon to trim too many maps?
> I would expect that the remaining up OSDs will have a safe, low, osd_epoch ?
> 
> And anyway I guess that your proposed get_min_last_epoch_clean patch
> is equivalent to what we have today if we restart the ceph-mon leader
> while an osd is down.

Joao,

I ran into this again today and found over 100,000 osdmaps on all 1,000 OSDs (~50 TiB of disk space used just by osdmaps).  Some OSDs were down (a pretty regular occurrence with ~1,000 OSDs), which matches what Dan found.  After I restarted all the mon nodes twice, the osdmaps started getting cleaned up.
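
As a rough back-of-the-envelope check on those numbers, here is a small sketch. It assumes each OSD holds about 100,000 maps and that the ~50 TiB is spread evenly across the 1,000 OSDs; both are simplifications, and the helper name is made up for illustration:

```python
# Back-of-the-envelope sizing; assumes an even spread across OSDs.
TIB = 1024 ** 4
GIB = 1024 ** 3
MIB = 1024 ** 2


def osdmap_footprint(total_bytes, num_osds, maps_per_osd):
    """Return (bytes per OSD, bytes per osdmap) for an evenly spread total."""
    per_osd = total_bytes / num_osds
    per_map = per_osd / maps_per_osd
    return per_osd, per_map


per_osd, per_map = osdmap_footprint(50 * TIB, 1000, 100_000)
print(f"{per_osd / GIB:.1f} GiB per OSD, {per_map / MIB:.2f} MiB per osdmap")
# prints "51.2 GiB per OSD, 0.52 MiB per osdmap"
```

So under those assumptions each OSD carries roughly 50 GiB of old maps, at about half a MiB per stored map.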

I believe the steps to reproduce would look like this:

1. Start with a cluster that has at least one down OSD
2. Expand the cluster (the bigger the expansion, the more osdmaps pile up)
3. Notice that after the expansion completes and the cluster is healthy, the old osdmaps aren't cleaned up
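
To check step 3, one option is to look at the range of osdmap epochs the mons are still holding. The sketch below assumes `ceph report` prints JSON on stdout containing `osdmap_first_committed` and `osdmap_last_committed` (true of the releases I'm aware of, but worth verifying on yours). The helper names and the sample values are made up for illustration:

```python
import json
import subprocess


def retained_osdmaps(report):
    """Number of osdmap epochs between first and last committed, inclusive."""
    first = int(report["osdmap_first_committed"])
    last = int(report["osdmap_last_committed"])
    return last - first + 1


def fetch_report():
    """Run `ceph report` and parse its JSON output (needs a reachable cluster)."""
    out = subprocess.run(
        ["ceph", "report"], check=True, capture_output=True, text=True
    ).stdout
    return json.loads(out)


# Made-up sample values, not taken from a real cluster.
sample = {"osdmap_first_committed": 1000, "osdmap_last_committed": 101500}
print(retained_osdmaps(sample))  # prints 100501
```

On a live cluster, `retained_osdmaps(fetch_report())` staying very large long after the cluster is healthy again would be consistent with the maps not being trimmed.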

I would be willing to test the fix on our test cluster after 14.2.5 comes out.  Could you make a build based on that release?

Thanks,
Bryan