On Fri, Apr 12, 2019 at 10:31 AM Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
>
> We have a user syncing data with some kind of rsync + hardlink based
> system creating/removing large numbers of hard links. We've
> encountered many of the issues with stray inode re-integration as
> described in the thread and tracker below.
>
> As noted, one fix is to increase mds_bal_fragment_size_max so the stray
> directories can accommodate the high stray count. We blew right
> through 200,000, then 300,000, and at this point I'm wondering if
> there is an upper safe limit on this parameter? If I go to something
> like 1mil to work with this use case, will I have other problems?

I'd recommend trying to find a solution that doesn't require you to
tweak this.

We ended up essentially building a "repository of origin files",
maybe abusing rsync --link-dest (I don't quite recall; a rough sketch
of the pattern is appended at the end of this message). Ours was a
case where changes were always additive at the file level: files
never changed and were only ever added, never removed. So we didn't
have to worry about "garbage collecting" it, and the amounts were
also pretty small.

Assuming it doesn't fragment the stray directories, your primary
problem is going to be omap sizes. Problems we've run into with large
omaps:

- Replication of omaps isn't fast if you ever have to do recovery
  (which you will).
- LevelDB/RocksDB compaction for large sets is painful; the bigger,
  the more painful. This is the kind of thing that creeps up on you -
  you may not notice it until you have a multi-minute compaction,
  which blocks requests at the affected OSD(s) for the duration.
- OSDs being flagged as down due to the above, once the omaps get
  sufficiently large.
- Specifically for ceph-mds and stray directories, potentially higher
  memory usage.
- Back on hammer, we suspected we'd found some replication
  corner-cases where we ended up with omaps out of sync (inconsistent
  objects, which required some surgery with ceph-objectstore-tool).
  This happened infrequently, but given that you're essentially
  exceeding "recommended" limits, you are more likely to find
  corner-cases/bugs.

In terms of "actual numbers", I'm hesitant to commit to anything. At
some point we did run the mds with mds_bal_fragment_size_max at 10M
and didn't notice any problems, though that could well be because it
was the cluster that was the target of every experiment - a very
noisy environment with relatively low expectations. Where we *really*
noticed the omaps, I think they were well over 10M entries; since
that growth came from radosgw on jewel, it crept up on us and didn't
appear on our radar until we had blocked requests on an OSD for
minutes, ending up affecting e.g. rbd.

> Background:
> https://www.spinics.net/lists/ceph-users/msg51985.html
> http://tracker.ceph.com/issues/38849
>
> thanks,
> Ben
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Kjetil Joergensen <kjetil@xxxxxxxxxxxx>
SRE, Medallia Inc
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
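
Appendix: for anyone wanting the shape of the --link-dest pattern
referenced above, here is a minimal sketch, written in Python for
readability. The source, destination paths, and layout are
hypothetical illustrations and stated assumptions, not a
reconstruction of what we actually ran.

#!/usr/bin/env python3
# Minimal sketch of rsync --link-dest hard-link snapshots.
# SOURCE, BACKUP_ROOT, and the "latest" symlink are hypothetical.
import datetime
import pathlib
import subprocess

SOURCE = "myhost:/data/"                              # hypothetical source
BACKUP_ROOT = pathlib.Path("/cephfs/backups/myhost")  # hypothetical destination
LATEST = BACKUP_ROOT / "latest"                       # symlink to newest snapshot

def snapshot():
    dest = BACKUP_ROOT / datetime.datetime.now().strftime("%Y-%m-%dT%H%M%S")
    cmd = ["rsync", "-a"]
    if LATEST.is_symlink():
        # Files unchanged since the previous snapshot become hard links
        # into it, so only new/changed files consume new inodes and data.
        cmd.append("--link-dest=" + str(LATEST.resolve()))
    cmd += [SOURCE, str(dest)]
    subprocess.run(cmd, check=True)
    # Atomically repoint "latest" at the snapshot we just made.
    tmp = BACKUP_ROOT / "latest.tmp"
    if tmp.is_symlink():
        tmp.unlink()
    tmp.symlink_to(dest)
    tmp.replace(LATEST)

if __name__ == "__main__":
    snapshot()

Note that mass-deleting old snapshots is exactly the kind of unlink
burst that fills the stray directories; part of why this worked for
us is that our layout was purely additive and nothing was ever
removed.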