On Fri, Apr 12, 2019 at 10:31 AM Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
>
> We have a user syncing data with some kind of rsync + hardlink based
> system creating/removing large numbers of hard links. We've
> encountered many of the issues with stray inode re-integration as
> described in the thread and tracker below.
>
> As noted, one fix is to increase mds_bal_fragment_size_max so the stray
> directories can accommodate the high stray count. We blew right
> through 200,000, then 300,000, and at this point I'm wondering if
> there is an upper safe limit on this parameter? If I go to something
> like 1mil to work with this use case, will I have other problems?

I'd recommend trying to find a solution that doesn't require you to
tweak this.

We ended up essentially building a "repository of origin files",
maybe abusing rsync --link-dest (I don't quite recall; a rough sketch
of the pattern is appended at the end of this message). Ours was a
case where changes were always additive at the file level: files
never changed and were only ever added, never removed. So we didn't
have to worry about "garbage collecting" it, and the amounts were
also pretty small.

Assuming it doesn't fragment the stray directories, your primary
problem is going to be omap sizes. Problems we've run into with large
omaps:

- Replication of omaps isn't fast if you ever have to do recovery
  (which you will).
- LevelDB/RocksDB compaction for large sets is painful; the bigger,
  the more painful. This is the kind of thing that creeps up on you -
  you may not notice it until you have a multi-minute compaction,
  which blocks requests at the affected OSD(s) for the duration.
- OSDs being flagged as down due to the above, once the omaps get
  sufficiently large.
- Specifically for ceph-mds and stray directories, potentially higher
  memory usage.
- Back on hammer, we suspected we'd found some replication
  corner-cases where we ended up with omaps out of sync (inconsistent
  objects, which required some surgery with ceph-objectstore-tool).
  This happened infrequently, but given that you're essentially
  exceeding "recommended" limits, you are more likely to find
  corner-cases/bugs.

In terms of "actual numbers", I'm hesitant to commit to anything. At
some point we did run the mds with mds_bal_fragment_size_max at 10M
and didn't notice any problems, though that could well be because it
was the cluster that was the target of every experiment - a very
noisy environment with relatively low expectations. Where we *really*
noticed the omaps, I think they were well over 10M entries; since
that growth came from radosgw on jewel, it crept up on us and didn't
appear on our radar until we had blocked requests on an OSD for
minutes, ending up affecting e.g. rbd.

> Background:
> https://www.spinics.net/lists/ceph-users/msg51985.html
> http://tracker.ceph.com/issues/38849
>
> thanks,
> Ben
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Kjetil Joergensen <kjetil@xxxxxxxxxxxx>
SRE, Medallia Inc
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
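
Appendix: for anyone wanting the shape of the --link-dest pattern
referenced above, here is a minimal sketch, written in Python for
readability. The source, destination paths, and layout are
hypothetical illustrations and stated assumptions, not a
reconstruction of what we actually ran.

#!/usr/bin/env python3
# Minimal sketch of rsync --link-dest hard-link snapshots.
# SOURCE, BACKUP_ROOT, and the "latest" symlink are hypothetical.
import datetime
import pathlib
import subprocess

SOURCE = "myhost:/data/"                              # hypothetical source
BACKUP_ROOT = pathlib.Path("/cephfs/backups/myhost")  # hypothetical destination
LATEST = BACKUP_ROOT / "latest"                       # symlink to newest snapshot

def snapshot():
    dest = BACKUP_ROOT / datetime.datetime.now().strftime("%Y-%m-%dT%H%M%S")
    cmd = ["rsync", "-a"]
    if LATEST.is_symlink():
        # Files unchanged since the previous snapshot become hard links
        # into it, so only new/changed files consume new inodes and data.
        cmd.append("--link-dest=" + str(LATEST.resolve()))
    cmd += [SOURCE, str(dest)]
    subprocess.run(cmd, check=True)
    # Atomically repoint "latest" at the snapshot we just made.
    tmp = BACKUP_ROOT / "latest.tmp"
    if tmp.is_symlink():
        tmp.unlink()
    tmp.symlink_to(dest)
    tmp.replace(LATEST)

if __name__ == "__main__":
    snapshot()

Note that mass-deleting old snapshots is exactly the kind of unlink
burst that fills the stray directories; part of why this worked for
us is that our layout was purely additive and nothing was ever
removed.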