Re: Re: How's cephfs going?

Hi,

While not necessarily CephFS specific - we frequently end up with objects that have inconsistent omaps. This seems to be a replication issue (anecdotally it's a replica that ends up diverging, and at least a few times it happened after the OSD holding that replica was restarted). I had hoped http://tracker.ceph.com/issues/17177 would solve this, but it doesn't appear to have solved it completely.
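For anyone wanting to spot the same thing, something along these lines works as a rough sketch - it just wraps the rados list-inconsistent-pg / list-inconsistent-obj commands, and assumes a metadata pool named "cephfs_metadata" plus the JSON field names those commands have emitted since jewel (adjust to taste):

#!/usr/bin/env python
# Rough sketch: list the objects that deep-scrub has flagged as inconsistent,
# so you can see whether it's the omaps that diverged again.
# Assumes the pool name below, that the rados CLI is in $PATH, and the
# JSON field names used by list-inconsistent-obj in recent releases.
import json
import subprocess

POOL = "cephfs_metadata"  # assumption: adjust to your pool name

def rados_json(*args):
    return json.loads(subprocess.check_output(["rados"] + list(args)))

for pgid in rados_json("list-inconsistent-pg", POOL):
    report = rados_json("list-inconsistent-obj", pgid, "--format=json")
    for item in report.get("inconsistents", []):
        name = item["object"]["name"]
        errors = set(item.get("errors", [])) | set(item.get("union_shard_errors", []))
        print("%s %s: %s" % (pgid, name, ", ".join(sorted(errors)) or "unknown"))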

We also have one workload we'd need to re-engineer to be a good fit for CephFS: we create a lot of hard links with no clear "origin" file, which is slightly at odds with the hard-link implementation. If I understand correctly, an unlink moves the inode from the directory tree into the stray directories and decrements the link count; if the count reaches zero it gets purged, otherwise it's kept around until another link to it is encountered and it's re-integrated back into the tree. This netted us hilariously large stray directories, which combined with the above were less than ideal.
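If anyone wants to keep an eye on the same thing, a minimal sketch of reading the stray counters off the MDS admin socket - this assumes an MDS id of "a" and that the counters still sit under "mds_cache" (num_strays / strays_created / strays_reintegrated), which may differ by release:

#!/usr/bin/env python
# Minimal sketch: read the stray-directory counters from an MDS admin socket
# to watch how large the stray directories are getting.
# Assumes an MDS id of "a" and the counter names under "mds_cache".
import json
import subprocess

MDS_ID = "a"  # assumption: replace with your MDS id

perf = json.loads(subprocess.check_output(
    ["ceph", "daemon", "mds.%s" % MDS_ID, "perf", "dump"]))

cache = perf.get("mds_cache", {})
for counter in ("num_strays", "strays_created", "strays_reintegrated"):
    print("%-20s %s" % (counter, cache.get(counter, "n/a")))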

Beyond that, there have been other small(-ish) bugs we've encountered, but they've all been solvable by cherry-picking fixes, upgrading, or doing surgery with the available tools, guided by the internet and/or an approximate understanding of how it's supposed to work.

-KJ

On Wed, Jul 19, 2017 at 11:20 AM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
Thanks Greg. I thought it was impossible when I reported 34MB for 52 million files. 

On Jul 19, 2017 1:17 PM, "Gregory Farnum" <gfarnum@xxxxxxxxxx> wrote:


On Wed, Jul 19, 2017 at 10:25 AM David <dclistslinux@xxxxxxxxx> wrote:
On Tue, Jul 18, 2017 at 6:54 AM, Blair Bethwaite <blair.bethwaite@xxxxxxxxx> wrote:
We are a data-intensive university, with an increasingly large fleet
of scientific instruments capturing various types of data (mostly
imaging of one kind or another). That data typically needs to be
stored, protected, managed, shared, connected/moved to specialised
compute for analysis. Given the large variety of use-cases we are
being somewhat more circumspect in our CephFS adoption and really only
dipping toes in the water, ultimately hoping it will become a
long-term default NAS choice from Luminous onwards.

On 18 July 2017 at 15:21, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
> All of that said, you could also consider using rbd and zfs or whatever filesystem you like. That would allow you to gain the benefits of scaleout while still getting a feature rich fs. But, there are some down sides to that architecture too.

We do this today (KVMs with a couple of large RBDs attached via
librbd+QEMU/KVM), but the throughput achievable this way is
nothing like native CephFS - adding more RBDs doesn't seem to help
increase overall throughput. Also, if you have NFS clients you will
absolutely need SSD ZIL. And of course you then have a single point of
failure and downtime for regular updates etc.

In terms of small file performance I'm interested to hear about
experiences with in-line file storage on the MDS.

Also, while we're talking about CephFS - what size metadata pools are
people seeing on their production systems with 10s-100s millions of
files?

On a system with 10.1 million files, the metadata pool is 60MB


Unfortunately that's not really an accurate assessment, for good but terrible reasons:
1) CephFS metadata is principally stored via the omap interface (which is designed for handling things like the directory storage CephFS needs)
2) omap is implemented via LevelDB/RocksDB
3) there is not a good way to determine which pool is responsible for which portion of RocksDB's data
4) So the pool stats do not incorporate omap data usage at all in their reports (it's part of the overall space used, and is one of the things that can make that larger than the sum of the per-pool spaces)

You could try and estimate it by looking at how much "lost" space there is (and subtracting out journal sizes and things, depending on setup). But I promise there's more than 60MB of CephFS metadata for 10.1 million files!
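For a more direct (if slow) number, you could also walk the metadata pool and add up the omap keys/values yourself with the python-rados bindings. A rough sketch - it assumes a pool named "cephfs_metadata", and since it reads every key of every object it will take a long while on pools backing tens of millions of files:

#!/usr/bin/env python
# Rough sketch: tally the omap keys/values stored for the metadata pool,
# since the pool stats don't account for them. Assumes the pool name below
# and the python-rados bindings; slow on large pools.
import rados

POOL = "cephfs_metadata"  # assumption: adjust to your metadata pool name
BATCH = 500               # page through omap keys in chunks of this size

def omap_bytes(ioctx, oid):
    """Return (key count, key+value bytes) for one object's omap."""
    count, nbytes, start_after = 0, 0, ""
    while True:
        with rados.ReadOpCtx() as read_op:
            it, _ = ioctx.get_omap_vals(read_op, start_after, "", BATCH)
            ioctx.operate_read_op(read_op, oid)
            got = 0
            for key, val in it:
                count += 1
                nbytes += len(key) + len(val)
                start_after = key
                got += 1
        if got < BATCH:
            return count, nbytes

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(POOL)

total_keys = total_bytes = 0
for obj in ioctx.list_objects():
    keys, nbytes = omap_bytes(ioctx, obj.key)
    total_keys += keys
    total_bytes += nbytes

print("%d omap keys, roughly %.1f MB of omap data" %
      (total_keys, total_bytes / (1024.0 * 1024.0)))
ioctx.close()
cluster.shutdown()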
-Greg





--
Kjetil Joergensen <kjetil@xxxxxxxxxxxx>
SRE, Medallia Inc
Phone: +1 (650) 739-6580
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
