On Tue, Jul 3, 2018 at 11:53 AM Jesus Cea <jcea@xxxxxxx> wrote:
>
> Hi there.
>
> I have an issue with cephfs and multiple datapools inside. I have like
> SIX datapools inside the cephfs, and I control where files are stored
> using xattrs on the directories.
>
> The "root" directory only contains directories with xattrs requesting
> that new objects be stored in different pools. So far so good. The
> problem is that the "root" datapool has a ghost file per file inside
> the cephfs, even when the object is actually stored in a different
> datapool. They take no space at all, but they count against the number
> of objects in the cephfs.
>
> The files are empty (0 bytes) but they have xattrs saying in what pool
> the object is actually stored.

Right: as you've noticed, they're not spurious; they're where we keep a
"backtrace" xattr for a file. Backtraces are lazily updated paths that
enable CephFS to map an inode number back to a file's metadata, which is
needed when resolving hard links or NFS file handles. The trouble with
keeping the backtrace in the individual data pools is that the MDS would
have to scan through the pools to find it, so instead all files get a
backtrace in the root pool.

> Should this data be stored in the metadata pool?

Yes, probably. As you say, it's ugly how we end up with these extra
objects in a multi-data-pool situation. However, it's not always more
efficient: when there is only a single data pool, piggy-backing the
backtrace onto the existing data object reduces the overall object
count.

The backtrace in the data pool has a secondary purpose: disaster
recovery. If the metadata pool is damaged or lost entirely, the
backtrace enables cephfs-data-scan to do a pretty good job of
reconstructing an approximate filesystem tree that links back to the
files.

One option would be to do both by default: write the backtrace to the
metadata pool for its ordinary functional lookup purpose, but also write
it to the data pool as an intentionally redundant resilience measure.
The extra write to the data pool could be disabled by anyone who wants
to save the IOPS at the cost of some resilience.

John

> By comparison, my metadata pool is 244 MB in size, but it was basically
> the same size when I had no objects as it is now, with 1.3 million
> objects: ~250 MB.
>
>     cephfsROOT_data       70      0        0    354G   1332929
>     cephfsROOT_metadata   71   244M     0.07    354G      1625
>     black_1               72   944G    52.58    851G    241736
>     black_2               73   944G    52.59    851G    241744
>     black_3               74   953G    52.82    851G    243990
>     black_4               75   934G    52.33    851G    239243
>     black_5               76   944G    52.59    851G    241814
>     black_6               77   531G    38.44    851G    136081
>
> The black_* pools are associated with the cephfs via "ceph fs
> add_data_pool XXX". For each object created inside a black_* pool, a
> ghost empty object is created in "cephfsROOT_data".
>
> That huge number of files is causing other issues, like the warning "1
> pools have many more objects per pg than average".
>
> About that warning: I changed the value of "mon_pg_warn_max_object_skew"
> and the warning is still there. Asking the monitors about it, they show
> the new value, but asking from the client, it still shows "10":
>
>     root@jcea:/var/run/ceph# ceph --admin-daemon ceph-client.jcea.asok --id
>     jcea config show | grep skew
>         "mon_timecheck_skew_interval": "30",
>         "mon_pg_warn_max_object_skew": "10",
>
> I don't know how to proceed.
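P.S. On the "objects per pg" warning: if this cluster is on Luminous or
later, I believe that health check is generated by ceph-mgr rather than
by the monitors, so a value injected only into the mons (or read from a
client's admin socket, as above) won't clear it. A rough sketch of
pushing the option to the mgr instead; the threshold of 100 is just an
illustrative number, the mgr id is assumed to match the short hostname,
and "ceph config set" needs Mimic or newer:

    # Mimic and later: store it centrally for all mgr daemons
    ceph config set mgr mon_pg_warn_max_object_skew 100

    # Luminous: set it on the active mgr's admin socket, then verify
    # (admin-socket changes don't survive a restart, so also put it in
    # ceph.conf under [mgr])
    ceph daemon mgr.$(hostname -s) config set mon_pg_warn_max_object_skew 100
    ceph daemon mgr.$(hostname -s) config show | grep object_skew

The warning should go away on the next health refresh once the mgr sees
the new value.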
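P.P.S. For anyone who wants to see what those zero-byte objects in the
root data pool actually hold, here is a rough sketch of poking at one by
hand. The object name below is made up for illustration (real names are
the file's inode number in hex followed by ".00000000"), the pool name
is taken from the listing above, and the backtrace itself lives in an
xattr called "parent":

    # Object names in the data pool are <inode in hex>.<stripe index>
    rados -p cephfsROOT_data ls | head

    # The backtrace is the "parent" xattr on the first stripe object
    rados -p cephfsROOT_data listxattr 10000000001.00000000
    rados -p cephfsROOT_data getxattr 10000000001.00000000 parent > parent.bin

    # Decode it into readable JSON (it is an inode_backtrace_t)
    ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json

The decoded output is the ancestor chain the MDS follows when it has to
turn a bare inode number back into a path.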