On Fri, 6 Oct 2017 16:55:31 +0100 Luis Periquito wrote:

> Not looking at anything else, you didn't set the max_bytes or
> max_objects for it to start flushing...
>
Precisely!
He says, cackling, as he goes to cash in his bet. ^o^

> On Fri, Oct 6, 2017 at 4:49 PM, Shawfeng Dong <shaw@xxxxxxxx> wrote:
> > Dear all,
> >
> > Thanks a lot for the very insightful comments/suggestions!
> >
> > There are 3 OSD servers in our pilot Ceph cluster, each with 2x 1TB SSDs
> > (boot disks), 12x 8TB SATA HDDs and 2x 1.2TB NVMe SSDs. We use the
> > bluestore backend, with the first NVMe as the WAL and DB device for the
> > OSDs on the HDDs, and we are trying to create a cache tier out of the
> > second NVMes.
> >
> > Here are the outputs of the commands suggested by David:
> >
> > 1) # ceph df
> > GLOBAL:
> >     SIZE     AVAIL     RAW USED     %RAW USED
> >     265T     262T        2847G          1.05
> > POOLS:
> >     NAME                ID     USED      %USED     MAX AVAIL     OBJECTS
> >     cephfs_data         1          0         0          248T           0
> >     cephfs_metadata     2      8515k         0          248T          24
> >     cephfs_cache        3      1381G    100.00             0      355385
> >
> > 2) # ceph osd df
> >  0   hdd  7.27829  1.00000  7452G  2076M  7450G   0.03   0.03  174
> >  1   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  169
> >  2   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  173
> >  3   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  159
> >  4   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  173
> >  5   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  162
> >  6   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  149
> >  7   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  179
> >  8   hdd  7.27829  1.00000  7452G  2076M  7450G   0.03   0.03  163
> >  9   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  194
> > 10   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  185
> > 11   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  168
> > 36  nvme  1.09149  1.00000  1117G   855G   262G   76.53  73.01  79
> > 12   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  180
> > 13   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  168
> > 14   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  178
> > 15   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  170
> > 16   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  149
> > 17   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  203
> > 18   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  173
> > 19   hdd  7.27829  1.00000  7452G  2076M  7450G   0.03   0.03  158
> > 20   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  154
> > 21   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  160
> > 22   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  167
> > 23   hdd  7.27829  1.00000  7452G  2076M  7450G   0.03   0.03  188
> > 37  nvme  1.09149  1.00000  1117G  1061G  57214M  95.00  90.63  98
> > 24   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  187
> > 25   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  200
> > 26   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  147
> > 27   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  171
> > 28   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  162
> > 29   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  152
> > 30   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  174
> > 31   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  176
> > 32   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  182
> > 33   hdd  7.27829  1.00000  7452G  2072M  7450G   0.03   0.03  155
> > 34   hdd  7.27829  1.00000  7452G  2076M  7450G   0.03   0.03  166
> > 35   hdd  7.27829  1.00000  7452G  2076M  7450G   0.03   0.03  176
> > 38  nvme  1.09149  1.00000  1117G   857G   260G   76.71  73.18  79
> >                    TOTAL    265T   2847G   262T   1.05
> > MIN/MAX VAR: 0.03/90.63  STDDEV: 22.81
> >
> > 3) # ceph osd tree
> > -1       265.29291 root default
> > -3        88.43097     host pulpo-osd01
> >  0   hdd   7.27829         osd.0        up  1.00000  1.00000
> >  1   hdd   7.27829         osd.1        up  1.00000  1.00000
> >  2   hdd   7.27829         osd.2        up  1.00000  1.00000
> >  3   hdd   7.27829         osd.3        up  1.00000  1.00000
> >  4   hdd   7.27829         osd.4        up  1.00000  1.00000
> >  5   hdd   7.27829         osd.5        up  1.00000  1.00000
> >  6   hdd   7.27829         osd.6        up  1.00000  1.00000
> >  7   hdd   7.27829         osd.7        up  1.00000  1.00000
> >  8   hdd   7.27829         osd.8        up  1.00000  1.00000
> >  9   hdd   7.27829         osd.9        up  1.00000  1.00000
> > 10   hdd   7.27829         osd.10       up  1.00000  1.00000
> > 11   hdd   7.27829         osd.11       up  1.00000  1.00000
> > 36  nvme   1.09149         osd.36       up  1.00000  1.00000
> > -5        88.43097     host pulpo-osd02
> > 12   hdd   7.27829         osd.12       up  1.00000  1.00000
> > 13   hdd   7.27829         osd.13       up  1.00000  1.00000
> > 14   hdd   7.27829         osd.14       up  1.00000  1.00000
> > 15   hdd   7.27829         osd.15       up  1.00000  1.00000
> > 16   hdd   7.27829         osd.16       up  1.00000  1.00000
> > 17   hdd   7.27829         osd.17       up  1.00000  1.00000
> > 18   hdd   7.27829         osd.18       up  1.00000  1.00000
> > 19   hdd   7.27829         osd.19       up  1.00000  1.00000
> > 20   hdd   7.27829         osd.20       up  1.00000  1.00000
> > 21   hdd   7.27829         osd.21       up  1.00000  1.00000
> > 22   hdd   7.27829         osd.22       up  1.00000  1.00000
> > 23   hdd   7.27829         osd.23       up  1.00000  1.00000
> > 37  nvme   1.09149         osd.37       up  1.00000  1.00000
> > -7        88.43097     host pulpo-osd03
> > 24   hdd   7.27829         osd.24       up  1.00000  1.00000
> > 25   hdd   7.27829         osd.25       up  1.00000  1.00000
> > 26   hdd   7.27829         osd.26       up  1.00000  1.00000
> > 27   hdd   7.27829         osd.27       up  1.00000  1.00000
> > 28   hdd   7.27829         osd.28       up  1.00000  1.00000
> > 29   hdd   7.27829         osd.29       up  1.00000  1.00000
> > 30   hdd   7.27829         osd.30       up  1.00000  1.00000
> > 31   hdd   7.27829         osd.31       up  1.00000  1.00000
> > 32   hdd   7.27829         osd.32       up  1.00000  1.00000
> > 33   hdd   7.27829         osd.33       up  1.00000  1.00000
> > 34   hdd   7.27829         osd.34       up  1.00000  1.00000
> > 35   hdd   7.27829         osd.35       up  1.00000  1.00000
> > 38  nvme   1.09149         osd.38       up  1.00000  1.00000
> >
> > 4) # ceph osd pool get cephfs_cache all
> > min_size: 2
> > crash_replay_interval: 0
> > pg_num: 128
> > pgp_num: 128
> > crush_rule: pulpo_nvme
> > hashpspool: true
> > nodelete: false
> > nopgchange: false
> > nosizechange: false
> > write_fadvise_dontneed: false
> > noscrub: false
> > nodeep-scrub: false
> > hit_set_type: bloom
> > hit_set_period: 14400
> > hit_set_count: 12
> > hit_set_fpp: 0.05
> > use_gmt_hitset: 1
> > auid: 0
> > target_max_objects: 0
> > target_max_bytes: 0
> > cache_target_dirty_ratio: 0.4
> > cache_target_dirty_high_ratio: 0.6
> > cache_target_full_ratio: 0.8
> > cache_min_flush_age: 0
> > cache_min_evict_age: 0
> > min_read_recency_for_promote: 0
> > min_write_recency_for_promote: 0
> > fast_read: 0
> > hit_set_grade_decay_rate: 0
> >
> > Do you see anything wrong? We had written some small files to the CephFS
> > before we tried to write the big 1TB file. What is puzzling to me is that
> > no data has been written back to the data pool.
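
For the record, that output confirms it: target_max_objects and target_max_bytes
are both 0, and the cache_target_dirty/full ratios are interpreted relative to
those targets, so the tiering agent never has a threshold to flush or evict
against. Something along these lines should give it one; the values are purely
illustrative and need to be sized to what the NVMe cache pool can actually hold:

# ceph osd pool set cephfs_cache target_max_bytes 1000000000000
# ceph osd pool set cephfs_cache target_max_objects 1000000

With a 1TB target_max_bytes, the existing cache_target_dirty_ratio of 0.4 means
flushing should start at roughly 400GB of dirty data, and the full ratio of 0.8
should cap the pool at about 800GB before eviction kicks in. If the pool is
already wedged full, a manual "rados -p cephfs_cache cache-flush-evict-all"
(or simply letting the agent catch up) should drain it once the targets are set.
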
> >
> > Best,
> > Shaw
> >
> > On Fri, Oct 6, 2017 at 6:46 AM, David Turner <drakonstein@xxxxxxxxx> wrote:
> >>
> >> On Fri, Oct 6, 2017, 1:05 AM Christian Balzer <chibi@xxxxxxx> wrote:
> >>>
> >>> Hello,
> >>>
> >>> On Fri, 06 Oct 2017 03:30:41 +0000 David Turner wrote:
> >>>
> >>> > You're missing most all of the important bits. What the osds in your
> >>> > cluster look like, your tree, and your cache pool settings.
> >>> >
> >>> > ceph df
> >>> > ceph osd df
> >>> > ceph osd tree
> >>> > ceph osd pool get cephfs_cache all
> >>> >
> >>> Especially the last one.
> >>>
> >>> My money is on not having set target_max_objects and target_max_bytes to
> >>> sensible values along with the ratios.
> >>> In short, not having read the (albeit spotty) documentation.
> >>>
> >>> > You have your writeback cache on 3 nvme drives. It looks like you have
> >>> > 1.6TB available between them for the cache. I don't know the behavior
> >>> > of a writeback cache tier on cephfs for large files, but I would guess
> >>> > that it can only hold full files and not flush partial files.
> >>>
> >>> I VERY much doubt that, if so it would be a massive flaw.
> >>> One assumes that cache operations work on the RADOS object level, no
> >>> matter what.
> >>
> >> I hope that it is on the rados level, but not a single object had been
> >> flushed to the backing pool, so I hazarded a guess. Seeing his settings
> >> will shed more light.
> >>>
> >>> > That would mean your cache needs to have enough space for any file
> >>> > being written to the cluster. In this case a 1.3TB file with 3x
> >>> > replication would require 3.9TB (more than double what you have
> >>> > available) of available space in your writeback cache.
> >>> >
> >>> > There are very few use cases that benefit from a cache tier. The docs
> >>> > for Luminous warn as much.
> >>> You keep repeating that like a broken record.
> >>>
> >>> And while certainly not false, I for one wouldn't be able to use (justify
> >>> using) Ceph w/o cache tiers in our main use case.
> >>>
> >>> In this case I assume they were following an old cheat sheet or such,
> >>> suggesting the previously required cache tier with EC pools.
> >>
> >> http://docs.ceph.com/docs/luminous/rados/operations/cache-tiering/
> >>
> >> I know I keep repeating it, especially recently as there have been a lot
> >> of people asking about it. The Luminous docs added a large section about
> >> how it is probably not what you want. Like me, the docs are not saying
> >> that there are no use cases for it. There was no information provided
> >> about the use case, so I made some suggestions/guesses. I'm also guessing
> >> that they are following a guide where a writeback cache was necessary for
> >> CephFS to use EC prior to Luminous. I also usually add that people should
> >> test it out and find what works best for them. I will always defer to your
> >> practical use of cache tiers as well, especially when using RBDs.
> >>
> >> I manage a cluster on which I intend to continue running a writeback cache
> >> in front of CephFS on the same drives as the EC pool. The use case receives
> >> a good enough benefit from the cache tier that it isn't even required to
> >> use flash media to see it. It is used for video editing and the files are
> >> usually modified and read within the first 24 hours and then left in cold
> >> storage until deleted.
> >> I have the cache timed to keep everything in it for 24 hours and then
> >> evict it, by using a minimum time to flush and evict of 24 hours and a
> >> target max bytes of 0. All files are in there for that time, and then it
> >> never has to decide what to keep, as it doesn't keep anything longer than
> >> that. Luckily, read performance from cold storage is not a requirement of
> >> this cluster, as any read operation has to first read the data from EC
> >> storage, write it to replica storage, and then read it from replica
> >> storage... Yuck.
> >>>
> >>> Christian
> >>>
> >>> > What is your goal in implementing this cache? If the answer is to
> >>> > utilize extra space on the nvmes, then just remove it and say thank
> >>> > you. The better use of nvmes in that case is as part of the bluestore
> >>> > stack, giving your osds larger DB partitions. Keeping your metadata
> >>> > pool on nvmes is still a good idea.
> >>> >
> >>> > On Thu, Oct 5, 2017, 7:45 PM Shawfeng Dong <shaw@xxxxxxxx> wrote:
> >>> >
> >>> > > Dear all,
> >>> > >
> >>> > > We just set up a Ceph cluster, running the latest stable release Ceph
> >>> > > v12.2.0 (Luminous):
> >>> > > # ceph --version
> >>> > > ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
> >>> > >
> >>> > > The goal is to serve a Ceph filesystem, for which we created 3 pools:
> >>> > > # ceph osd lspools
> >>> > > 1 cephfs_data,2 cephfs_metadata,3 cephfs_cache,
> >>> > > where
> >>> > > * cephfs_data is the data pool (36 OSDs on HDDs), which is erasure-coded;
> >>> > > * cephfs_metadata is the metadata pool;
> >>> > > * cephfs_cache is the cache tier (3 OSDs on NVMes) for cephfs_data.
> >>> > >   The cache-mode is writeback.
> >>> > >
> >>> > > Everything had worked fine, until today when we tried to copy a 1.3TB
> >>> > > file to the CephFS. We got the "No space left on device" error!
> >>> > >
> >>> > > 'ceph -s' says some OSDs are full:
> >>> > > # ceph -s
> >>> > >   cluster:
> >>> > >     id:     e18516bf-39cb-4670-9f13-88ccb7d19769
> >>> > >     health: HEALTH_ERR
> >>> > >             full flag(s) set
> >>> > >             1 full osd(s)
> >>> > >             1 pools have many more objects per pg than average
> >>> > >
> >>> > >   services:
> >>> > >     mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
> >>> > >     mgr: pulpo-mds01(active), standbys: pulpo-admin, pulpo-mon01
> >>> > >     mds: pulpos-1/1/1 up {0=pulpo-mds01=up:active}
> >>> > >     osd: 39 osds: 39 up, 39 in
> >>> > >          flags full
> >>> > >
> >>> > >   data:
> >>> > >     pools:   3 pools, 2176 pgs
> >>> > >     objects: 347k objects, 1381 GB
> >>> > >     usage:   2847 GB used, 262 TB / 265 TB avail
> >>> > >     pgs:     2176 active+clean
> >>> > >
> >>> > >   io:
> >>> > >     client:   19301 kB/s rd, 2935 op/s rd, 0 op/s wr
> >>> > >
> >>> > > And indeed the cache pool is full:
> >>> > > # rados df
> >>> > > POOL_NAME       USED   OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS   RD    WR_OPS  WR
> >>> > > cephfs_cache    1381G   355385      0 710770                  0       0        0 10004954 1522G 1398063 1611G
> >>> > > cephfs_data         0        0      0      0                  0       0        0        0     0       0     0
> >>> > > cephfs_metadata 8515k       24      0     72                  0       0        0        3  3072    3953 10541k
> >>> > >
> >>> > > total_objects    355409
> >>> > > total_used       2847G
> >>> > > total_avail      262T
> >>> > > total_space      265T
> >>> > >
> >>> > > However, the data pool is completely empty!
> >>> > > So it seems that data has only been written to the cache pool, but
> >>> > > not written back to the data pool.
> >>> > >
> >>> > > I am really at a loss whether this is due to a setup error on my
> >>> > > part, or a Luminous bug. Could anyone shed some light on this?
> >>> > > Please let me know if you need any further info.
> >>> > >
> >>> > > Best,
> >>> > > Shaw
> >>>
> >>> --
> >>> Christian Balzer        Network/Systems Engineer
> >>> chibi@xxxxxxx           Rakuten Communications
> >>
> >
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com