Ceph cache-pool overflow

Hi list! I have an interesting problem:

I have a Ceph cluster consisting of 3 nodes, with a cache pool.
The cache pool consists of PLEXTOR PX-AG128M6e drives, 2 per node, 6 drives of 128 GB in total.

The cache pool was created with the following parameters (the equivalent ceph osd pool set commands are sketched after the list):

size: 2
min_size: 1
crash_replay_interval: 0
pg_num: 512
pgp_num: 512
hit_set_type: bloom
hit_set_period: 3600
hit_set_count: 1
target_max_objects: 0
target_max_bytes: 300647710720
cache_target_dirty_ratio: 0.4
cache_target_full_ratio: 0.8
cache_min_flush_age: 0
cache_min_evict_age: 0
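
For completeness, those cache-tier parameters were applied with commands roughly like the following; the pool name "ssd-cache" here stands in for our actual cache pool:

# ceph osd pool set ssd-cache hit_set_type bloom
# ceph osd pool set ssd-cache hit_set_period 3600
# ceph osd pool set ssd-cache hit_set_count 1
# ceph osd pool set ssd-cache target_max_bytes 300647710720
# ceph osd pool set ssd-cache cache_target_dirty_ratio 0.4
# ceph osd pool set ssd-cache cache_target_full_ratio 0.8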

This configuration worked fine for quite some time, until the cluster started being used more actively in production. I should mention that we store virtual machine disks in Ceph, and snapshots are automatically created every night for all virtual machines stored in Ceph.

Everything was fine until, one night during the snapshots, the following happened:
All virtual machines froze, and Ceph I/O stopped working.
Ceph reported that some PGs were too full:

# ceph -s
     health HEALTH_ERR
            2 pgs backfill_toofull
            2 pgs stuck unclean
            2 requests are blocked > 32 sec
            recovery 572/475366 objects misplaced (0.120%)
            1 full osd(s)
            3 near full osd(s)
monmap e1: 3 mons at {HV-01=10.10.101.11:6789/0,HV-02=10.10.101.12:6789/0,HV-03=10.10.101.13:6789/0}
            election epoch 150, quorum 0,1,2 HV-01,HV-02,HV-03
     osdmap e1065: 15 osds: 15 up, 15 in; 2 remapped pgs
            flags full
      pgmap v1418948: 1024 pgs, 2 pools, 857 GB data, 231 kobjects
            1832 GB used, 49019 GB / 50851 GB avail
            572/475366 objects misplaced (0.120%)
                1022 active+clean
                   2 active+remapped+backfill_toofull

I urgently needed to solve the problem, so I set size: 1 for the ssd-cache pool, evicted most of the objects from it to the main pool, and then set size back to 2.
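
In case the details matter, the emergency steps were roughly the following (pool name again illustrative):

# ceph osd pool set ssd-cache size 1
# rados -p ssd-cache cache-flush-evict-all
# ceph osd pool set ssd-cache size 2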

After that, I started investigating why this could have happened.
It seemed strange, but I thought that maybe I had set too large a value for target_max_bytes (300647710720), so I changed it to target_max_bytes: 200000000000.
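
For reference, the rough arithmetic around the original value (taking the nominal 128 GB per drive):

6 drives x 128 GB              = 768 GB raw in the cache tier
768 GB with size: 2            ~ 384 GB of usable cache capacity
target_max_bytes 300647710720  = 280 GiB ~ 300.6 GB, i.e. roughly 78% of that usable capacity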

The cluster kept working like this for about two weeks.
Today, during the night snapshots, the situation repeated itself; this time no PGs were reported as too full, but one of the OSDs filled up:

# ceph -s
    cluster 8a2e8300-9d27-4856-99ca-05d9a9a9009c
     health HEALTH_ERR
            1 full osd(s)
            3 near full osd(s)
monmap e1: 3 mons at {HV-01=10.10.101.11:6789/0,HV-02=10.10.101.12:6789/0,HV-03=10.10.101.13:6789/0}
            election epoch 156, quorum 0,1,2 HV-01,HV-02,HV-03
     osdmap e2185: 15 osds: 15 up, 15 in
            flags full
      pgmap v2070259: 1024 pgs, 2 pools, 882 GB data, 255 kobjects
            2028 GB used, 48823 GB / 50851 GB avail
                1024 active+clean


The Zabbix graphs show that some of the OSDs in the ssd-cache pool really do fill up much more than the others. I do not understand why this happens, and how do I calculate a correct value for target_max_bytes in this case?
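
From the CLI (assuming a Ceph release that has the command), the same per-OSD imbalance I see in Zabbix shows up with:

# ceph osd df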

For now I have changed cache_target_full_ratio from 0.8 to 0.6 for the ssd-cache pool, just in case. But how do I solve this problem at the root and still use the SSD capacity to the maximum?
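
(That change was simply the following, with "ssd-cache" again standing in for the actual cache pool name:)

# ceph osd pool set ssd-cache cache_target_full_ratio 0.6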

Please write if you have any ideas on this subject.

Thanks!
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


