Re: ceph-0.91 with KVstore rocksdb as objectstore backend


 



I have filed a related issue (http://tracker.ceph.com/issues/10583): we
need to promote db backend errors into Ceph's own log.

On Tue, Jan 20, 2015 at 11:54 PM, Haomai Wang <haomaiwang@xxxxxxxxx> wrote:
> Hi,
>
> Obviously, we can see lots of IO errors in your rocksdb log:
>
> 2015/01/20-19:08:13.452758 7f3a94b63700 (Original Log Time
> 2015/01/20-19:08:13.449529) [default] compacted to: files[5 6 50 492
> 361 0 0 ], 10822321443458.2 MB/sec, level 1, files in(4, 6) out(28526)
> MB in(135.9, 9.1) out(3803360014280.2),
> read-write-amplify(27979046152.7) write-amplify(27979046151.6) IO
> error: /var/lib/ceph/osd/ceph-0/current/059210.sst: Too many open
> files
> 2015/01/20-19:08:13.452760 7f3a94b63700 Waiting after background
> compaction error: IO error:
> /var/lib/ceph/osd/ceph-0/current/059210.sst: Too many open files,
> Accumulated background error counts: 2
> 2015/01/20-19:08:14.946634 7f3a94b63700 [WARN] Compaction error: IO
> error: /var/lib/ceph/osd/ceph-0/current/105226.sst: Too many open
> files
> 2015/01/20-19:08:14.946643 7f3a94b63700 (Original Log Time
> 2015/01/20-19:08:14.941764) [default] compacted to: files[6 6 50 492
> 361 0 0 ], 13401580825960.6 MB/sec, level 1, files in(6, 6) out(46014)
> MB in(205.9, 9.1) out(6136418344252.7),
> read-write-amplify(29808966236.2) write-amplify(29808966235.2) IO
> error: /var/lib/ceph/osd/ceph-0/current/105226.sst: Too many open
> files
> 2015/01/20-19:08:14.946646 7f3a94b63700 Waiting after background
> compaction error: IO error:
> /var/lib/ceph/osd/ceph-0/current/105226.sst: Too many open files,
> Accumulated background error counts: 3
> 2015/01/20-19:08:16.459162 7f3a94b63700 [WARN] Compaction error: IO
> error: /var/lib/ceph/osd/ceph-0/current/149702.sst: Too many open
> files
>
> Because you set "rocksdb_max_open_files = 10240" in your ceph.conf,
> rocksdb is allowed to open up to 10240 files. If the ceph-osd process
> hits the OS fd limit first, rocksdb fails to open more files and
> raises this error.
>
> So you need to increase the OS fd limit to at least
> "rocksdb_max_open_files" + "estimated network sockets in the
> osd" + "filestore_fd_cache_size".
>
> I'm not sure this is the only cause of your problem given the limited
> info, but I hope it's the root cause. :-)
>
>
> Thanks for your patience!
>
> On Tue, Jan 20, 2015 at 9:42 PM, pushpesh sharma <pushpesh.eck@xxxxxxxxx> wrote:
>> Haomai,
>>
>> PFA logs on fresh setup, with all the debug settings.
>>
>> This is what I used to dump some data:-
>>
>> rados -p benchpool1 bench 300 write  -b 4194304 -t 8 --no-cleanup
>>
>>
>> On Tue, Jan 20, 2015 at 3:59 PM, Haomai Wang <haomaiwang@xxxxxxxxx> wrote:
>>> Yeah, thank you.
>>>
>>> I think your cluster is failing to read/write from rocksdb, but your
>>> config disables the rocksdb log file. Please change:
>>> "rocksdb_info_log_level = debug"
>>> "rocksdb_log = /var/log/ceph/ceph-osd-rocksdb.log"
>>>
>>> I hope this log will explain the details.
>>>
>>> On Tue, Jan 20, 2015 at 6:09 PM, pushpesh sharma <pushpesh.eck@xxxxxxxxx> wrote:
>>>> Haomai,
>>>>
>>>> PFA logs with debug_keyvaluestore=20/20, and perf dump output.
>>>>
>>>> On Tue, Jan 20, 2015 at 2:28 PM, Haomai Wang <haomaiwang@xxxxxxxxx> wrote:
>>>>> Sorry, could you add debug_keyvaluestore=20/20 to your ceph.conf and
>>>>> run again to capture the dump logs?
>>>>>
>>>>>
>>>>> A quick look at the log suggests that keyvaluestore failed to
>>>>> submit the transaction to rocksdb.
>>>>>
>>>>> Additionally, running "ceph --admin-daemon=/var/run/ceph/[ceph-osd.*.pid]
>>>>> perf dump" would help verify this assumption.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Tue, Jan 20, 2015 at 4:53 PM, pushpesh sharma <pushpesh.eck@xxxxxxxxx> wrote:
>>>>>> Haomai,
>>>>>>
>>>>>> PFA the complete logs of one of the OSD daemons. While attempting to
>>>>>> start all the OSD daemons, I captured the log of one of them; it is
>>>>>> pasted here: http://pastebin.com/SRBJknCM .
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 20, 2015 at 12:34 PM, Haomai Wang <haomaiwang@xxxxxxxxx> wrote:
>>>>>>> I think you can find related infos from log: /var/log/ceph/osd/ceph-osd*
>>>>>>>
>>>>>>> It should help us figure this out.
>>>>>>>
>>>>>>> On Tue, Jan 20, 2015 at 2:48 PM, pushpesh sharma <pushpesh.eck@xxxxxxxxx> wrote:
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I am trying to configure rocksdb as the objectstore backend on a cluster
>>>>>>>> running ceph version 0.91-375-g2a4cbfc. I built ceph using 'make-debs.sh',
>>>>>>>> which builds the source with the --with-rocksdb option. I was able to get
>>>>>>>> the cluster up and running with rocksdb as the backend; however, as soon
>>>>>>>> as I started dumping data on the cluster using radosbench, the cluster
>>>>>>>> became miserable after just 10 seconds of write I/O. Some OSD daemons
>>>>>>>> get marked down randomly for no apparent reason. Even if I bring all
>>>>>>>> the daemons back up, after some time some daemons get marked down again
>>>>>>>> randomly, and recovery I/O then takes over from the external I/O that
>>>>>>>> ran before. What could be the cause of and solution for this behaviour?
>>>>>>>>
>>>>>>>> Some more details:
>>>>>>>>
>>>>>>>> 1. The setup is 3 OSD nodes with 10 SanDisk Optimus Eco (400GB) drives
>>>>>>>> each. The drives were working fine with the filestore backend.
>>>>>>>> 2. 3 Monitors and 1 client from which I am running RadosBench.
>>>>>>>> 3. Ubuntu14.04 on each node. (3.13.0-24-generic)
>>>>>>>> 4. I create the OSDs on each node using the script below (of course
>>>>>>>> with different osd numbers):-
>>>>>>>> ##################################
>>>>>>>> #!/bin/bash
>>>>>>>> sudo stop ceph-osd-all
>>>>>>>> ps -eaf|grep osd |awk '{print $2}'|xargs sudo kill -9
>>>>>>>> osd_num=(0 1 2 3 4 5 6 7 8 9)
>>>>>>>> drives=(sdb1 sdc1 sdd1 sde1 sdf1 sdg1 sdh1 sdi1 sdj1 sdk1)
>>>>>>>> node="rack6-storage-1"
>>>>>>>> for ((i=0;i<10;i++))
>>>>>>>> do
>>>>>>>>         sudo ceph osd rm ${osd_num[i]}
>>>>>>>>         sudo ceph osd crush rm osd.${osd_num[i]}
>>>>>>>>         sudo ceph auth del osd.${osd_num[i]}
>>>>>>>>         sudo umount -f /var/lib/ceph/osd/ceph-${osd_num[i]}
>>>>>>>>         ceph osd create
>>>>>>>>         sudo rm -rf /var/lib/ceph/osd/ceph-${osd_num[i]}
>>>>>>>>         sudo mkdir -p /var/lib/ceph/osd/ceph-${osd_num[i]}
>>>>>>>>         sudo mkfs.xfs -f -i size=2048 /dev/${drives[i]}
>>>>>>>>         sudo mount -o rw,noatime,inode64,logbsize=256k,delaylog
>>>>>>>> /dev/${drives[i]} /var/lib/ceph/osd/ceph-${osd_num[i]}
>>>>>>>>         sudo ceph osd crush add osd.${osd_num[i]} 1 root=default host=$node
>>>>>>>>         sudo ceph-osd --id ${osd_num[i]} -d --mkkey --mkfs
>>>>>>>> --osd-data /var/lib/ceph/osd/ceph-${osd_num[i]}
>>>>>>>>         ceph auth add osd.${osd_num[i]} osd 'allow *' mon 'allow
>>>>>>>> profile osd' -i /var/lib/ceph/osd/ceph-${osd_num[i]}/keyring
>>>>>>>>         sudo ceph-osd -i ${osd_num[i]}
>>>>>>>> done
>>>>>>>> ###################################
>>>>>>>>
>>>>>>>> 5. Some configs that might be relevant are as follows:-
>>>>>>>> #########
>>>>>>>> enable_experimental_unrecoverable_data_corrupting_features = keyvaluestore
>>>>>>>> osd_objectstore = keyvaluestore
>>>>>>>> keyvaluestore_backend = rocksdb
>>>>>>>> keyvaluestore queue max ops = 500
>>>>>>>> keyvaluestore queue max bytes = 100
>>>>>>>> keyvaluestore header cache size = 2048
>>>>>>>> keyvaluestore op threads = 10
>>>>>>>> keyvaluestore_max_expected_write_size = 4096000
>>>>>>>> leveldb_write_buffer_size = 33554432
>>>>>>>> leveldb_cache_size = 536870912
>>>>>>>> leveldb_bloom_size = 0
>>>>>>>> leveldb_max_open_files = 10240
>>>>>>>> leveldb_compression = false
>>>>>>>> leveldb_paranoid = false
>>>>>>>> leveldb_log = /dev/null
>>>>>>>> leveldb_compact_on_mount = false
>>>>>>>> rocksdb_write_buffer_size = 33554432
>>>>>>>> rocksdb_cache_size = 536870912
>>>>>>>> rocksdb_bloom_size = 0
>>>>>>>> rocksdb_max_open_files = 10240
>>>>>>>> rocksdb_compression = false
>>>>>>>> rocksdb_paranoid = false
>>>>>>>> rocksdb_log = /dev/null
>>>>>>>> rocksdb_compact_on_mount = false
>>>>>>>> #########
>>>>>>>>
>>>>>>>> 6. Objects get stored in *.sst files, so rocksdb seems to be configured correctly:-
>>>>>>>>
>>>>>>>> ls -l /var/lib/ceph/osd/ceph-20/current/ |more
>>>>>>>> total 3169352
>>>>>>>> -rw-r--r-- 1 root root  2128430 Jan 20 00:04 000031.sst
>>>>>>>> -rw-r--r-- 1 root root  2128430 Jan 20 00:04 000033.sst
>>>>>>>> -rw-r--r-- 1 root root  2128431 Jan 20 00:04 000035.sst
>>>>>>>> ............
>>>>>>>> 7. This is current state of cluster:-
>>>>>>>> ################
>>>>>>>> monmap e1: 3 mons at
>>>>>>>> {rack6-ramp-1=10.x.x.x:6789/0,rack6-ramp-2=10.x.x.x:6789/0,rack6-ramp-3=10.x.x.x:6789/0}
>>>>>>>> election epoch 16, quorum 0,1,2 rack6-ramp-1,rack6-ramp-2,rack6-ramp-3
>>>>>>>> osdmap e547: 30 osds: 8 up, 8 in
>>>>>>>>       pgmap v1059: 512 pgs, 1 pools, 18252 MB data, 4563 objects
>>>>>>>>             22856 MB used, 2912 GB / 2934 GB avail
>>>>>>>>             1587/13689 objects degraded (11.593%)
>>>>>>>>             419/13689 objects misplaced (3.061%)
>>>>>>>>             26/4563 unfound (0.570%)
>>>>>>>> #################
>>>>>>>>
>>>>>>>> I would be happy to provide any other information that is needed.
>>>>>>>>
>>>>>>>> --
>>>>>>>> -Pushpesh
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>>
>>>>>>> Wheat
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> -Pushpesh
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>>
>>>>> Wheat
>>>>
>>>>
>>>>
>>>> --
>>>> -Pushpesh
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>>
>>> Wheat
>>
>>
>>
>> --
>> -Pushpesh
>
>
>
> --
> Best Regards,
>
> Wheat



-- 
Best Regards,

Wheat


