Questions about the QA process and the data format of both OSD and MON

Satoru Takeuchi <satoru.takeuchi@xxxxxxxxx> · Fri, 19 Aug 2022 15:39:19 +0900

Hi,

As I described in another mail (*1), my development Ceph cluster was
corrupted by a problematic binary.
When I upgraded to v16.2.7 + some patches (*2) + the PR#45963 patch,
unfound PGs and inconsistent PGs appeared. In the end, I deleted the cluster.

  pacific: bluestore: set upper and lower bounds on rocksdb omap iterators
  https://github.com/ceph/ceph/pull/45963

This problem happened because PR#45963 causes data corruption on OSDs
that were created in octopus or older.

This patch was reverted, and the correct version (PR#46096) was applied later.

  pacific: revival and backport of fix for RocksDB optimized iterators
  https://github.com/ceph/ceph/pull/46096

The corruption happened mainly because I carelessly applied a patch that
was not well tested. To prevent the same mistake from happening again,
let me ask some questions.

a. About QA process
   a.1 In my understanding, the test cases differ between the QA for merging
       a PR and the QA for release. For example, the upgrade test is run only
       in the release QA process. Is my understanding correct?
       I thought so because the bug in #45963 was not detected in the QA
       for merging but was detected in the QA for release.
   a.2 If a.1 is correct, is it possible to run all test cases in both QA
       processes? I guess that some time-consuming tests are skipped to
       keep development efficient.
   a.3 Is there any detailed document about how to run Teuthology in the
       user's local environment? I once tried this by following the
       official document, but it didn't work well.

         https://docs.ceph.com/en/quincy/dev/developer_guide/testing_integration_tests/tests-integration-testing-teuthology-intro/#how-to-run-integration-tests

         At that time, Teuthology failed to connect to
         paddles.front.sepia.ceph.com, a host that isn't mentioned in this
         document.

         ```
         requests.exceptions.ConnectionError: HTTPConnectionPool(host='paddles.front.sepia.ceph.com', port=80): Max retries exceeded with url: /nodes/?machine_type=vps&count=1 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc945880490>: Failed to establish a new connection: [Errno 110] Connection timed out'))
         ```
b. To minimize the risk, I'd like to use the newest possible data format
    for both OSDs and MONs.
    More precisely, I'd like to re-create all OSDs and MONs whenever their
    default data format changes.
    Please let me know if there is a convenient way to find out the data
    format of each OSD and MON.

    For example, when I re-created the OSDs in my pacific cluster that had
    been created in octopus or older, I assumed that any OSD created before
    the upgrade-to-pacific date was created in octopus or older.
    It seemed to work, but I'd prefer a more straightforward way.
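
    The date-based guess described above can be sketched roughly as
    follows. This is only an illustration of the heuristic, not a Ceph
    API: the function name, the per-OSD creation timestamps, and the
    upgrade date are all hypothetical values that would have to be
    collected by hand.

```python
from datetime import datetime


def osds_created_before(created_at, upgrade_time):
    """Return the sorted IDs of OSDs whose creation time precedes upgrade_time.

    created_at:   dict mapping OSD ID -> creation datetime (hypothetical input)
    upgrade_time: datetime of the upgrade to pacific (hypothetical input)
    """
    return sorted(
        osd_id for osd_id, ts in created_at.items() if ts < upgrade_time
    )


# Made-up example data; none of these values come from a real cluster.
created_at = {
    0: datetime(2021, 3, 1),
    1: datetime(2021, 3, 1),
    2: datetime(2022, 5, 10),
}
upgrade_to_pacific = datetime(2022, 4, 1)

# OSDs that are candidates for re-creation under this heuristic.
print(osds_created_before(created_at, upgrade_to_pacific))
```

    The weakness is that the creation date is only a proxy for the
    on-disk format, which is why a direct way to query the format would
    be preferable.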

*1) https://lists.ceph.io/hyperkitty/list/dev@xxxxxxx/message/TT6ZQ5LUS54ZK4NNXSDJIOBS5A2ZFAGT/
*2) PR#43581, 44413, 45502, 45654. These patches are unrelated to the
topic of this mail.

Best,
Satoru
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx