Ceph Leadership Team meeting 2023-11-29

Hello,

- gibba nodes are used inefficiently
  - heavily used toward the end of the major release cycle (or for
    specific projects, e.g. mclock), but largely idle in the middle of
    the release cycle
  - a considerable waste of hardware resources if used only to exercise
    upgrading to some (currently reef) backport releases
  - proposal to release gibba nodes for teuthology (Patrick)
    - for special-purpose suites where jobs require more nodes and/or
      more time than usual (e.g. running for 10h with 6-8 nodes)?
      - run tests for different components on the same cluster
        concurrently; this is lacking today except for a few bits in
        the upgrade suites
    - ... or even just existing suites (Casey)
    - need Neha to weigh in as gibba cluster caretaker

- 18.2.1 blockers
  - MDS crashing on old kernel clients
    - https://github.com/ceph/ceph/pull/54677 is a temporary stop-gap
      change in the smoke and powercycle suites needed to reproduce
      the crash
      - increases the number of jobs in reef (scheduling with --subset
        would defeat the purpose of the change)
      - needs ack from core
    - https://github.com/ceph/ceph/pull/54407 is the fix
      - Venky to test with amended smoke suite, merge and hand off to
        Yuri for LRC upgrade
      - discussion on test suite changes would be held separately
  - https://tracker.ceph.com/issues/63618 (next item)

- potential data corruption in bluestore (!!!)
  - can occur under heavy fragmentation if db is co-located with the
    main device or after bluefs spillover to the main device, when the
    main device is configured with 64k alloc size
  - affects OSDs originally deployed on octopus or earlier releases
    and upgraded without being redeployed
  - a crash on ceph_assert(available >= allocated) during OSD startup
    is an indicator
    - more likely than actual data corruption? (Igor)
    - Laura to check telemetry for instances of this assert
  - assumed to be caused by https://github.com/ceph/ceph/pull/48854
    which shipped in 18.2.0 and was backported to 16.2.14 and 17.2.6,
    meaning that all release streams are vulnerable
  - tracked in https://tracker.ceph.com/issues/63618 (hit on 17.2.7)
  - https://tracker.ceph.com/issues/62282 was hit by Adam on 17.2.6,
    Igor believes the root cause to be the same
  - for now, this is a blocker for 16.2.15 and 18.2.1
    - might necessitate hot fixes (also for quincy)
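  - as a rough first check for the indicator above, locally retained
    OSD logs can be scanned for the assert signature; a minimal
    sketch (the log path glob and the exact assert wording are
    assumptions, adjust to the deployment):

```python
#!/usr/bin/env python3
# Sketch: find OSD logs containing the startup assert mentioned
# above. The default glob and the assert text are assumptions.
import glob
import re

ASSERT_RE = re.compile(r"ceph_assert\(available >= allocated\)")

def affected_logs(pattern="/var/log/ceph/ceph-osd.*.log"):
    """Return log files that contain the startup assert."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path, errors="replace") as f:
                if any(ASSERT_RE.search(line) for line in f):
                    hits.append(path)
        except OSError:
            pass  # rotated/compressed logs are skipped, not scanned
    return hits

if __name__ == "__main__":
    for path in affected_logs():
        print(path)
```

    (telemetry remains the authoritative source; this only covers
    logs still on disk)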

- regression for RHEL tests on main ("nothing provides lua-devel")
  - https://tracker.ceph.com/issues/63672

- 42 pacific PRs left to be triaged
  - https://github.com/ceph/ceph/pulls?q=is%3Aopen+is%3Apr+milestone%3Apacific
  - move to the v16.2.15 milestone, or close the PR and reject the
    backport
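  - the same milestone query can be driven from a script when
    triaging in bulk; a minimal sketch against the GitHub search API
    (pagination and authentication omitted, `open_backport_prs` is a
    hypothetical helper name):

```python
#!/usr/bin/env python3
# Sketch: list open ceph/ceph PRs on a given milestone via the
# GitHub search API -- the same query as the web link above.
import json
import urllib.request
from urllib.parse import quote_plus

def build_search_url(milestone="pacific", repo="ceph/ceph"):
    """Build the search API URL for open PRs on a milestone."""
    q = f"repo:{repo} is:open is:pr milestone:{milestone}"
    return "https://api.github.com/search/issues?q=" + quote_plus(q)

def open_backport_prs(milestone="pacific"):
    """Fetch (number, title) pairs for the first page of results."""
    with urllib.request.urlopen(build_search_url(milestone)) as resp:
        data = json.load(resp)
    return [(item["number"], item["title"]) for item in data["items"]]
```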

Thanks,

                Ilya