TECHSTEP

Articles on IT infrastructure.

An Introduction to the Rook-Ceph Cleanup Policy

Introduction

One of the features added in Rook v1.3 is the Rook-Ceph Cleanup Policy. When a CephCluster is deleted, it removes the data in the directory specified by dataDirHostPath.

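Before Rook v1.3, deleting a cluster (after testing, for example) meant following a procedure like the one on this page, largely by hand or with a home-grown script. In particular, dataDirHostPath holds the Ceph cluster's configuration and log data. If a new cluster is created while that data is still in place, the new cluster inherits the settings of the deleted one and, as a result, fails to come up correctly.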

The newly added Cleanup Policy is expected to resolve this problem and make cluster deletion a little easier.

Overview of the Cleanup Policy

First, here is a summary of the Cleanup Policy, based on the content of this page.

Reference link:

GitHub - rook/rook: Ceph cluster clean up policy

Use case

The use case is a user intentionally uninstalling a Rook-Ceph cluster.

Confirming user consent

Before the dataDirHostPath directory is deleted, the user must explicitly enable this behavior. This is because deleting dataDirHostPath cannot be undone if the user deleted the CR by mistake.

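Specifically, once a value is set under spec.cleanupPolicy in the CephCluster CR, the operator stops starting any further orchestration, and when the cluster is then deleted, the data in the directory specified by dataDirHostPath is removed.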

How the operator cleans up the cluster

When a cluster is deleted, the operator removes the data in dataDirHostPath as follows.

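  • When a deletionTimestamp is present on the Ceph cluster, the operator starts the cleanup.
  • Before cleaning up, the operator checks the cleanup policy setting.
  • It identifies the nodes on which Ceph daemons are running.
  • It waits until the Ceph daemons on each node have been destroyed, because the daemons panic if dataDirHostPath is deleted before they are.
  • It creates a batch Job to run on each node.
  • The Job does the following:
    • Deletes the namespace directory under dataDirHostPath
    • Deletes the Ceph monitor directories under dataDirHostPath
    • Cleans up the devices on each node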
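
The directory-removal part of the Job can be pictured with the shell sketch below. This is not Rook's implementation: the function name cleanup_data_dir is hypothetical, and the mon-* pattern is an assumption based on the mon-a directory that appears under /var/lib/rook in the listing later in this article. The sketch runs only against a throwaway temporary directory and does not touch devices.

```shell
#!/bin/sh
set -eu

# Hypothetical helper (not Rook's code): remove the per-namespace directory
# and the monitor directories under a dataDirHostPath-like directory.
cleanup_data_dir() {
  data_dir="$1"
  namespace="$2"
  # Refuse obviously dangerous arguments.
  if [ -z "$data_dir" ] || [ "$data_dir" = "/" ] || [ -z "$namespace" ]; then
    echo "refusing to clean '$data_dir' (namespace '$namespace')" >&2
    return 1
  fi
  # Cluster namespace directory, e.g. <dataDirHostPath>/rook-ceph
  rm -rf -- "$data_dir/$namespace"
  # Monitor directories, assumed here to be named mon-<id>, e.g. mon-a
  rm -rf -- "$data_dir"/mon-*
}

# Demo on a temporary directory that mimics the layout in this article.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/mon-a" "$demo_dir/rook-ceph"
cleanup_data_dir "$demo_dir" rook-ceph
remaining="$(ls -A "$demo_dir" | wc -l | tr -d ' ')"
echo "remaining entries: $remaining"
```

The guard at the top reflects the same concern as the confirmation value in cleanupPolicy: the deletion is irreversible, so the destructive step should be hard to trigger by accident.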

The Job started during cleanup is defined as follows.

Cleanup Job Spec

apiVersion: batch/v1
kind: Job
metadata:
  name: rook-ceph-cleanup-<node-name>
spec:
  template:
    spec:
      containers:
        - name: rook-ceph-cleanup-<node-name>
          securityContext:
            privileged: true
          image: <rook-image>
          env:
          # if ROOK_DATA_DIR_HOST_PATH is available, then delete the dataDirHostPath
          - name: ROOK_DATA_DIR_HOST_PATH
            value: <dataDirHostPath>
          args: ["ceph", "clean"]
          volumeMounts:
            - name: cleanup-volume
              # data dir host path that needs to be cleaned up.
              mountPath: <dataDirHostPath>
      volumes:
        - name: cleanup-volume
          hostPath:
            #directory location on the host
            path: <dataDirHostPath>
      restartPolicy: Never

Using the Cleanup Policy

Now let's actually use the Cleanup Policy, following this page of the official documentation.

Reference link:

Rook Docs v1.3 - Ceph Cluster CRD

Test environment

The test environment is as follows.

  • Kubernetes:
    • version: v1.17.4
    • master: 1 node
    • worker: 1 node
  • Rook:
    • version: v1.3

Testing the Cleanup Policy

First, build a Rook-Ceph cluster. I used a host-based cluster this time. The state after the build is as follows.

[root@rookmaster ceph]# kubectl get pods -n rook-ceph
NAME                                                   READY   STATUS        RESTARTS   AGE
csi-cephfsplugin-hc7nq                                 3/3     Running       0          3m14s
csi-cephfsplugin-provisioner-674847b584-scb8s          5/5     Running       0          3m14s
csi-cephfsplugin-provisioner-674847b584-xdhgd          5/5     Running       0          3m14s
csi-rbdplugin-9lsmt                                    3/3     Running       0          3m15s
csi-rbdplugin-provisioner-5777f9cf96-9ls9r             6/6     Running       0          3m15s
csi-rbdplugin-provisioner-5777f9cf96-pgswq             6/6     Running       0          3m15s
rook-ceph-crashcollector-rookworker-697d74cc96-xxvss   1/1     Terminating   0          89s
rook-ceph-crashcollector-rookworker-cb898d58-5kh9m     1/1     Running       0          29s
rook-ceph-mgr-a-6c9b758679-ts69c                       1/1     Running       0          89s
rook-ceph-mon-a-7977674f5f-f52hg                       1/1     Running       0          99s
rook-ceph-operator-599765ff49-fn858                    1/1     Running       0          8m54s
rook-ceph-osd-0-6d79874c88-2cn62                       1/1     Running       0          29s
rook-ceph-osd-prepare-rookworker-526t9                 0/1     Completed     0          68s
rook-discover-mvp9m                                    1/1     Running       0          8m37s


[root@rookmaster ceph]# kubectl get cephcluster.ceph.rook.io -n rook-ceph
NAME        DATADIRHOSTPATH   MONCOUNT   AGE     PHASE   MESSAGE                        HEALTH
rook-ceph   /var/lib/rook     1          5m14s   Ready   Cluster created successfully   HEALTH_WARN

# Create the toolbox
[root@rookmaster ceph]# kubectl apply -f toolbox.yaml
deployment.apps/rook-ceph-tools created

[root@rookmaster ceph]# kubectl exec -it -n rook-ceph $(kubectl get pods -n rook-ceph -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph -s
  cluster:
    id:     58fb05e3-de72-435f-af4d-74e774d25df6
    health: HEALTH_WARN
            OSD count 1 < osd_pool_default_size 3

  services:
    mon: 1 daemons, quorum a (age 5m)
    mgr: a(active, since 4m)
    osd: 1 osds: 1 up (since 4m), 1 in (since 4m)

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   1.0 GiB used, 63 GiB / 64 GiB avail
    pgs:

[root@rookmaster ceph]#

The cluster-test.yaml used to create the cluster is shown below. dataDirHostPath is set to /var/lib/rook.

cluster-test.yaml

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    # The container image used to launch the Ceph daemon pods (mon, mgr, osd, mds, rgw).
    # v13 is mimic, v14 is nautilus, and v15 is octopus.
    # RECOMMENDATION: In production, use a specific version tag instead of the general v14 flag, which pulls the latest release and could result in different
    # versions running within the cluster. See tags available at https://hub.docker.com/r/ceph/ceph/tags/.
    # If you want to be more precise, you can always use a timestamp tag such ceph/ceph:v14.2.5-20190917
    # This tag might not contain a new Ceph version, just security fixes from the underlying operating system, which will reduce vulnerabilities
    image: ceph/ceph:v14.2.9
    # Whether to allow unsupported versions of Ceph. Currently mimic and nautilus are supported, with the recommendation to upgrade to nautilus.
    # Octopus is the version allowed when this is set to true.
    # Do not set to true in production.
    allowUnsupported: false
  # The path on the host where configuration files will be persisted. Must be specified.
  # Important: if you reinstall the cluster, make sure you delete this directory from each host or else the mons will fail to start on the new cluster.
  # In Minikube, the '/data' directory is configured to persist across reboots. Use "/data/rook" in Minikube environment.
  dataDirHostPath: /var/lib/rook
  # Whether or not upgrade should continue even if a check fails
  # This means Ceph's status could be degraded and we don't recommend upgrading but you might decide otherwise
  # Use at your OWN risk
  # To understand Rook's upgrade process of Ceph, read https://rook.io/docs/rook/master/ceph-upgrade.html#ceph-version-upgrades
  skipUpgradeChecks: false
  # Whether or not continue if PGs are not clean during an upgrade
  continueUpgradeAfterChecksEvenIfNotHealthy: false
  # set the amount of mons to be started
  mon:
    count: 1
    allowMultiplePerNode: false
  # mgr:
    # modules:
    # Several modules should not need to be included in this list. The "dashboard" and "monitoring" modules
    # are already enabled by other settings in the cluster CR and the "rook" module is always enabled.
    # - name: pg_autoscaler
    #   enabled: true
  # enable the ceph dashboard for viewing cluster status
  dashboard:
    enabled: true
    # serve the dashboard under a subpath (useful when you are accessing the dashboard via a reverse proxy)
    # urlPrefix: /ceph-dashboard
    # serve the dashboard at the given port.
    # port: 8443
    # serve the dashboard using SSL
    ssl: true
  # enable prometheus alerting for cluster
  monitoring:
    # requires Prometheus to be pre-installed
    enabled: false
    # namespace to deploy prometheusRule in. If empty, namespace of the cluster will be used.
    # Recommended:
    # If you have a single rook-ceph cluster, set the rulesNamespace to the same namespace as the cluster or keep it empty.
    # If you have multiple rook-ceph clusters in the same k8s cluster, choose the same namespace (ideally, namespace with prometheus
    # deployed) to set rulesNamespace for all the clusters. Otherwise, you will get duplicate alerts with multiple alert definitions.
    rulesNamespace: rook-ceph
  network:
    # enable host networking
    #provider: host
    # EXPERIMENTAL: enable the Multus network provider
    #provider: multus
    #selectors:
      # The selector keys are required to be `public` and `cluster`.
      # Based on the configuration, the operator will do the following:
      #   1. if only the `public` selector key is specified both public_network and cluster_network Ceph settings will listen on that interface
      #   2. if both `public` and `cluster` selector keys are specified the first one will point to 'public_network' flag and the second one to 'cluster_network'
      #
      # In order to work, each selector value must match a NetworkAttachmentDefinition object in Multus
      #
      #public: public-conf --> NetworkAttachmentDefinition object name in Multus
      #cluster: cluster-conf --> NetworkAttachmentDefinition object name in Multus
  rbdMirroring:
    # The number of daemons that will perform the rbd mirroring.
    # rbd mirroring must be configured with "rbd mirror" from the rook toolbox.
    workers: 0
  # enable the crash collector for ceph daemon crash collection
  crashCollector:
    disable: false
  cleanupPolicy:
    # cleanupPolicy should only be added to the cluster when the cluster is about to be deleted.
    # After any field of the cleanup policy is set, Rook will stop configuring the cluster as if the cluster is about
    # to be destroyed in order to prevent these settings from being deployed unintentionally.
    # To signify that automatic deletion is desired, use the value "yes-really-destroy-data". Only this and an empty
    # string are valid values for this field.
    deleteDataDirOnHosts: ""
  # To control where various services will be scheduled by kubernetes, use the placement configuration sections below.
  # The example under 'all' would have all services scheduled on kubernetes nodes labeled with 'role=storage-node' and
  # tolerate taints with a key of 'storage-node'.
#  placement:
#    all:
#      nodeAffinity:
#        requiredDuringSchedulingIgnoredDuringExecution:
#          nodeSelectorTerms:
#          - matchExpressions:
#            - key: role
#              operator: In
#              values:
#              - storage-node
#      podAffinity:
#      podAntiAffinity:
#      topologySpreadConstraints:
#      tolerations:
#      - key: storage-node
#        operator: Exists
# The above placement information can also be specified for mon, osd, and mgr components
#    mon:
# Monitor deployments may contain an anti-affinity rule for avoiding monitor
# collocation on the same node. This is a required rule when host network is used
# or when AllowMultiplePerNode is false. Otherwise this anti-affinity rule is a
# preferred rule with weight: 50.
#    osd:
#    mgr:
  annotations:
#    all:
#    mon:
#    osd:
# If no mgr annotations are set, prometheus scrape annotations will be set by default.
#   mgr:
  resources:
# The requests and limits set here, allow the mgr pod to use half of one CPU core and 1 gigabyte of memory
#    mgr:
#      limits:
#        cpu: "500m"
#        memory: "1024Mi"
#      requests:
#        cpu: "500m"
#        memory: "1024Mi"
# The above example requests/limits can also be added to the mon and osd components
#    mon:
#    osd:
#    prepareosd:
#    crashcollector:
  # The option to automatically remove OSDs that are out and are safe to destroy.
  removeOSDsIfOutAndSafeToRemove: false
#  priorityClassNames:
#    all: rook-ceph-default-priority-class
#    mon: rook-ceph-mon-priority-class
#    osd: rook-ceph-osd-priority-class
#    mgr: rook-ceph-mgr-priority-class
  storage: # cluster level storage configuration and selection
    useAllNodes: true
    useAllDevices: false
    devices:
    - name: "sdc"
    #deviceFilter:
    config:
      # metadataDevice: "md0" # specify a non-rotational storage so ceph-volume will use it as block db device of bluestore.
      # databaseSizeMB: "1024" # uncomment if the disks are smaller than 100 GB
      # journalSizeMB: "1024"  # uncomment if the disks are 20 GB or smaller
      # osdsPerDevice: "1" # this value can be overridden at the node or device level
      # encryptedDevice: "true" # the default value for this option is "false"
# Individual nodes and their config can be specified as well, but 'useAllNodes' above must be set to false. Then, only the named
# nodes below will be used as storage resources.  Each node's 'name' field should match their 'kubernetes.io/hostname' label.
#    nodes:
#    - name: "172.17.4.201"
#      devices: # specific devices to use for storage can be specified for each node
#      - name: "sdb"
#      - name: "nvme01" # multiple osds can be created on high performance devices
#        config:
#          osdsPerDevice: "5"
#      - name: "/dev/disk/by-id/ata-ST4000DM004-XXXX" # devices can be specified using full udev paths
#      config: # configuration can be specified at the node level which overrides the cluster level config
#        storeType: filestore
#    - name: "172.17.4.301"
#      deviceFilter: "^sd."
  # The section for configuring management of daemon disruptions during upgrade or fencing.
  disruptionManagement:
    # If true, the operator will create and manage PodDisruptionBudgets for OSD, Mon, RGW, and MDS daemons. OSD PDBs are managed dynamically
    # via the strategy outlined in the [design](https://github.com/rook/rook/blob/master/design/ceph/ceph-managed-disruptionbudgets.md). The operator will
    # block eviction of OSDs by default and unblock them safely when drains are detected.
    managePodBudgets: false
    # A duration in minutes that determines how long an entire failureDomain like `region/zone/host` will be held in `noout` (in addition to the
    # default DOWN/OUT interval) when it is draining. This is only relevant when  `managePodBudgets` is `true`. The default value is `30` minutes.
    osdMaintenanceTimeout: 30
    # If true, the operator will create and manage MachineDisruptionBudgets to ensure OSDs are only fenced when the cluster is healthy.
    # Only available on OpenShift.
    manageMachineDisruptionBudgets: false
    # Namespace in which to watch for the MachineDisruptionBudgets.
    machineDisruptionBudgetNamespace: openshift-machine-api

Editing the CephCluster

Next, edit the CephCluster to enable the cleanupPolicy. The setting itself is very simple: setting spec.cleanupPolicy.deleteDataDirOnHosts to yes-really-destroy-data is all it takes.

# State before the change
[root@rookmaster ceph]# kubectl describe cephcluster.ceph.rook.io rook-ceph -n rook-ceph
Name:         rook-ceph
Namespace:    rook-ceph
Labels:       <none>

(omitted)

Spec:
  Ceph Version:
    Image:  ceph/ceph:v14.2.9
  Cleanup Policy:
    Delete Data Dir On Hosts:  
  Crash Collector:
    Disable:  false

(omitted)

[root@rookmaster ceph]# 


# Make the change
[root@rookmaster ceph]# kubectl edit cephcluster.ceph.rook.io -n rook-ceph

spec:
  cephVersion:
    image: ceph/ceph:v14.2.9
  cleanupPolicy:
    deleteDataDirOnHosts: "yes-really-destroy-data" # add this line
  crashCollector:
    disable: false

cephcluster.ceph.rook.io/rook-ceph edited
[root@rookmaster ceph]#


# State after the change
[root@rookmaster ceph]# kubectl describe cephcluster.ceph.rook.io rook-ceph -n rook-ceph
Name:         rook-ceph
Namespace:    rook-ceph
Labels:       <none>

(omitted)

Spec:
  Ceph Version:
    Image:  ceph/ceph:v14.2.9
  Cleanup Policy:
    Delete Data Dir On Hosts:  yes-really-destroy-data
  Crash Collector:
    Disable:  false

(omitted)

[root@rookmaster ceph]# 
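
As an alternative to the interactive kubectl edit, the same field could probably be set non-interactively with kubectl patch. The following is a sketch, not something run in this test: the cluster name and namespace are the ones used in this article, and APPLY is a hypothetical switch introduced here so that the script only prints the command unless APPLY=1 is set.

```shell
#!/bin/sh
set -eu

# Values taken from this article; adjust for other clusters.
CLUSTER_NAME="rook-ceph"
CLUSTER_NAMESPACE="rook-ceph"

# Merge patch that sets the confirmation value on the Rook v1.3 field.
PATCH='{"spec":{"cleanupPolicy":{"deleteDataDirOnHosts":"yes-really-destroy-data"}}}'

if [ "${APPLY:-0}" = "1" ]; then
  # Custom resources need a merge (or JSON) patch rather than the default strategic patch.
  kubectl -n "$CLUSTER_NAMESPACE" patch cephcluster.ceph.rook.io "$CLUSTER_NAME" \
    --type merge -p "$PATCH"
else
  # Dry mode (assumption of this sketch): only show what would be executed.
  echo "would run: kubectl -n $CLUSTER_NAMESPACE patch cephcluster.ceph.rook.io $CLUSTER_NAME --type merge -p '$PATCH'"
fi
```

Printing by default keeps the destructive confirmation from being applied by accident, which matches the intent of requiring the yes-really-destroy-data value.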

Deleting the cluster

With the cleanupPolicy enabled, delete the cluster. As shown below, a cluster-cleanup-job starts and deletes the data in dataDirHostPath.

# Before deleting the cluster
[root@rookworker ~]# ll /var/lib/rook/
total 0
drwxr-xr-x 3 root root 18 Apr 19 08:55 mon-a
drwxr-xr-x 4 root root 82 Apr 19 08:56 rook-ceph
[root@rookworker ~]#

# Delete the cluster

[root@rookmaster ceph]# kubectl delete -f cluster-test.yaml
cephcluster.ceph.rook.io "rook-ceph" deleted


[root@rookmaster ceph]# kubectl get pods -n rook-ceph -w
NAME                                                 READY   STATUS        RESTARTS   AGE
csi-cephfsplugin-hc7nq                               3/3     Running       0          11m
csi-cephfsplugin-provisioner-674847b584-scb8s        5/5     Running       0          11m
csi-cephfsplugin-provisioner-674847b584-xdhgd        5/5     Running       0          11m
csi-rbdplugin-9lsmt                                  3/3     Running       0          11m
csi-rbdplugin-provisioner-5777f9cf96-9ls9r           6/6     Running       0          11m
csi-rbdplugin-provisioner-5777f9cf96-pgswq           6/6     Running       0          11m
rook-ceph-crashcollector-rookworker-cb898d58-5kh9m   1/1     Terminating   0          8m57s
rook-ceph-mgr-a-6c9b758679-ts69c                     0/1     Terminating   0          9m57s
rook-ceph-mon-a-7977674f5f-f52hg                     0/1     Terminating   0          10m
rook-ceph-operator-599765ff49-fn858                  1/1     Running       0          17m
rook-ceph-tools-877c4d966-7ptvf                      1/1     Running       0          6m19s
rook-discover-mvp9m                                  1/1     Running       0          17m
rook-ceph-mon-a-7977674f5f-f52hg                     0/1     Terminating   0          10m
rook-ceph-mon-a-7977674f5f-f52hg                     0/1     Terminating   0          10m
rook-ceph-mgr-a-6c9b758679-ts69c                     0/1     Terminating   0          9m59s
rook-ceph-mgr-a-6c9b758679-ts69c                     0/1     Terminating   0          9m59s
# the cleanup job starts
cluster-cleanup-job-rookworker-f2nm2                 0/1     Pending       0          0s
cluster-cleanup-job-rookworker-f2nm2                 0/1     Pending       0          0s
cluster-cleanup-job-rookworker-f2nm2                 0/1     ContainerCreating   0          0s
cluster-cleanup-job-rookworker-f2nm2                 0/1     Completed           0          1s
rook-ceph-crashcollector-rookworker-cb898d58-5kh9m   0/1     Terminating         0          9m23s
rook-ceph-crashcollector-rookworker-cb898d58-5kh9m   0/1     Terminating         0          9m29s
rook-ceph-crashcollector-rookworker-cb898d58-5kh9m   0/1     Terminating         0          9m29s
^C[root@rookmaster ceph]#
[root@rookmaster ceph]#


# After deleting the cluster

[root@rookworker ~]# ll /var/lib/rook
total 0
[root@rookworker ~]#


[root@rookmaster ceph]# kubectl get pods -n rook-ceph
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-cleanup-job-rookworker-f2nm2            0/1     Completed   0          2m13s
csi-cephfsplugin-hc7nq                          3/3     Running     0          13m
csi-cephfsplugin-provisioner-674847b584-scb8s   5/5     Running     0          13m
csi-cephfsplugin-provisioner-674847b584-xdhgd   5/5     Running     0          13m
csi-rbdplugin-9lsmt                             3/3     Running     0          14m
csi-rbdplugin-provisioner-5777f9cf96-9ls9r      6/6     Running     0          14m
csi-rbdplugin-provisioner-5777f9cf96-pgswq      6/6     Running     0          14m
rook-ceph-operator-599765ff49-fn858             1/1     Running     0          19m
rook-ceph-tools-877c4d966-7ptvf                 1/1     Running     0          8m36s
rook-discover-mvp9m                             1/1     Running     0          19m
[root@rookmaster ceph]#
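
Because leftover data in dataDirHostPath prevents a new cluster from being created correctly, it may be worth checking each node before reinstalling. Below is a minimal sketch of such a check. The helper name data_dir_is_clean is hypothetical, and the demo uses a temporary directory rather than the real /var/lib/rook.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: succeeds when the directory is missing or has no entries.
data_dir_is_clean() {
  dir="$1"
  if [ ! -e "$dir" ]; then
    return 0
  fi
  if [ -z "$(ls -A "$dir")" ]; then
    return 0
  fi
  return 1
}

# Demo: an empty directory counts as clean, one with a leftover mon-a does not.
demo_dir="$(mktemp -d)"
if data_dir_is_clean "$demo_dir"; then state_empty=yes; else state_empty=no; fi
mkdir "$demo_dir/mon-a"
if data_dir_is_clean "$demo_dir"; then state_leftover=yes; else state_leftover=no; fi
echo "empty dir clean: $state_empty, with leftover mon-a clean: $state_leftover"
```

On a real worker node the function would be called with the cluster's dataDirHostPath, for example data_dir_is_clean /var/lib/rook.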

References

GitHub - rook/rook: Ceph cluster clean up policy

Rook Docs v1.3 - Ceph Cluster CRD