Post-Migration Operations: Monitoring, DR, Capacity

This page describes the operational surface of a Bifrost migration: what to monitor during the program, how to back up and restore the platform components Bifrost provisions, how to size the target stack, how to upgrade it after migration, how to operate the platform for multiple tenants, and how to confirm readiness before the first production cutover.

Monitoring and Alerting

Bifrost ships with a complete metrics, logging, and alerting stack that is deployed by bifrost modernize land (or bifrost direct land).

Metrics stack

The metrics foundation is Prometheus and Grafana, deployed via the Prometheus Operator. Every platform component exposes metrics through a well-defined source:

| Component | Metrics source | Key metrics |
| --- | --- | --- |
| Spark | JMX exporter and Spark History Server | Executor count, shuffle read and write, GC time, task duration |
| Trino | Built-in JMX endpoint | Running queries, blocked queries, CPU time, memory pool usage |
| Object storage | Object storage exporter | OSD latency, IOPS, pool utilization, placement-group state, cluster health |
| Airflow | StatsD exporter | DAG run duration, task success and failure rates, scheduler heartbeat |
| Catalog | Micrometer | Catalog request latency, active connections, error rate |
| Ilum | Built-in Prometheus endpoint | Job count, session count, cluster health |
| Shuffle service | JMX exporter | Shuffle throughput, partition count, worker health |

Pre-built Grafana dashboards

Bifrost installs the following dashboards during bifrost modernize land:

  • Migration Progress — tables migrated per wave, storage migrated, workloads converted, services replaced. Executive-level view.
  • Data Quality — row-count parity per table, data-diff pass rates, schema drift alerts.
  • Object Storage Health — OSD latency heatmaps, pool utilization, recovery progress, IOPS.
  • Trino Performance — query latency p50/p95/p99, queue depth, CPU and memory per cluster.
  • Spark Job Metrics — job duration trends, executor utilization, shuffle data volume.
  • Airflow Operations — DAG success rates, task duration, scheduler lag.

Bifrost controller logs

The Bifrost controller host writes two log streams:

  • Primary log — /var/log/bifrost/bifrost.log. Structured JSON, one event per line, with timestamp, level, command, phase, cluster, message, and an optional verdict field on gate events. Rotated daily with 30-day retention by default.
  • Command transcripts — /var/log/bifrost/transcripts/<cluster>/<run-id>/. Each run captures the full stdout and stderr of every sub-process invocation (Spark jobs, DistCp runs, Helm operations) and is retained for post-mortem analysis.

Log level defaults to INFO. Override with --verbose (equivalent to DEBUG) on any command, or set BIFROST_LOG_LEVEL=DEBUG to raise the level globally. The --log-file flag overrides the primary log location for a single invocation.
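Because the primary log is one JSON object per line, gate decisions can be extracted with a few lines of stdlib Python. This is a sketch against the field names described above (timestamp, level, command, phase, cluster, message, and the optional verdict on gate events); the sample events are invented for illustration.

```python
import json

def gate_verdicts(lines):
    """Yield (timestamp, phase, verdict) for gate events in bifrost.log.

    Only gate events carry the optional "verdict" field, so its presence
    is used as the filter.
    """
    for line in lines:
        event = json.loads(line)
        if "verdict" in event:
            yield event["timestamp"], event["phase"], event["verdict"]

# Illustrative sample events, not real Bifrost output.
sample = [
    '{"timestamp": "2025-01-10T02:00:00Z", "level": "INFO", "command": "modernize",'
    ' "phase": "wave-1", "cluster": "prod", "message": "gate passed", "verdict": "PROCEED"}',
    '{"timestamp": "2025-01-10T02:05:00Z", "level": "INFO", "command": "modernize",'
    ' "phase": "wave-1", "cluster": "prod", "message": "copy started"}',
]
print(list(gate_verdicts(sample)))
```

The same pattern feeds the audit-trail reviews described under Regulatory Compliance below: filter on verdict, group by phase, and retain the output alongside the transcripts.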

Log aggregation

Log aggregation uses Loki and Promtail. Promtail runs as a DaemonSet, collecting container logs from every pod. LogQL queries enable cross-component correlation by labels (pod name, namespace, container).

Why label-based indexing rather than full text: Loki indexes only labels, not message content, which makes storage significantly cheaper. Label-based filtering on pod name, namespace, and container covers the large majority of operational queries in Kubernetes log aggregation; full-text search is rarely needed.
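As an illustration of what a label-based query looks like, the sketch below builds a request URL for Loki's HTTP query_range API with a LogQL label selector. The Loki service hostname and the specific label values are assumptions for the example; the endpoint path and the query parameter are Loki's standard API shape.

```python
from urllib.parse import urlencode

# Hypothetical in-cluster Loki address; adjust to your deployment.
base = "http://loki.monitoring.svc:3100/loki/api/v1/query_range"

# LogQL: select by the labels Promtail attaches, then grep within results.
selector = '{namespace="trino", container="coordinator"} |= "error"'

url = base + "?" + urlencode({"query": selector, "limit": "100"})
print(url)
```

The label selector narrows the stream set using the index; the `|=` filter is applied only to the logs already selected, which is why label hygiene matters more than log volume.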

Alerting rules

Bifrost ships a default set of alerting rules:

- alert: CephHealthError
  expr: ceph_health_status == 2
  for: 5m
  annotations:
    summary: "Object storage cluster health is ERROR"

- alert: TrinoCoordinatorDown
  expr: up{job="trino-coordinator"} == 0
  for: 2m

- alert: TableMigrationFailed
  expr: bifrost_table_migration_failures_total > 0
  for: 1m

- alert: CephPoolNearFull
  expr: ceph_pool_percent_used > 80
  for: 15m

Rules are delivered as Prometheus AlertManager configuration and routed to the same Slack and IT service management channels Bifrost uses for migration notifications.

Backup and Disaster Recovery

Component backup matrix

| Component | Data | Method | Frequency |
| --- | --- | --- | --- |
| Catalog Postgres | Table metadata, locations, RBAC | pg_dump via CronJob | Hourly |
| Airflow Postgres | DAG runs, connections, variables, pools | pg_dump via CronJob | Hourly |
| OpenMetadata database | Entities, lineage, glossary | pg_dump via CronJob | Daily |
| Keycloak Postgres | Users, roles, clients, realm config | pg_dump via CronJob | Daily |
| Object storage (data) | Lakehouse data (Iceberg tables) | Block-level mirroring or filesystem snapshots | Depends on RPO |
| Object storage (config) | CRUSH map, pool config, auth keys | Native Ceph config and auth export | Daily |
| Kubernetes etcd | Cluster state, CRDs, Secrets | etcdctl snapshot save | Every 6 hours |
| Ilum MongoDB | Job metadata, cluster config | mongodump via CronJob | Daily |
| Kafka | Topic data, consumer offsets | MirrorMaker 2 to a DR cluster (if needed) | Continuous |
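As a sketch of what an hourly pg_dump wrapper inside one of those CronJobs might assemble, the snippet below builds the dump command and a timestamped object key. The command shape, flags, and the backups/ key layout are illustrative assumptions, not Bifrost defaults; host and credential handling are deliberately omitted.

```python
from datetime import datetime, timezone

def pg_dump_command(db, user="postgres"):
    """Return (argv, object_key) for a timestamped logical backup.

    Assumed layout: backups/<db>/<db>-<UTC timestamp>.dump on the
    S3-compatible target. Connection details come from the environment
    (PGHOST, PGPASSWORD) in a real CronJob.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    key = f"backups/{db}/{db}-{stamp}.dump"
    argv = ["pg_dump", "--format=custom", f"--username={user}", db]
    return argv, key

cmd, key = pg_dump_command("catalog")
print(cmd, key)
```

A custom-format dump (`--format=custom`) is already compressed and supports selective restore with pg_restore, which is why it is the usual choice for scheduled logical backups.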

Disaster recovery for object storage

For multi-site DR, the object storage layer supports:

  • Multi-site RGW — built-in asynchronous replication between zones with configurable sync granularity.
  • Block-level mirroring — synchronous or asynchronous replication at the RBD level.
  • Application-level backup — since Iceberg tables are immutable Parquet files plus metadata, backing up the catalog database together with the object storage is sufficient for full recovery.

Platform-level backup with Velero

For Kubernetes resource backup (CRDs, ConfigMaps, Secrets, PVCs), Velero is the recommended tool. Velero writes to an S3-compatible target — either the same object storage cluster as the lakehouse or a dedicated target.

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket velero-backups \
  --secret-file ./velero-credentials \
  --backup-location-config region=default,s3ForcePathStyle=true,s3Url=https://rgw.example.internal:8080

velero schedule create bifrost-daily \
  --schedule="0 2 * * *" \
  --include-namespaces bifrost,lakehouse,airflow,trino

Pin the velero/velero-plugin-for-aws tag to the version matrix published in the velero-plugin-for-aws releases. Plugin v1.10.x pairs with Velero 1.14.x (the example above). Customers on Velero 1.13 must use plugin v1.9.x; Velero 1.15+ tracks plugin v1.11.x. Mismatched pairings fail on install.
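A pre-install check can encode the pairings listed above so a mismatched plugin tag is caught before Velero is installed. The table below contains only the pairings stated in this section; anything outside it is treated as unknown rather than guessed.

```python
# Velero minor series -> compatible velero-plugin-for-aws series,
# exactly as documented above. Extend from the upstream version matrix.
PLUGIN_FOR_VELERO = {"1.13": "v1.9", "1.14": "v1.10", "1.15": "v1.11"}

def plugin_series(velero_version: str) -> str:
    """Return the plugin minor series for a given Velero version string."""
    minor = ".".join(velero_version.split(".")[:2])
    try:
        return PLUGIN_FOR_VELERO[minor]
    except KeyError:
        raise ValueError(f"no known plugin pairing for Velero {velero_version}")

print(plugin_series("1.14.1"))  # v1.10
```

Run the check in CI against the image tags in your Helm values so a version bump to either component forces the other to move with it.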

Data Encryption

Encryption at rest

| Layer | Mechanism | Configuration |
| --- | --- | --- |
| Object storage OSD | BlueStore encryption via dm-crypt / LUKS | spec.storage.config.encryptedDevice: true in the storage CRD |
| Object storage (S3) | SSE-S3 | Enable per-bucket default encryption |
| Postgres (all databases) | Block-device encryption | Inherited from the underlying storage encryption |
| etcd | Kubernetes etcd encryption at rest | --encryption-provider-config on the Kubernetes API server |

Encryption in transit

| Path | Mechanism |
| --- | --- |
| Client to ingress | TLS 1.3 with cert-manager and Let's Encrypt or an internal CA |
| Ingress to services | TLS via cert-manager internal certificates |
| Trino coordinator to workers | Internal TLS (internal-communication.https.required=true) |
| Spark driver to executors | Spark internal encryption (spark.network.crypto.enabled=true) |
| Spark to object storage | HTTPS (S3A endpoint configured with https://) |
| Object storage OSD to OSD | On-wire encryption via the secure cluster mode |

Secret management

Three options are supported. Customers can mix them as needed.

  • Kubernetes Secrets (default) — secrets stored as Kubernetes Secret resources and encrypted at rest via etcd encryption.
  • HashiCorp Vault — the External Secrets Operator (ESO) syncs Vault secrets to Kubernetes Secrets. Vault handles rotation, audit, and dynamic credentials.
  • Sealed Secrets — encrypted secrets can live in Git repositories and are decrypted only inside the cluster. GitOps-compatible.

HDFS encryption zones (Direct path)

When migrating a CDP estate that uses HDFS encryption zones with Key Trustee Server:

  1. Inventory keys with hadoop key list, and the zone-to-key mapping with hdfs crypto -listZones. Raw key material cannot be extracted through the CLI; prepare a matching target key set in the target KMS or HSM.
  2. Store target-side keys in Vault, a cloud KMS, or a hardware security module (HSM).
  3. Configure object storage SSE-KMS (or SSE-S3) with the target key material; the on-the-wire migration re-encrypts data during DistCp.
  4. DistCp reads from the encrypted HDFS (transparent decryption via the HDFS client with a delegation token) and writes to object storage with SSE-KMS or SSE-S3 enabled.

TLS CA bootstrap

Bifrost relies on cert-manager for TLS certificate issuance on the target platform. Two provisioning patterns are supported:

  • Internet-reachable deployments. cert-manager is installed with a ClusterIssuer pointing at Let's Encrypt. Certificates are issued automatically for every hostname under global.domain.

  • Air-gapped or private deployments. cert-manager is installed with a ClusterIssuer bound to the customer's internal PKI. Two artifacts are provisioned separately:

    • CA root (trust anchor) — imported into every platform service's trust store during land, so services accept certificates chained to the customer PKI.
    • Issuing CA (intermediate) — bound to cert-manager through a CAIssuer or a Vault PKI engine (vault issuer type authenticating via approle or Kubernetes auth). cert-manager issues leaf certificates for each platform hostname by signing against the intermediate.

    Customers with an existing enterprise PKI typically publish the intermediate CA to cert-manager while keeping the offline root on a protected HSM or offline host.

HSM and KMS integration

Customer-managed keys are supported through two complementary paths, depending on whether the resource is a secret (read by the cluster) or a cryptographic key (operations stay in the key service).

  • Azure Key Vault — exposed through the External Secrets Operator (ESO) Azure Key Vault provider. ESO reads secrets or certificates from the vault and materializes them as Kubernetes Secrets for consumption by platform services.
  • AWS KMS and Google Cloud KMS — these are cryptographic-key services, not secret stores; they are not consumed through ESO. Platform services integrate directly via IAM roles (AWS IRSA) or Workload Identity (GCP), and cryptographic operations (encrypt, decrypt, sign) stay inside the key service.
  • HSM-backed keys — surfaced through HashiCorp Vault's Transit or PKCS#11 secret engine. Private key material never leaves the HSM; Vault proxies cryptographic operations on behalf of authorized clients. Vault tokens and dynamic credentials for Vault access can still be synced through ESO.
  • Rotation — driven by Vault policies or key-service-native rotation schedules. Bifrost does not cache raw key material.

For deployments without an existing KMS, Bifrost's Vault-backed secret store is sufficient for SSE-S3 key storage but is not FIPS 140-2 validated. Customers with regulatory FIPS requirements should provision a validated HSM before running bifrost modernize land.

Regulatory Compliance

Moving off a commercial Hadoop distribution has regulatory implications, particularly under EU frameworks such as DORA (Digital Operational Resilience Act, Regulation (EU) 2022/2554) and NIS2 (Directive (EU) 2022/2555). Bifrost and Ilum Enterprise are designed to align with those frameworks; this section maps each relevant obligation to the Bifrost control that addresses it.

note

The mappings below describe how Bifrost's capabilities support compliance activities. They are not a substitute for a formal compliance assessment by the customer's legal and risk teams. Customers remain responsible for evidencing compliance with their supervisory authority.

DORA Article 6 — ICT Risk Management Framework

Bifrost's phased approach with automated validation and rollback at every gate directly supports the ICT risk-management requirements of Article 6. Every migration decision is logged, version-controlled in Git, and auditable:

  • Inventories, migration plans, and per-phase decision verdicts (PROCEED / WARN / ABORT) are persisted as version-controlled artifacts.
  • The decision engine's gates act as documented risk controls; bypassing a gate requires an explicit operator action that is captured in the log.
  • Structured JSON logs (/var/log/bifrost/bifrost.log) and per-run transcripts provide the audit trail a risk function needs for post-migration review.

DORA Article 11 — Response and Recovery

Article 11 is titled "Response and recovery" in the published regulation and requires a comprehensive ICT business continuity policy, response and recovery plans, business impact analysis, and tested switchovers to redundant capacity or backups. The Classic path provides recovery artifacts that evidence these controls:

  • Recovery capability (implemented as rollback in Bifrost's Classic path) remains available for the entire program, up to and including a 5-business-day soak after validation ("the rollback window, not a recommendation" — see Validation and rollback).
  • Source-distribution packages are cached locally on every node, enabling a 2-4 hour restoration depending on the stage (package swap, services started, or post-validation soak).
  • NameNode metadata is backed up with LVM snapshots; HMS and policy databases are backed up with pg_dump.
  • Modernize and Direct paths preserve business continuity through the dual-read bridge: the legacy environment continues to serve traffic until each service is explicitly decommissioned after a silence period.

DORA Article 9(4)(e) — ICT Change Management (and RTS 2024/1774 Article 17)

ICT change management in DORA sits in Article 9 ("Protection and prevention"), paragraph 4(e), which requires documented policies, procedures, and controls for ICT change management — including changes to software, hardware, firmware, systems, and security parameters — based on a risk assessment approach. The operational detail is further specified by Commission Delegated Regulation (EU) 2024/1774, Article 17, which is titled "ICT change management". Bifrost playbooks are themselves change-management artifacts aligned with those controls:

  • Every migration operation is code-reviewed, tested, and approved through version-control workflows before execution.
  • The notification engine creates change tickets automatically through BMC Helix, ServiceNow, or Jira Service Management on every phase_start / phase_complete / gate_decision event.
  • Every action is traceable from the ticket through the structured log to the exact inventory version that drove it.

DORA Article 25 — Testing of ICT Tools and Systems

Article 25 (inside Chapter IV on digital operational resilience testing) requires financial entities to test the ICT tools and systems supporting critical or important functions. Bifrost's validation framework produces artifacts that can be used as evidence of such testing:

  • Row-count parity, hash-tree data-diff, query-parity, and schema-comparison reports for every migrated table.
  • Pre-flight and post-migration validation verdicts from the decision engine (PROCEED / WARN / ABORT).
  • TeraSort and TestDFSIO benchmark comparisons against a pre-migration baseline.

DORA Article 28 — Third-Party ICT Risk

Migrating off a commercial distribution reduces single-vendor concentration risk, which the Article 28(2) ICT third-party strategy must address:

  • Open-source Hadoop (Classic path) and the Ilum Kubernetes lakehouse (Modernize/Direct paths) reduce single-vendor concentration risk compared with a commercial distribution. Ilum Enterprise remains a third party in its own right and is governed under the same Article 28 strategy.
  • Post-migration support is available through Ilum Enterprise (24/7/365 SLA, CVE patching, audit-support documentation for DORA and GDPR) without the vendor lock-in of a commercial parent.
  • Customers retain the option to exit the Ilum Enterprise agreement and run the open-source components directly. The exit path is not obstructed by proprietary data formats or catalog implementations, and supports the documented exit strategy required by Article 28(8) for ICT services supporting critical or important functions.

NIS2 Article 21(2)(e) — Vulnerability Handling

Ilum Enterprise includes a security-patch SLA that supports NIS2 Article 21(2)(b) (incident handling) and Article 21(2)(e) (security in acquisition, development, and maintenance of network and information systems, including vulnerability handling and disclosure) obligations. Patch management is an implementation control for these obligations, not a separately named NIS2 requirement:

  • Ilum Enterprise's internal SLA delivers patches for CVE vulnerabilities at CVSS 7.0 or higher within defined timeframes. NIS2 does not prescribe a CVSS threshold; the 7.0 floor is an Ilum Enterprise commitment. Contact Ilum Enterprise support for the agreement-specific terms.
  • Patched packages are published to a private, GPG-signed repository; operators verify signatures during the package-swap phase. The signed-repository provenance gives customers verifiable evidence of Ilum-sourced binary integrity, which supports (but does not replace) the supplier-assessment work that NIS2 Article 21(3) requires entities to perform against their direct suppliers.
  • The patch path does not depend on a commercial parent's release cadence or paywalled advisories.

NIST Cybersecurity Framework 2.0 — Recover and Respond

For US federal agencies, FedRAMP customers, and any organisation that uses NIST CSF 2.0 as its common resilience language, Bifrost maps to the Recover and Respond functions:

  • RC.RP-02 (recovery actions performed) and RC.RP-03 (backups verified before restoration) — bifrost classic rollback validates against the LVM snapshot and pg_dump captured in Phase 3 Backup; for steady-state Modernize and Direct environments, the component backup matrix (pg_dump via CronJob, mongodump, etcdctl snapshot, Velero) is the verified restore source.
  • RC.RP-05 (integrity of restored assets verified) — row-count parity, hash-tree data-diff, and schema-comparison run automatically after a restore.
  • RC.CO-03 (recovery progress communicated to stakeholders) — per-run transcripts plus Slack and IT service management notifications fired on every phase_* event.
  • RS.MI (mitigation of incidents) — the decision engine's ABORT verdict triggers the automated rollback path without waiting for human intervention.

HIPAA Security Rule — US healthcare

When the migrated estate contains electronic Protected Health Information (ePHI), Bifrost's operational controls map to the HIPAA Security Rule at 45 CFR Part 164 Subpart C:

  • 45 CFR 164.308(a)(7)(ii)(A) — Data Backup Plan. LVM snapshots of NameNode metadata (Classic path, pre-finalize) plus the steady-state component backup matrix (pg_dump via CronJob for every platform Postgres, mongodump for Ilum MongoDB, etcdctl snapshot for Kubernetes etcd, Velero for Kubernetes resources).
  • 45 CFR 164.308(a)(7)(ii)(B) — Disaster Recovery Plan. 5-business-day rollback window with documented recovery procedures per stage (see the Classic rollback table); for Modernize and Direct, the dual-read bridge preserves legacy read paths during the program as a recovery fallback until each service is explicitly decommissioned.
  • 45 CFR 164.312(b) — Audit Controls. Structured JSON audit logs and per-run transcripts retain every decision verdict, command execution, and gate outcome.

Emergency Mode Operation under 45 CFR 164.308(a)(7)(ii)(C) — continuing to protect ePHI during a declared emergency — remains a customer-side control; Bifrost does not address it.

SOC 2 Type II — common trust service criteria

For service organisations that undergo AICPA SOC 2 Type II audits, Bifrost produces artifacts that evidence the Common Criteria and Availability TSCs operating effectively over a reporting period:

  • CC7.5 (recovery from identified incidents) — automated rollback and the per-run transcript of the revert.
  • CC8.1 (change management) — version-controlled playbooks and inventories, code review, decision-engine gates before destructive operations, plus the data-diff, query-parity, and schema-comparison reports produced by bifrost modernize validate as change-impact evidence.
  • A1.2 (backup and recovery infrastructure) — the component backup matrix (pg_dump via CronJob, mongodump, etcdctl snapshot, Velero for Kubernetes resources); LVM snapshots apply to the Classic path's NameNode metadata specifically.
  • A1.3 (availability testing — recovery-plan testing) — the post-rollback validation run (see Validation and rollback), which re-runs the decision engine against the restored state and therefore exercises the recovery procedure end-to-end.

ISO/IEC 27001:2022 — Annex A controls

For ISMS-certified organisations, the relevant Annex A controls map directly to Bifrost's artifact set:

  • A.5.24 (information security incident management planning and preparation) — per-phase rollback matrix and post-rollback validation procedure.
  • A.8.13 (information backup) — the component backup matrix (pg_dump via CronJob, mongodump, etcdctl snapshot, Velero for Kubernetes state); LVM snapshots apply specifically to the Classic path's NameNode metadata.
  • A.8.15 (logging) and A.8.16 (monitoring activities) — structured JSON audit logs, per-run transcripts, and the pre-built Grafana dashboards deployed by bifrost modernize land.
  • A.8.32 (change management) — version-controlled playbooks and CRD-tracked migration state.

UK and APAC operational resilience

UK and Australian financial-services customers operate under dedicated operational-resilience regimes. Bifrost's rollback window and phased execution satisfy the same evidence needs that DORA Article 11 does in the EU.

  • FCA SYSC 15A.2 (UK operational resilience, deadline 31 March 2025). Firms must identify important business services, set impact tolerances, map resources, and test. Bifrost's 5-business-day rollback window and automated revert prevent a migration from breaching an impact tolerance; JSON audit logs and per-run transcripts provide self-assessment evidence. The equivalent PRA rules (SS1/21) apply to dual-regulated firms.
  • APRA CPS 230 (Australian operational risk management, effective 1 July 2025) — paragraphs 34-36 (critical-operations register, BCP maintenance, BCP activation) and paragraph 39 (BCP testing). Bifrost maps to ¶35 via the 5-day rollback window plus the component backup matrix as recovery artifacts, ¶36 (BCP activation) via the decision-engine-driven ABORT and revert executed during a migration, and ¶39 via the per-run transcripts kept for lessons-learned reviews. Maintenance of the critical-operations register under ¶34 remains a customer-side control; Bifrost produces supporting artifacts but not the register itself.

Saudi Arabia — SAMA Cyber Security Framework and NCA ECC-2:2024

For Saudi banking customers and government / critical-systems operators, Bifrost maps to both the Saudi Central Bank (SAMA) and National Cybersecurity Authority (NCA) frameworks:

  • SAMA Cyber Security Framework v1.0 — subdomain 3.3.5 (Backup and Recovery) maps to the component backup matrix (pg_dump via CronJob, mongodump, etcdctl snapshot, Velero) and, for Classic migrations, the 5-day rollback window with LVM snapshots of NameNode metadata; 3.3.9 (Change Management) maps to version-controlled playbooks plus the decision-engine gates; 3.3.17 (Vulnerability Management) maps to the GPG-signed CVE repository and CVSS-tiered SLA; 3.4 (Third-Party Cyber Security) maps to structured JSON audit logs and the CRD-tracked migration state.
  • NCA Essential Cybersecurity Controls (ECC-2:2024) — four-domain structure (Cybersecurity Governance, Cybersecurity Defence, Cybersecurity Resilience, Third-Party and Cloud Computing Cybersecurity). Change, backup, logging, and BCM subdomains within Defence and Resilience map to the same Bifrost artifacts; the Third-Party and Cloud Computing domain is evidenced by the CRD-tracked migration state plus audit logs.
  • NCA Critical Systems Cybersecurity Controls (CSCC-1:2019) apply on top of ECC-2 for critical-systems operators. Bifrost evidences a subset (encryption in transit and at rest, BCM controls, enhanced audit logging), but full CSCC-1 coverage typically requires a customer-provided FIPS 140-2-validated HSM (see HSM and KMS integration) and additional segregation controls that sit outside Bifrost's scope.
  • Saudi PDPL (in force 14 September 2023) — Article 19 controller safeguards and Implementing Regulations Article 23 technical and organisational measures (aligned with NCA measures) are evidenced by encryption, OPA authorization, and audit logs.

United Arab Emirates — CBUAE, TDRA IAS, DFSA, FSRA

UAE customers are covered through three layers:

  • CBUAE Operational Risk Standards — Article 12 (Technology and Specific Risk Management) — change management, baseline security, documented configurations, and emergency-change record-keeping map to decision-engine gates, the 5-day rollback window, and structured JSON audit logs.
  • TDRA UAE Information Assurance Regulation v1.1 (March 2020) — Technical control T5 (Communications and Operations Management — Backup) maps to the component backup matrix (pg_dump via CronJob, mongodump, etcdctl snapshot, Velero for Kubernetes resources); T7 (Information Systems Acquisition, Development and Maintenance) maps to decision-engine gates and version-controlled playbooks; M5 (Performance Management) plus M6 (Performance Evaluation and Improvement) map to structured JSON audit logs and the per-run transcripts that evidence control effectiveness over time.
  • DFSA GEN 5.3 / FSRA (ADGM) GEN 3 and the FSRA IT Risk Management Guidance (2024) — systems-and-controls obligations for DIFC- and ADGM-licensed firms are evidenced by the same rollback, validation, and audit-log artifact set.
  • UAE PDPL (Federal Decree-Law 45/2021), Article 20 — technical and organisational measures aligned with international best practice, evidenced by encryption at rest / in transit and Vault / HSM integration.

Indonesia — OJK POJK 11/POJK.03/2022 and UU PDP

For Indonesian commercial banks and processors of personal data:

  • OJK POJK 11/POJK.03/2022 (successor to POJK 38/2016) — articles on IT governance, IT risk management, data management, cybersecurity, IT outsourcing, and audit trails map to decision-engine gates, the component backup matrix (pg_dump via CronJob, mongodump, etcdctl snapshot, Velero), the Classic 5-day rollback window with LVM-snapshot NameNode assets, structured JSON audit logs, and data-diff validation.
  • UU PDP Law 27/2022 (effective 17 October 2024) — Article 35 risk-proportionate technical and operational security measures, and Article 51 processor obligations (which extend Articles 29, 31, and 35-39 to processors), are evidenced by encryption, OPA authorization, and Keycloak OIDC client inventory.
  • PP 71/2019 and BSSN Regulation 8/2020 — Strategic Electronic System Operator requirements including data-centre obligations and 72-hour breach notification are supported by the same backup, encryption, and audit-log artifacts.

Further frameworks

The same control set maps to these additional frameworks with only minor citation changes. Request the full mapping from Ilum Enterprise support.

  • OSFI Guideline B-13 (Canada, effective 1 January 2024) — Domain 2 sections 2.5 (change), 2.6 (incident and problem), 2.7 (disaster recovery).
  • FINMA Circular 2023/1 (Switzerland, effective 1 January 2024) — Chapter VIII (ICT risk management) and Chapter IX (operational resilience).
  • MAS TRM Guidelines (Singapore, January 2021) — Sections 6, 7, 9, and 11.
  • HKMA SPM TM-G-1 (Hong Kong) — technology operations and recovery controls.
  • APRA CPS 234 (Australia) — paragraphs 15-17 on information-security controls, evidenced by encryption at rest / in transit, OIDC clients, OPA policies, and Vault / HSM integration.
  • RBI Cybersecurity Framework for Banks (India, 2016) — Annex 1 baseline controls: change management, audit trails, vulnerability management.
  • GDPR Article 32 (EU) — security of processing, specifically 32(1)(b) resilience, 32(1)(c) restore availability after an incident, and 32(1)(d) regular testing.
  • CBO Cyber Security and Resilience Regulatory Framework (Oman, issued September 2023, compliance July 2024) — Technology and Operations plus Third-Party Supply Chain Management domains. Customers previously operating under CBO circular BM-1161 should confirm applicable references with CBO.
  • QCB Technology Risks Circular plus Information and Cyber Security Regulation for PSPs (Qatar) and NIA Policy v2.0 / Standard v2.1 for Qatari government and critical-service operators.
  • CBB Rulebook Volume 1 — OM-2 and OM-7 (Bahrain) — operational risk management including IT security and business continuity.
  • CBK Cyber and Operational Resilience Framework (CORF, Kuwait, issued 3 December 2025) — three baselines: Cyber Resilience, Operational Resilience, and Third-Party Risk Management. First audits run on CBK-defined schedules; Bifrost artifacts map to the change, recovery, and third-party controls within those baselines.
  • BDDK BSEBY Regulation on Information Systems of Banks (Turkey, OG 31069, 15 March 2020) and KVKK Law 6698, Article 12 — electronic-banking security and data-controller technical measures.
  • CBE Egyptian Financial Cybersecurity Framework (2020) plus CBE Resilience Directives (2024) — NIST-aligned mapping across backup, change, vulnerability, resilience, and third-party.
  • Regional PDPLs — Oman Royal Decree 6/2022 (Article 11), Bahrain PDPL Law 30/2018 with Order 43/2022 — all map to the same encryption and audit-log artifacts as GDPR Article 32.

EU Cyber Resilience Act (Regulation (EU) 2024/2847)

CRA obligations for secure-by-design software, vulnerability handling (Annex I Part II), and active-exploitation reporting (Article 13, effective September 2026 with full obligations December 2027) fall on Ilum Labs as the software manufacturer, not on Bifrost customers. The GPG-signed private repository and CVE SLA described under NIS2 above are the same controls that evidence CRA compliance on Ilum's side.

Evidencing compliance

For each obligation above, Bifrost produces an auditable artifact:

| Control | Artifact |
| --- | --- |
| Phase-gate verdicts | /var/log/bifrost/bifrost.log (structured JSON) and per-run transcripts |
| Change-management ticketing | IT service management integration payloads, notification_config.yml |
| Rollback capability | bifrost classic rollback command log plus post-rollback validation report |
| Patch provenance | GPG-signed package repository manifest and signature verification log |
| Data-diff validation | bifrost modernize validate report per migrated table |
| CRD lifecycle | kubectl describe tablemigration <name> status history |
| Incident response and recovery execution | Post-rollback validation report plus per-run transcripts (NIST CSF RC.RP, ISO A.5.24) |
| Recovery-plan testing | Post-rollback validation run (decision engine re-executed against restored state) (SOC 2 A1.3) |
| Change-impact validation | Data-diff, query-parity, and schema-comparison reports (SOC 2 CC8.1) |
| Information security controls baseline | Encryption configuration export, OIDC client inventory, OPA policy bundle (APRA CPS 234, ISO A.8) |

Customers doing compliance work should retain these artifacts for the duration required by their supervisory authority.

Capacity Planning

Sizing guidance for the three most common deployment scales.

Object storage sizing

Raw capacity formula:

For triple replication:
Raw = Usable x 3 / 0.80 (target 80 % maximum utilization)

For erasure coding EC(4+2):
Raw = Usable x 1.5 / 0.80
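The two formulas above can be wrapped in a small helper for sizing worksheets. The overhead factors (3.0 for triple replication, 1.5 for EC(4+2)) and the 0.80 utilization ceiling come directly from the formulas; the function name and interface are illustrative.

```python
def raw_capacity_tb(usable_tb: float, scheme: str = "ec42",
                    ceiling: float = 0.80) -> float:
    """Raw capacity required for a usable target at the 80 % planning cap.

    scheme: "replica3" (3x overhead) or "ec42" (1.5x overhead), per the
    sizing formulas above.
    """
    overhead = {"replica3": 3.0, "ec42": 1.5}[scheme]
    return usable_tb * overhead / ceiling

print(raw_capacity_tb(100, "replica3"))  # ~375 TB raw
print(raw_capacity_tb(100, "ec42"))      # ~187.5 TB raw
```

Note that EC(4+2) halves the raw requirement relative to triple replication for the same usable target, at the cost of higher CPU during reads under recovery.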

Pool utilization thresholds (applied consistently across sizing, alerting, and the readiness checklist):

| State | Threshold | Action |
| --- | --- | --- |
| Target operating ceiling | 80 % | Sizing formulas use this as the planning cap |
| Pre-migration headroom | 70 % | Production readiness checklist demands this before the first migration wave to leave room for incoming data |
| Alert | 80 % | CephPoolNearFull fires |
| Critical degradation | 85 % | Performance degrades sharply; plan capacity expansion before crossing this |

Per-OSD resources:

| Drive type | CPU per OSD | RAM per OSD | Notes |
|---|---|---|---|
| HDD | 0.5 to 1 core | 4 to 5 GB | Minimum, adequate for sequential workloads. |
| SSD / NVMe | 2 cores | 5 to 8 GB | Higher CPU for increased IOPS. |

Network guidance: total node network throughput should exceed the aggregate OSD throughput on that node. A node with 8 HDDs reading at 200 MB/s each needs at least 1.6 GB/s of read bandwidth; with triple replication, writes generate roughly three times that volume in cluster traffic. Bonded 25 GbE is the minimum for HDD-heavy nodes; 100 GbE is recommended for NVMe-heavy nodes.
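
A quick sanity check of that arithmetic. The drive count, per-drive throughput, and replication factor are the assumed inputs; the helper name is illustrative.

```python
def node_bandwidth_gbs(drives: int, mbps_per_drive: float, replication: int = 1) -> float:
    """Aggregate drive throughput for one node in GB/s, scaled by a replication factor."""
    return drives * mbps_per_drive * replication / 1000

read_gbs = node_bandwidth_gbs(8, 200)                  # 8 HDDs x 200 MB/s = 1.6 GB/s of reads
write_gbs = node_bandwidth_gbs(8, 200, replication=3)  # triple replication: 4.8 GB/s of cluster write traffic
print(read_gbs, write_gbs)
```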

Reference tiers:

| Tier | Usable | OSDs | Nodes | Drive | Per-node spec | Pool |
|---|---|---|---|---|---|---|
| Small | 100 TB | 16 | 4 | 8 TB NVMe | 32 cores, 128 GB RAM, 2x25 GbE | EC(4+2) |
| Medium | 500 TB | 48 | 6 | 16 TB NVMe | 48 cores, 256 GB RAM, 2x25 GbE | EC(4+2) |
| Large | 1 PB | 96 | 8 | 16 TB NVMe | 64 cores, 384 GB RAM, 100 GbE | EC(4+2) |

Trino sizing

| Workload profile | Workers | Per-worker spec |
|---|---|---|
| Light (BI dashboards, fewer than 20 concurrent users) | 3 | 8 vCPU, 32 GB, 16 GB JVM |
| Medium (BI + dbt, 20 to 100 concurrent users) | 6 | 16 vCPU, 64 GB, 32 GB JVM |
| Heavy (BI + dbt + ad-hoc, 100+ concurrent users) | 12+ | 16 vCPU, 64 GB, 32 GB JVM |

Spark on Kubernetes sizing

Spark executor sizing translates directly from YARN sizing. The Kubernetes cluster needs enough total capacity for peak concurrent Spark executor pods plus platform services.

Rule of thumb:

K8s worker capacity >= peak YARN cluster capacity x 1.2

The 20 % headroom accommodates Kubernetes overhead (kubelet, container runtime, system daemons) and platform components.
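
The rule of thumb as a one-line check; the peak YARN capacity is an assumed input and the function name is illustrative.

```python
def min_k8s_capacity(peak_yarn_units: float, headroom: float = 0.20) -> float:
    """Minimum K8s worker capacity: peak YARN cluster capacity plus 20% headroom."""
    return peak_yarn_units * (1 + headroom)

# A 1,000-vCPU peak YARN cluster needs roughly 1,200 vCPU of K8s worker capacity.
print(min_k8s_capacity(1000))
```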

Object storage gateway sizing

Each gateway instance sustains roughly 500 concurrent S3A connections. Size for Spark and Trino parallel reads as:

RGW instances >= peak concurrent S3A connections / 500

Typical deployments run 2 to 4 gateway instances behind a Kubernetes Service for load balancing.
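
For instance, the instance count is a ceiling division. The 500-connection figure comes from the formula above; the floor of 2 instances (for availability behind the Service) is an assumption based on the typical 2-to-4 range, and the function name is illustrative.

```python
import math

def rgw_instance_count(peak_s3a_connections: int, conns_per_instance: int = 500) -> int:
    """Gateways needed for peak S3A concurrency, with a 2-instance availability floor."""
    return max(2, math.ceil(peak_s3a_connections / conns_per_instance))

print(rgw_instance_count(1800))  # 4 instances for 1,800 concurrent connections
print(rgw_instance_count(300))   # floor of 2 even at low concurrency
```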

Platform Upgrades

Trino upgrade

Rolling restart behind the Trino Gateway:

  1. Update Helm values with the new Trino image version.
  2. helm upgrade the ETL cluster first (lower risk, fewer concurrent queries).
  3. Validate with the benchmark query suite against the ETL cluster.
  4. helm upgrade the interactive cluster.
  5. The gateway routes traffic seamlessly throughout.

Object storage upgrade

Sequential: operator first, then the cluster itself:

  1. helm upgrade the object storage operator to the new version.
  2. Wait for the operator pod to stabilize.
  3. Update the cluster CRD image to the new version.
  4. The operator orchestrates a rolling OSD restart, waiting for cluster health to return to OK between restarts.
  5. Monitor cluster status throughout the rolling restart.

Secret rotation during long migrations

Modernize and Direct engagements often run for several months. Long-lived secrets must be rotated on a schedule without disrupting in-flight migrations.

| Secret | Rotation path | Impact |
|---|---|---|
| Trino OAuth2 client secret (to Polaris) | Update via External Secrets Operator; rolling restart of Trino coordinator | Active queries complete; new queries re-authenticate. |
| Polaris OAuth2 client secret (to Keycloak) | Rotate in Keycloak admin, sync via ESO | Brief re-authentication latency on first catalog call. |
| Airflow spark_default and service connections | Rotate via Airflow CLI connections command or ESO-backed connection source | Running DAGs continue on cached credentials; new DAG runs pick up the rotated secret. |
| Kerberos keytabs (source HDFS) | bifrost modernize rotate-keytabs --source-cluster <name> refreshes the controller-side keytab cache and re-authenticates in-flight DistCp runs | Minimal; DistCp handles keytab refresh. |
| Object storage static S3 keys | Rotate via Vault dynamic secret mount or direct replacement; restart Trino and Spark jobs to pick up the new key | Running jobs complete with the old key; new jobs use the new key. |
| Superset admin password | Rotate via Superset UI or ESO | No production impact. |

The Classic rotate-keytabs and renew-certificates day-2 commands apply to the source cluster during Classic migrations; for Modernize and Direct, target-platform secret rotation is driven through Vault and the External Secrets Operator.

Airflow upgrade

Use the Airflow Helm chart's built-in database migration:

  1. Scale down Airflow workers and the scheduler.
  2. Run airflow db migrate (handled by the chart's init job).
  3. Scale back up with the new image version.
  4. Validate that DAGs load correctly.

Ilum upgrade

Use standard Ilum upgrade procedures. See Ilum upgrade notes.

Multi-Tenancy

The target platform supports multi-tenant deployments, mapping the YARN queue model onto Kubernetes-native primitives.

Queue to namespace mapping

YARN queue hierarchy          Kubernetes equivalent
================================================
root                          YuniKorn root queue
  production                  YuniKorn queue: root.production
    etl                       Namespace: lakehouse-prod-etl
    streaming                 Namespace: lakehouse-prod-streaming
  development                 YuniKorn queue: root.development
    team-analytics            Namespace: lakehouse-dev-analytics
    team-risk                 Namespace: lakehouse-dev-risk

Tenant isolation per component

| Component | Isolation mechanism |
|---|---|
| Scheduler | Hierarchical queues with min and max resource quotas per queue. |
| Kubernetes | Namespaces with ResourceQuota and LimitRange per namespace. |
| Catalog | Catalogs per tenant, or namespaces within a shared catalog with RBAC. |
| Trino | Resource groups per user or group, plus OPA policies per tenant. |
| Airflow | Separate DAG folders per team; KubernetesPodOperator runs in the team namespace. |
| Superset | Row-level security filters keyed on {{ current_user().group }}. |
| OPA | Policies reference JWT group claims for tenant-scoped authorization. |

Resource quota example

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-analytics-quota
  namespace: lakehouse-dev-analytics
spec:
  hard:
    requests.cpu: "100"
    requests.memory: "400Gi"
    limits.cpu: "200"
    limits.memory: "800Gi"
    persistentvolumeclaims: "50"
    pods: "200"

Production Readiness Checklist

Before migrating production tables to the target platform, every item below should be true. Bifrost's program-management guidance is to review this checklist formally before the first production wave.

Platform health
  • Object storage cluster in HEALTH_OK for 7 consecutive days.
  • Storage pool utilization below 70 %.
  • Catalog service accessible; catalog database backup verified.
  • Trino interactive and ETL clusters healthy; autoscaling tested.
  • Airflow scheduler healthy; a test DAG executes successfully.
  • OpenMetadata ingestion running; lineage visible for test tables.
  • OIDC flow tested end-to-end (browser to Trino query).
  • OPA policies loaded; test-query authorization verified.
Migration readiness
  • bifrost modernize discover completed; estate inventory reviewed.
  • Migration plan reviewed and approved by the data platform team.
  • Wave assignments reviewed by table owners.
  • At least 50 non-production tables migrated and validated successfully.
  • Data-diff validation passing at > 99.99 % for all test tables.
  • Query parity tested on 100+ representative queries (< 1.3x latency).
  • Rollback tested: migrate a table, validate, revert, validate again.
  • DistCp bandwidth tested; confirmed sufficient for the migration timeline.
  • Dual-read bridge operational; Trino table redirection working.
Operational readiness
  • Monitoring dashboards reviewed by the operations team.
  • Alerting rules configured and tested (fire a test alert end-to-end).
  • On-call schedule established for migration waves.
  • Runbook for common failure scenarios documented.
  • Communication channel created for migration coordination.
  • Rollback runbook printed and available during cutovers.
  • Backup and disaster-recovery procedures tested for every platform database.
  • Network connectivity verified: Kubernetes pods can reach legacy HDFS.

Next Steps