Secrets Management with HashiCorp Vault in Production
Secrets sprawl is one of those problems that compounds silently until it becomes a crisis. Database passwords hardcoded in environment variables. API keys committed to Git repositories years ago that still work. Shared service accounts whose credentials have not been rotated since the team that created them left the company. We have seen this pattern at nearly every organization we work with. HashiCorp Vault is our tool of choice for solving it, but deploying Vault in production is itself a complex undertaking that demands careful planning.
High Availability with Integrated Raft Storage
Vault supports multiple storage backends, but for most deployments we recommend the integrated Raft storage that became generally available in Vault 1.4. Raft eliminates the operational burden of managing a separate Consul cluster for HA, which was the traditional approach. With Raft, Vault nodes form their own consensus cluster, and the storage is co-located with the Vault process.
We deploy Vault as a 5-node cluster across three availability zones. Five nodes give us a quorum of three, meaning the cluster can tolerate two simultaneous node failures. The Vault pods run on dedicated Kubernetes nodes with taints to prevent other workloads from competing for resources. Each node gets a persistent volume backed by gp3 EBS volumes with 3000 IOPS provisioned, which is sufficient for most workloads.
# Helm values for Vault HA with Raft (simplified)
server:
  ha:
    enabled: true
    replicas: 5
    raft:
      enabled: true
      config: |
        storage "raft" {
          path = "/vault/data"
          retry_join {
            leader_api_addr = "https://vault-0.vault-internal:8200"
          }
          retry_join {
            leader_api_addr = "https://vault-1.vault-internal:8200"
          }
        }
  affinity: |
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: vault
          topologyKey: topology.kubernetes.io/zone
The pod anti-affinity rule is critical. Without it, Kubernetes might schedule multiple Vault pods in the same availability zone, which defeats the purpose of a 5-node cluster. We also set resource requests high enough (2 CPU, 4GB RAM) to ensure Vault pods are not evicted during node pressure events.
Kubernetes Authentication
The Kubernetes auth method is the most seamless way to authenticate workloads running in the same cluster. When a pod needs a secret, it presents its Kubernetes service account token to Vault. Vault validates the token against the Kubernetes API server and, if the service account matches a configured role, issues a Vault token with the appropriate policies.
We configure one Vault role per application, bound to a specific Kubernetes namespace and service account. This gives us fine-grained access control: the order-service in the production namespace can access the production database credentials, but the same service in the staging namespace cannot.
# Configure Kubernetes auth for order-service
vault write auth/kubernetes/role/order-service \
    bound_service_account_names=order-service \
    bound_service_account_namespaces=production \
    policies=order-service-production \
    ttl=1h
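The exchange itself can be exercised by hand from inside a pod. A sketch, assuming the role configured above and the standard projected location of the service account token:

```shell
# Read the pod's service account token from its standard mount path
JWT=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)

# Exchange it for a Vault token; Vault validates the JWT against the
# Kubernetes API and attaches the policies bound to the role
vault write auth/kubernetes/login \
    role=order-service \
    jwt="$JWT"
```

The response contains a client token carrying the order-service-production policy, scoped by the role's namespace and service account bindings.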
For applications that need to access Vault, we use the Vault Agent Injector, which runs as a sidecar container and automatically fetches secrets into a shared volume. The application reads secrets from a file rather than making API calls to Vault directly. This means applications do not need a Vault SDK dependency, and the secret lifecycle (fetching, caching, renewing) is handled entirely by the sidecar.
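The injector is driven entirely by pod annotations. A minimal sketch (the role name and secret path are illustrative, matching the examples in this article):

```yaml
# Pod template annotations that enable the Vault Agent Injector sidecar
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "order-service"
    # Renders the secret at database/creds/order-service-readonly
    # to /vault/secrets/db-creds inside the pod
    vault.hashicorp.com/agent-inject-secret-db-creds: "database/creds/order-service-readonly"
```

The application simply reads /vault/secrets/db-creds at startup; the sidecar keeps the file fresh as leases are renewed or re-issued.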
Dynamic Database Credentials
Static database credentials are one of the highest-risk secrets in any organization. They are long-lived, shared across environments, and rarely rotated. Vault's database secrets engine solves this by generating short-lived, unique credentials on demand. When an application requests database credentials, Vault creates a new database user with a TTL (we typically use 1 hour), and when the TTL expires, Vault automatically revokes the user.
We configure the database secrets engine with a connection to each database and a set of roles that define the SQL statements used to create and revoke users. For PostgreSQL, the creation statement grants specific schema permissions rather than broad database-level access.
# Configure dynamic PostgreSQL credentials
vault write database/config/orders-db \
    plugin_name=postgresql-database-plugin \
    connection_url="postgresql://{{username}}:{{password}}@orders-db.internal:5432/orders?sslmode=require" \
    allowed_roles="order-service-readonly,order-service-readwrite" \
    username="vault_admin" \
    password="initial-password"

vault write database/roles/order-service-readonly \
    db_name=orders-db \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; \
        GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
    revocation_statements="DROP ROLE IF EXISTS \"{{name}}\";" \
    default_ttl="1h" \
    max_ttl="24h"
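With the role in place, consumers obtain credentials from the creds endpoint. A sketch of the read path, plus the rotate-root step worth running right after configuration so the bootstrap password supplied above stops being valid:

```shell
# Rotate the bootstrap password so only Vault knows the vault_admin credential
vault write -f database/rotate-root/orders-db

# Generate a fresh, unique credential pair; Vault creates the database role
# and revokes it automatically when the lease's TTL expires
vault read database/creds/order-service-readonly

# Extend a lease before it expires (up to max_ttl), using the lease_id
# returned by the read above
vault lease renew database/creds/order-service-readonly/<lease_id>
```

The `<lease_id>` placeholder stands in for the value returned in the read response; in practice the Vault Agent sidecar handles renewal automatically.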
The operational benefit is significant. When a credential is compromised, the blast radius is limited to the TTL of that specific credential. There is no need for an emergency rotation across all services that share a password. And because every credential is unique to a specific application instance, audit logs can trace database access back to the exact pod that issued the query.
Secrets Rotation Strategy
For secrets that cannot be made dynamic (third-party API keys, encryption keys, shared secrets with partners), we implement a rotation strategy using Vault's versioned KV secrets engine. Each secret has a rotation schedule, and an automated pipeline reads the current version, generates a new value, updates the secret in Vault, and verifies that all consumers have picked up the new version.
The rotation pipeline runs as a Kubernetes CronJob. For API keys, it calls the provider's key rotation API, stores the new key in Vault, waits for all consuming pods to pick up the new value (detected by checking the Vault audit log for read events), and then deactivates the old key. For encryption keys, we use an envelope encryption pattern where the data encryption key is itself encrypted with a key-encryption key stored in Vault, so rotating the KEK does not require re-encrypting all data.
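The CronJob's flow can be sketched as a short script. This is a hedged outline, not our actual pipeline: the KV path, the `provider-cli` tool, and its subcommands are illustrative stand-ins for whatever the third-party provider exposes:

```shell
# Illustrative rotation flow for a partner API key stored in KV v2.
# "provider-cli" and the path secret/partners/acme are hypothetical.

# 1. Mint a new key with the provider's rotation API
NEW_KEY=$(provider-cli keys create --output key)

# 2. Write the new value as a new version in the versioned KV engine
vault kv put secret/partners/acme api_key="$NEW_KEY"

# 3. Wait until the Vault audit log shows every consumer has read
#    the new version (polling step elided here)

# 4. Only then deactivate the old key upstream
provider-cli keys revoke --key-id "$OLD_KEY_ID"
```

The ordering matters: the old key stays valid until every consumer has observed the new one, so rotation never causes an outage.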
Audit Logging
Vault's audit log records every request and response: who accessed which secret, when, and from where. We enable two audit backends simultaneously: a file backend that writes to a persistent volume (for local debugging) and a socket backend that streams to our centralized logging system (Elasticsearch). The dual-backend approach ensures that audit logs are never lost, even if one backend is temporarily unavailable.
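Enabling the two backends is a pair of one-line commands; the socket address below is a placeholder for your log collector:

```shell
# File audit device, writing to the persistent volume
vault audit enable file file_path=/vault/audit/vault-audit.log

# Socket audit device streaming to the central collector
# (address is illustrative)
vault audit enable -path=socket-es socket \
    address=logstash.logging.svc:9000 \
    socket_type=tcp
```

Note that Vault refuses to service requests if it cannot write to at least one enabled audit device, which is another reason to run two.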
We built custom Grafana dashboards that visualize audit log data. The dashboards show secret access patterns over time, which makes it easy to spot anomalies. If a service that normally reads its database credentials once per hour suddenly starts reading them every minute, that is worth investigating. We also alert on any access to high-sensitivity paths (like the root token path or the sys/seal endpoint) regardless of frequency.
Disaster Recovery
Vault is a critical dependency for every service that reads secrets from it. If Vault is unavailable, applications cannot start, cannot rotate credentials, and cannot access new secrets. Our DR strategy has three layers.
First, the Raft cluster provides resilience against node failures. With five nodes, we can lose two and maintain a quorum. Second, we take automated Raft snapshots every 6 hours and store them in an S3 bucket with cross-region replication. These snapshots can restore a complete Vault cluster from scratch in under 15 minutes. Third, we maintain a break-glass procedure: a sealed Vault instance in a secondary region that can be unsealed and promoted to primary if the entire primary region is lost.
We test the DR procedure quarterly by actually restoring from a snapshot to a fresh cluster and validating that all secrets are accessible. Untested backups are not backups, and the same principle applies to Vault disaster recovery. Each test is documented, and any issues found are resolved before the next production deployment.
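The snapshot and restore steps from the quarterly test reduce to a few commands; the bucket name and filenames are placeholders:

```shell
# Take a point-in-time Raft snapshot and ship it off-cluster
vault operator raft snapshot save /tmp/vault-$(date +%Y%m%d%H%M).snap
aws s3 cp /tmp/vault-*.snap s3://example-vault-snapshots/

# During the DR test: restore into a fresh, initialized cluster.
# Restoring a snapshot from a different cluster requires -force.
vault operator raft snapshot restore -force /tmp/vault-restore-test.snap
```

After the restore, the validation step walks every mounted secrets engine and confirms reads succeed before the test is signed off.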
Operational Lessons
- Auto-unseal is non-negotiable. Using AWS KMS for auto-unseal eliminates the need for operators to manually enter unseal keys after a restart. Without it, a rolling update of Vault pods requires human intervention for each pod.
- Monitor seal status as a critical alert. A sealed Vault is a dead Vault. Alert immediately and page on-call if any node enters a sealed state.
- Start with the Vault Agent Injector. It is far easier than integrating the Vault SDK into every application, and it decouples secret management from application code.
- Namespace everything. Vault namespaces provide isolation between teams. Each team gets their own namespace with their own policies, secrets engines, and audit logs.
- Rotate the root token. After initial setup, generate a new root token, use it to configure the cluster, then immediately revoke it. All ongoing operations should use role-specific tokens.
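The auto-unseal point in the first bullet corresponds to a seal stanza in the server configuration. A sketch assuming AWS KMS; the region and key alias are placeholders:

```hcl
# Server config stanza for KMS auto-unseal (values are placeholders)
seal "awskms" {
  region     = "us-east-1"
  kms_key_id = "alias/vault-unseal"
}
```

With this in place, restarted pods unseal themselves by decrypting the stored root key through KMS, so rolling updates need no human in the loop.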