HashiCorp Vault supports several seal types, and one day you may need to migrate from one to another.
This is exactly what happened to me, and here is how I migrated from the Oracle Cloud KMS (ocikms) seal to the Shamir seal.
Why do this migration? HashiCorp broke Vault for several cloud KMS seals.
My lab is deployed on Oracle Cloud, and I needed to upgrade to 1.12.0/1.12.1 to get a bug fix.
I deployed it and Vault didn't restart… 😱 So I ran it manually from the CLI and got the real error:
/usr/bin/vault server -config=/etc/vault.d/vault.hcl
Error parsing Seal configuration: 'key_id' not found for OCI KMS seal configuration
2022-10-13T04:07:07.570Z [INFO] proxy environment: http_proxy="" https_proxy="" no_proxy=""
The key_id is present in my configuration, so there is nothing to fix there: Vault itself is broken with the OCI KMS seal. I reported it on GitHub (seal OCI KMS don't find key_id), and someone had already reported another error with OCI KMS (Oracle KMS seal: "did not find a proper configuration for private key").
Investigating a little more, I found that OCI is not the only broken KMS.
So these Vault versions are broken for any deployment on AWS, GCP, or OCI (Oracle Cloud) that relies on the provider's KMS seal.
Migration
The migration has to be done with a version of Vault that still works with the KMS, so I rolled back to 1.11.4.
On each node where Vault is running (fortunately, it wasn't running in containers), I added disabled = "true" to the seal block in /etc/vault.d/vault.hcl:
seal "ocikms" {
crypto_endpoint = "https://xxxxxxxx-crypto.kms.sa-saopaulo-1.oraclecloud.com"
management_endpoint = "https://xxxxxxxx-management.kms.sa-saopaulo-1.oraclecloud.com"
key_id = "ocid1.key.oc1.sa-saopaulo-1.xxxxxxxx.yyyyyyyyyyyyyzzzzzzzzzzzzzz"
disabled = "true"
}
Don't restart Vault right away: the cluster needs to keep its quorum to complete this step. I also checked that my environment variables were set so that I could run the vault operator unseal commands.
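The environment check can be scripted. Below is a minimal sketch; the function name check_vault_env is hypothetical, and which variables to check is my assumption (VAULT_ADDR is the standard address variable of the vault CLI, the others depend on your setup).

```shell
# Return non-zero if any of the named environment variables is unset or empty.
# check_vault_env is a hypothetical helper, not part of the vault CLI.
check_vault_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_vault_env VAULT_ADDR || exit 1` at the top of a script stops before any Vault command is run against the wrong address.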
Identify all the standby nodes: the leader will be migrated last.
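One way to tell the standby nodes from the leader is the /v1/sys/health endpoint, whose HTTP status code reflects the node state by default. This is a sketch, not what I ran during the migration; the function names and the node addresses in the usage line are hypothetical.

```shell
# Map the default HTTP status code of GET /v1/sys/health to a node role.
vault_role_from_health_code() {
  case "$1" in
    200) echo "active" ;;
    429) echo "standby" ;;
    472) echo "dr-secondary" ;;
    473) echo "performance-standby" ;;
    501) echo "uninitialized" ;;
    503) echo "sealed" ;;
    *)   echo "unknown" ;;
  esac
}

# Print "address<TAB>role" for every node address given as an argument.
print_node_roles() {
  for addr in "$@"; do
    # Only the status code is kept; the response body is discarded.
    code=$(curl -s -o /dev/null -w '%{http_code}' "$addr/v1/sys/health")
    printf '%s\t%s\n' "$addr" "$(vault_role_from_health_code "$code")"
  done
}
```

Calling `print_node_roles https://vault-1:8200 https://vault-2:8200` (placeholder addresses) lists each node with its role, so the active node can be kept for the end.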
Now I restart Vault on one node. Yes, it's one node at a time, and with 5 nodes in my case the migration is very time-consuming.
Once Vault has restarted, the node is sealed, so run vault operator unseal -migrate once per key until the node is unsealed. When this node is unsealed, do the same on the next standby, and so on.
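The per-node unseal step can be wrapped in a small loop. This is a sketch under assumptions: it reads the Sealed row from the table printed by vault status, and the function names and the max_attempts limit are hypothetical helpers of mine, not Vault settings.

```shell
# Succeeds while the node is sealed (or its status cannot be read).
vault_is_sealed() {
  [ "$(vault status 2>/dev/null | awk '$1 == "Sealed" { print $2 }')" != "false" ]
}

# Ask for one key at a time until the node reports Sealed = false.
unseal_with_migrate() {
  max_attempts="${1:-10}"   # hypothetical safety limit
  attempt=0
  while vault_is_sealed; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "still sealed after $attempt attempts" >&2
      return 1
    fi
    # Prompts for one key on the terminal.
    vault operator unseal -migrate || return 1
    attempt=$((attempt + 1))
  done
  echo "unsealed after $attempt key share(s)"
}
```

Run `unseal_with_migrate` on the node that was just restarted, and move on to the next standby only when it reports that the node is unsealed.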
Then it's time to do the same on the active node, and at that point the migration is done.
When everything is done, don't forget to comment out or remove the seal block in /etc/vault.d/vault.hcl.
Watch out for a strange error
Afterwards, I tried to add a 6th node and got this error: aead is not configured in the seal
To resolve it, I had to rotate the underlying encryption key:
vault operator rotate
It's done without downtime.
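To confirm that the rotation took effect, the key term reported by vault operator key-status can be compared before and after. This is a sketch: the function names are hypothetical, and it assumes the table output contains a "Key Term" row.

```shell
# Print the current key term from the table output of `vault operator key-status`.
key_term() {
  vault operator key-status | awk '$1 == "Key" && $2 == "Term" { print $3 }'
}

# Rotate the encryption key and check that the key term increased.
rotate_and_verify() {
  before=$(key_term)
  [ -n "$before" ] || { echo "cannot read key term" >&2; return 1; }
  vault operator rotate >/dev/null || return 1
  after=$(key_term)
  if [ "$after" -gt "$before" ]; then
    echo "key term: $before -> $after"
  else
    echo "key term unchanged: $before" >&2
    return 1
  fi
}
```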
Finally, I chose to generate a new set of unseal keys, because the existing ones dated from when I installed Vault (a long time ago):
vault operator rekey -init -key-shares=5 -key-threshold=3
Then submit the current unseal keys, one at a time, with
vault operator rekey
Conclusion
When you manage HashiCorp Vault, you need to know these procedures and be able to run them without causing significant downtime.
This issue still exists, so you have to choose between:
staying on 1.11.4
migrating to another seal and updating to the latest version
I don't know what is happening at HashiCorp, but they're breaking entire deployments, whether through bad dependency management or a newly introduced bug. In any case, we won't sleep well with this situation.