Deploy Lock
This is a tool to lock a cluster or service, in order to prevent people from deploying changes during test automation or restarting pods during an infrastructure incident.
Abstract
Example Usage
Prevent a deploy during an automation run
This would be used to prevent an application deploy during a test automation run, to make sure the application does not restart or change versions and invalidate the test results.
- QA starts an automation run
  - Automation calls `deploy-lock lock apps/acceptance --type automation --duration 90m`
- Someone merges code into `develop` of `saas-app`
  - The `saas-app` pipeline runs a deploy job
  - The deploy job calls `deploy-lock check apps/acceptance/a/saas-app/develop`, which recursively checks:
    - `apps`
    - `apps/acceptance`
      - locked by automation, exit with an error
    - `apps/acceptance/a`
    - `apps/acceptance/a/saas-app`
    - `apps/acceptance/a/saas-app/develop`
  - Deploy job exits with an error, does not deploy
- Automation pipeline ends
  - Final job calls `deploy-lock unlock apps/acceptance --type automation`
    - Specifying the `--type` during `unlock` prevents automation/deploy jobs from accidentally removing an incident lock
  - If the final automation job does not run, the lock will still expire after 90 minutes (`--duration`)
- Retry `saas-app` deploy job
  - No lock, runs normally
Prevent a deploy during a production incident
This would be used to prevent an application deploy during an infrastructure outage, to make sure existing pods continue running.
- DevOps receives an alert and declares an incident for the `apps/production/a` cluster
- The first responder runs `deploy-lock lock apps/production --type incident --duration 6h`
  - This locks both production clusters while we shift traffic to the working one
- Someone merges code into `main` of `auth-app`
  - The `auth-app` pipeline runs a deploy job
  - The deploy job calls `deploy-lock check apps/production/a/auth-app`, which recursively checks:
    - `apps`
    - `apps/production`
      - locked by incident, exit with an error
    - `apps/production/a`
    - `apps/production/a/auth-app`
  - Deploy job exits with an error, does not deploy
- Incident is resolved
  - First responder runs `deploy-lock unlock apps/production --type incident`
- Retry `auth-app` deploy job
  - No lock, runs normally
Prevent duplicate deploys of the same service from conflicting
This would be used to prevent multiple simultaneous deploys of the same project from conflicting with one another, in a service without ephemeral environments/branch switching.
- Someone starts a pipeline on `feature/foo` of `chat-app`
  - The `chat-app` pipeline runs a deploy job
  - The deploy job calls `deploy-lock lock apps/staging/a/chat-app`
- Someone else starts a pipeline on `feature/bar` of `chat-app`
  - The `chat-app` pipeline runs another deploy job
    - The first one has not finished and is still mid-rollout
  - The second deploy job calls `deploy-lock lock apps/staging/a/chat-app`, which recursively checks:
    - `apps`
    - `apps/staging`
    - `apps/staging/a`
    - `apps/staging/a/chat-app`
      - locked by deploy, exit with an error
    - `lock` implies `check`
  - Second deploy job fails with an error, does not deploy
- First deploy succeeds
  - Deploy job calls `deploy-lock unlock apps/staging/a/chat-app`
- Second deploy job can be retried
  - No lock, runs normally
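The "`lock` implies `check`" behaviour in the scenario above can be sketched roughly as follows. This is a minimal illustration, not the tool's implementation: `HeldLock`, `heldLocks`, `blockedBy`, `tryLock`, and `unlock` are hypothetical names, the store is a plain in-memory `Map`, and treating an expired lock as absent is an assumption based on the expiry note in the first example.

```typescript
// Hypothetical shape of a stored lock; only the fields this sketch needs.
interface HeldLock {
  type: string;
  expires_at: number; // epoch seconds
}

// Hypothetical in-memory store keyed by lock path.
const heldLocks = new Map<string, HeldLock>();

// Return the first prefix of `path` that holds an unexpired lock, if any.
function blockedBy(path: string, now: number): string | undefined {
  const segments = path.split('/');
  for (let i = 1; i <= segments.length; i++) {
    const prefix = segments.slice(0, i).join('/');
    const held = heldLocks.get(prefix);
    if (held !== undefined && held.expires_at > now) {
      return prefix;
    }
  }
  return undefined;
}

// `lock` implies `check`: only write the leaf record when no prefix is locked.
function tryLock(path: string, type: string, now: number, seconds: number): boolean {
  if (blockedBy(path, now) !== undefined) {
    return false;
  }
  heldLocks.set(path, { type, expires_at: now + seconds });
  return true;
}

// Remove the leaf record only; parents are never touched.
function unlock(path: string): void {
  heldLocks.delete(path);
}

const firstDeploy = tryLock('apps/staging/a/chat-app', 'deploy', 1000, 300);
const secondDeploy = tryLock('apps/staging/a/chat-app', 'deploy', 1010, 300);
unlock('apps/staging/a/chat-app');
const retriedDeploy = tryLock('apps/staging/a/chat-app', 'deploy', 1020, 300);
console.log(firstDeploy, secondDeploy, retriedDeploy); // true false true
```

The check-then-write in `tryLock` is not atomic. With a shared data store, two jobs could both pass the check, so a real backend would likely need some form of conditional write.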
Deploy Path
The path to a service, starting with the cluster and environment: `apps/staging/a/auth-app`.
Path components may include:
- cluster
- env (`account`)
- target
- service (namespace)
- branch (`ref`)
When locking a path, only the leaf path is locked, not parents.
When checking a path, each segment is checked recursively, so a lock at `apps/staging` will prevent all services from being deployed into both the `apps/staging/a` and `apps/staging/b` clusters.
- cluster comes first because that is how we structure the git repositories (repo = cluster, branch = env)
- to lock multiple clusters in the same environment, run the command repeatedly with the same lock data
- to lock a specific branch, put it in the path: `apps/staging/a/auth-app/main`
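As a rough illustration of the recursive check described above, the sketch below expands a path into its prefixes and looks each one up in a set of locked paths. `pathPrefixes`, `findLock`, and `lockedPaths` are hypothetical names used only for this example, and a plain `Set` stands in for the data store.

```typescript
// Expand a deploy path into every prefix, from the root segment down to the leaf.
function pathPrefixes(path: string): string[] {
  const segments = path.split('/').filter((segment) => segment.length > 0);
  return segments.map((_, index) => segments.slice(0, index + 1).join('/'));
}

// Return the first locked prefix, or undefined when nothing along the path is locked.
function findLock(path: string, lockedPaths: Set<string>): string | undefined {
  return pathPrefixes(path).find((prefix) => lockedPaths.has(prefix));
}

const lockedPaths = new Set<string>(['apps/staging']);

console.log(pathPrefixes('apps/staging/a/auth-app'));
// [ 'apps', 'apps/staging', 'apps/staging/a', 'apps/staging/a/auth-app' ]
console.log(findLock('apps/staging/a/auth-app', lockedPaths)); // apps/staging
console.log(findLock('apps/staging/b/chat-app', lockedPaths)); // apps/staging
console.log(findLock('apps/production/a/auth-app', lockedPaths)); // undefined
```

Because only the leaf is written when locking, the single `apps/staging` record is enough to block both clusters beneath it.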
Ultimately, the deploy path's layout should follow the hierarchy of resources that you want to lock. One potential order, for a multi-cloud Kubernetes architecture, is:
- cloud
- account
- region
- network
- cluster
- namespace
- resource name
Such as `aws/staging/us-east-1/apps/a/auth-app/api` or `gcp/production/us-east4-a/tools/gitlab/runner/ci`.
Including the region in the path can be limiting, but also allows locking an entire provider-region in case of serious upstream incidents.
Lock Data
Each lock must contain the following fields:
interface Lock {
  type: 'automation' | 'deploy' | 'freeze' | 'incident' | 'maintenance';
  path: string;
  author: string;
  links: Map<string, string>;
  // often duplicates of path, but useful for cross-project locks
  source: string;
  // Timestamps, calculated from --duration and --until
  created_at: number;
  updated_at: number;
  expires_at: number;
  // CI fields, optional
  ci?: {
    project: string;
    ref: string;
    commit: string;
    pipeline: string;
    job: string;
  };
}
If `$CI` is not set, the `ci` sub-struct will not be present.
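One way the optional `ci` block could be filled in is sketched below. This is an illustration under assumptions: `LockData`, `CiData`, `ciFromEnv`, and `buildLock` are hypothetical names, the environment is passed in as a plain object rather than read from the process, empty strings stand in for missing CI variables, and setting `created_at` and `updated_at` to the same value on creation is a guess.

```typescript
type LockType = 'automation' | 'deploy' | 'freeze' | 'incident' | 'maintenance';

interface CiData {
  project: string;
  ref: string;
  commit: string;
  pipeline: string;
  job: string;
}

// Mirrors the Lock interface above, under a different name for this sketch.
interface LockData {
  type: LockType;
  path: string;
  author: string;
  links: Map<string, string>;
  source: string;
  created_at: number;
  updated_at: number;
  expires_at: number;
  ci?: CiData;
}

type Env = Record<string, string | undefined>;

// Build the CI sub-struct only when $CI is set; otherwise leave it out entirely.
function ciFromEnv(env: Env): CiData | undefined {
  if (env['CI'] === undefined) {
    return undefined;
  }
  return {
    project: env['CI_PROJECT_PATH'] ?? '',
    ref: env['CI_COMMIT_REF_SLUG'] ?? '',
    commit: env['CI_COMMIT_SHA'] ?? '',
    pipeline: env['CI_PIPELINE_ID'] ?? '',
    job: env['CI_JOB_ID'] ?? '',
  };
}

// Assemble a lock record; both timestamps start at `now` (an assumption).
function buildLock(
  base: Omit<LockData, 'ci' | 'created_at' | 'updated_at'>,
  now: number,
  env: Env,
): LockData {
  const lock: LockData = { ...base, created_at: now, updated_at: now };
  const ci = ciFromEnv(env);
  if (ci !== undefined) {
    lock.ci = ci;
  }
  return lock;
}

const exampleLock = buildLock(
  {
    type: 'automation',
    path: 'apps/acceptance',
    author: 'qa@example.com',
    links: new Map<string, string>(),
    source: 'apps/acceptance',
    expires_at: 1000 + 90 * 60,
  },
  1000,
  { CI: 'true', CI_PROJECT_PATH: 'group/saas-app', CI_COMMIT_REF_SLUG: 'develop' },
);
console.log(exampleLock.ci);
```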
Messaging
- create a new lock: `locked ${path} for ${type:friendly} until ${expires_at:datetime}`
  - Locked `apps/acceptance/a` for a deploy until Sat 31 Dec, 12:00
  - Locked `gitlab/production` for an incident until Sat 31 Dec, 12:00
- error, existing lock: `error: ${path} is locked until ${expires_at:datetime} by ${type:friendly} in ${source}`
  - Error: `apps/acceptance` is locked until Sat 31 Dec, 12:00 by an automation run in `testing/staging`.
Friendly Types
Friendly strings for `type`:
- `automation`: An automation run
- `deploy`: A deploy
- `freeze`: A release freeze
- `incident`: An incident
- `maintenance`: A maintenance window
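The message templates and friendly types above could be combined along these lines. This is only a sketch: `FRIENDLY_TYPES`, `lockedMessage`, and `errorMessage` are hypothetical names, the friendly strings are lowercased because they appear mid-sentence in the example messages, and the expiry is passed in as an already formatted string to keep date formatting out of the example.

```typescript
type LockType = 'automation' | 'deploy' | 'freeze' | 'incident' | 'maintenance';

// Friendly strings, lowercased for use in the middle of a sentence.
const FRIENDLY_TYPES: Record<LockType, string> = {
  automation: 'an automation run',
  deploy: 'a deploy',
  freeze: 'a release freeze',
  incident: 'an incident',
  maintenance: 'a maintenance window',
};

// Message for a newly created lock.
function lockedMessage(path: string, type: LockType, until: string): string {
  return `Locked ${path} for ${FRIENDLY_TYPES[type]} until ${until}`;
}

// Message for a check or lock that ran into an existing lock.
function errorMessage(path: string, type: LockType, until: string, source: string): string {
  return `Error: ${path} is locked until ${until} by ${FRIENDLY_TYPES[type]} in ${source}.`;
}

console.log(lockedMessage('apps/acceptance/a', 'deploy', 'Sat 31 Dec, 12:00'));
// Locked apps/acceptance/a for a deploy until Sat 31 Dec, 12:00
console.log(errorMessage('apps/acceptance', 'automation', 'Sat 31 Dec, 12:00', 'testing/staging'));
// Error: apps/acceptance is locked until Sat 31 Dec, 12:00 by an automation run in testing/staging.
```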
Usage
Command-line Interface
> deploy-lock check apps/staging/a/auth-app # is equivalent to
> deploy-lock check apps && deploy-lock check apps/staging && deploy-lock check apps/staging/a && deploy-lock check apps/staging/a/auth-app
> deploy-lock check apps/staging/a/auth-app --recursive=false # only checks the leaf node
> deploy-lock list apps/staging # list all locks within the apps/staging path
> deploy-lock lock apps/staging --type automation --duration 60m
> deploy-lock lock apps/staging/a/auth-app --type deploy --duration 5m
> deploy-lock lock apps/staging --until 2022-12-31T12:00 # local TZ, unless Z specified
> deploy-lock prune # prune all expired locks
> deploy-lock prune apps/staging # prune expired locks within the path
> deploy-lock prune apps/staging --now future-date # prune locks that will expire by --now
> deploy-lock unlock apps/staging --type automation # unlock type must match lock type
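The `prune` variants above can be read as a filter over stored locks. The sketch below is an assumption about the semantics rather than the tool's code: `StoredLock` and `pruneLocks` are hypothetical names, "within the path" is taken to mean the path itself or anything beneath it, and a lock whose `expires_at` equals `now` is treated as expired.

```typescript
interface StoredLock {
  path: string;
  expires_at: number; // epoch seconds
}

// Return the locks that survive a prune at time `now`, optionally limited to a path.
function pruneLocks(all: StoredLock[], now: number, within?: string): StoredLock[] {
  return all.filter((lock) => {
    const inScope =
      within === undefined || lock.path === within || lock.path.startsWith(within + '/');
    const expired = lock.expires_at <= now;
    return !(inScope && expired);
  });
}

const storedLocks: StoredLock[] = [
  { path: 'apps/staging', expires_at: 100 },
  { path: 'apps/staging/a/auth-app', expires_at: 500 },
  { path: 'apps/production', expires_at: 100 },
];

// prune: drops every lock that expired by now=200.
console.log(pruneLocks(storedLocks, 200).map((lock) => lock.path));
// prune apps/staging: expired locks outside the path are kept.
console.log(pruneLocks(storedLocks, 200, 'apps/staging').map((lock) => lock.path));
// prune apps/staging --now <future>: also drops locks that will have expired by then.
console.log(pruneLocks(storedLocks, 600, 'apps/staging').map((lock) => lock.path));
```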
Basic Options
- command
  - one of `check`, `list`, `lock`, `prune`, `unlock`
- `<path>`
  - positional, required
  - string
  - lock path
  - always lowercase (forced in code)
  - `/^[-a-z\/]+$/`
- `--now`
  - number, optional
  - defaults to current epoch time
- `--recursive`
  - boolean
  - recursively check locks
  - defaults to true for `check`
  - defaults to false for `lock`, `unlock`
- `--type`
  - string, enum
  - type of lock
  - one of `automation`, `deploy`, `freeze`, `incident`, or `maintenance`
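The `<path>` rules above (forced lowercase, then matched against the pattern) might look roughly like this. `PATH_PATTERN` copies the regular expression from the option list; `normalizePath` and the choice to throw on a mismatch are assumptions for this example.

```typescript
// Pattern copied from the <path> option description.
const PATH_PATTERN = /^[-a-z\/]+$/;

// Lowercase the path first, then reject anything the pattern does not cover.
function normalizePath(input: string): string {
  const lowered = input.toLowerCase();
  if (!PATH_PATTERN.test(lowered)) {
    throw new Error(`invalid lock path: ${input}`);
  }
  return lowered;
}

console.log(normalizePath('Apps/Staging/A/Auth-App')); // apps/staging/a/auth-app

try {
  normalizePath('apps/staging_a');
} catch (err) {
  console.log((err as Error).message); // invalid lock path: apps/staging_a
}
```

Note that the pattern as written contains no digits, so a segment such as `us-east-1` from the multi-cloud examples would be rejected by this sketch. Whether the tool itself accepts digits is not stated here.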
Lock Data Options
- `--author`
  - string
  - defaults to `$GITLAB_USER_EMAIL` if `$GITLAB_CI` is set
  - defaults to `$USER` otherwise
- `--duration`
  - string
  - duration of lock, relative to now
  - may be given in epoch seconds (`\d+`), as an ISO-8601 date, or a human interval (`30m`)
  - mutually exclusive with `--until`
- `--link`
  - array, strings
- `--source`
  - string
  - each component:
    - first
      - defaults to `$CLUSTER_NAME` if set
      - defaults to `path.split.0` otherwise
    - second
      - defaults to `$DEPLOY_ENV` if set
      - defaults to `path.split.1` otherwise
    - third
      - defaults to `$DEPLOY_TARGET` if set
      - defaults to `path.split.2` otherwise
    - fourth
      - defaults to `--ci-project` if set
      - defaults to `$CI_PROJECT_PATH` if set
      - defaults to `path.split.3` otherwise
    - fifth
      - defaults to `--ci-ref` if set
      - defaults to `$CI_COMMIT_REF_SLUG` if set
      - defaults to `path.split.4` otherwise
- `--until`
  - string, timestamp
  - duration of lock, absolute
  - may be given in epoch seconds (`\d+`) or as an ISO-8601 date (intervals are not allowed)
  - mutually exclusive with `--duration`
- `--ci-project`
  - optional string
  - project path
  - defaults to `$CI_PROJECT_PATH` if set
  - defaults to `path.split.3` otherwise
- `--ci-ref`
  - optional string
  - branch or tag
  - defaults to `$CI_COMMIT_REF_SLUG` if set
  - defaults to `path.split.4` otherwise
- `--ci-commit`
  - optional string
  - SHA of ref
  - defaults to `$CI_COMMIT_SHA` if set
- `--ci-pipeline`
  - optional string
  - pipeline ID
  - defaults to `$CI_PIPELINE_ID` if set
- `--ci-job`
  - optional string
  - job ID
  - defaults to `$CI_JOB_ID` if set
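To make the relationship between `--duration`, `--until`, and `expires_at` concrete, here is a small sketch. It rests on several assumptions: the function names are hypothetical, bare digits given to `--duration` are treated as a number of seconds from now, the interval units `s`, `m`, `h`, and `d` are a guess, the ISO-8601 form of `--duration` is left out, and nothing is returned when neither option is given because the default is not documented here.

```typescript
// Seconds per unit for human intervals; the supported units are an assumption.
function unitSeconds(unit: string): number {
  switch (unit) {
    case 's': return 1;
    case 'm': return 60;
    case 'h': return 3600;
    case 'd': return 86400;
    default: throw new Error(`unknown interval unit: ${unit}`);
  }
}

// Relative duration in seconds: bare digits, or "<number><unit>" such as 90m or 6h.
function parseDuration(value: string): number {
  if (/^\d+$/.test(value)) {
    return Number(value);
  }
  const match = /^(\d+)([smhd])$/.exec(value);
  if (match === null) {
    throw new Error(`unsupported duration: ${value}`);
  }
  const [, amount = '', unit = ''] = match;
  return Number(amount) * unitSeconds(unit);
}

// Absolute timestamp in epoch seconds: bare digits or an ISO-8601 date.
function parseUntil(value: string): number {
  if (/^\d+$/.test(value)) {
    return Number(value);
  }
  const millis = Date.parse(value);
  if (Number.isNaN(millis)) {
    throw new Error(`unsupported timestamp: ${value}`);
  }
  return Math.floor(millis / 1000);
}

interface ExpiryOptions {
  duration?: string;
  until?: string;
}

// Resolve expires_at, enforcing that the two options are mutually exclusive.
function resolveExpiry(now: number, options: ExpiryOptions): number | undefined {
  if (options.duration !== undefined && options.until !== undefined) {
    throw new Error('--duration and --until are mutually exclusive');
  }
  if (options.duration !== undefined) {
    return now + parseDuration(options.duration);
  }
  if (options.until !== undefined) {
    return parseUntil(options.until);
  }
  return undefined;
}

console.log(resolveExpiry(1000, { duration: '90m' })); // 6400
console.log(resolveExpiry(1000, { until: '2022-12-31T12:00:00Z' }));
```

`Date.parse` reads a date-time string without an offset as local time, which lines up with the "local TZ, unless Z specified" note on the `--until` example in the command list.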
Storage Backend Options
- `--storage`
  - string
  - one of `dynamodb`, `memory`
- `--region`
  - string, optional
  - DynamoDB region name
- `--table`
  - string
  - DynamoDB table name
- `--endpoint`
  - string, optional
  - DynamoDB endpoint
  - set to `http://localhost:8000` for testing with https://hub.docker.com/r/amazon/dynamodb-local
- `--fake`
  - string, optional
  - a fake lock that should be added to the in-memory data store
  - the in-memory data store always starts empty, so this is the only way to have an existing lock
REST API
Endpoints
- `/locks` GET
  - equivalent to `deploy-lock list`
- `/locks` DELETE
  - equivalent to `deploy-lock prune`
- `/locks/:path` GET
  - equivalent to `deploy-lock check`
- `/locks/:path` PUT
  - equivalent to `deploy-lock lock`
- `/locks/:path` DELETE
  - equivalent to `deploy-lock unlock`
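The endpoint table above amounts to a mapping from method and URL to a command. The sketch below shows that mapping only, with no HTTP server involved. `commandFor` and `RoutedCommand` are hypothetical names, and passing the slash-delimited lock path unencoded after `/locks/` is an assumption.

```typescript
type Command = 'check' | 'list' | 'lock' | 'prune' | 'unlock';

interface RoutedCommand {
  command: Command;
  path?: string;
}

// Map an HTTP method and URL onto the equivalent CLI command, if there is one.
function commandFor(method: string, url: string): RoutedCommand | undefined {
  if (url === '/locks') {
    if (method === 'GET') return { command: 'list' };
    if (method === 'DELETE') return { command: 'prune' };
    return undefined;
  }
  const prefix = '/locks/';
  if (url.startsWith(prefix) && url.length > prefix.length) {
    const lockPath = url.slice(prefix.length);
    switch (method) {
      case 'GET': return { command: 'check', path: lockPath };
      case 'PUT': return { command: 'lock', path: lockPath };
      case 'DELETE': return { command: 'unlock', path: lockPath };
      default: return undefined;
    }
  }
  return undefined;
}

console.log(commandFor('GET', '/locks'));
console.log(commandFor('PUT', '/locks/apps/staging/a/auth-app'));
console.log(commandFor('POST', '/locks'));
```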
Development
Features
- in-memory data store, mostly for testing
- DynamoDB data store
- lock paths and recursive checking
- infer lock data from CI variables
Building
- Clone with `git clone git@github.com:ssube/deploy-lock.git`
- Switch into the project directory with `cd deploy-lock`
- Run a full lint, build, and test with `make ci`
- Run the program's help with `make run-help` or `node out/src/index.js --help`
Testing
- Launch DynamoDB Local with `podman run --rm -p 8000:8000 docker.io/amazon/dynamodb-local`
- Create a profile with `aws configure --profile localddb`
  - placeholder tokens (`foo` and `bar` are fine)
  - us-east-1 region
  - json output
- Create a `locks` table with `aws dynamodb --endpoint-url http://localhost:8000 --profile localddb create-table --attribute-definitions 'AttributeName=path,AttributeType=S' --table-name locks --key-schema 'AttributeName=path,KeyType=HASH' --billing-mode PAY_PER_REQUEST`
- Run commands using `AWS_PROFILE=localddb deploy-lock --storage dynamodb --table locks --endpoint http://localhost:8000 ...`
TODOs
- Infer lock source from other arguments/CI variables
- SQL data store, with history (don't need to remove old records)
- S3 data store
- REST API with lock endpoints
- Kubernetes admission controller with webhook endpoint
Questions
- In the deploy path, should account come before region or region before account?
  - `aws/us-east-1/staging` vs `aws/staging/us-east-1`
  - This is purely a recommendation in the docs, `lock.path` and `lock.source` will both be slash-delimited or array paths.
- Should there be an `update` or `replace` command?
  - Probably not, at least not without lock history or multi-party locks.
  - When the data store can keep old locks, `replace` could expire an existing lock and create a new one
  - With multi-party locks, `update` could update the `expires_at` and add a new author
- Should `--recursive` be available for `lock` and `unlock`, or only `check`?
  - TBD
  - A recursive `lock` would write multiple records
  - A recursive `unlock` could delete multiple records
- Should locks have multiple authors?
  - TBD
  - It doesn't make sense to have more than one active lock for the same path
    - Or does it?
    - Different levels can use `--allow` without creating multiple locks
  - But having multiple authors would allow for multi-party locks
    - for CI: `[gitlab, $GITLAB_USER_NAME]`
    - for an incident: `[first-responder, incident-commander]`
  - Each author has to `unlock` before the lock is removed/released
- Should `LockData.env` be a string/array, like `LockData.path`?
  - Done
  - Very probably yes, because otherwise it will need `env.cloud`, `env.network`, etc, and those are not always predictable/present.
- Should there be an `--allow`/`LockData.allow` field?
  - Probably yes
  - When running `check --type`, if `LockData.allow` includes `--type`, it will be allowed
    - `freeze` should allow `automation`, but not `deploy`
    - `incident` could allow `deploy`, but not `automation`
- Wildcards in paths?
  - Probably no, it will become confusing pretty quickly, and KV stores do not support them consistently (or at all).
- Authz for API mutations?
  - If there is a REST API, it might need authn/authz.
  - Keeping the API private could work.
  - Authorization should be scoped by path.
- How should the `AdmissionReview` fields be mapped to path?
  - This could vary by user and may need to be configurable.