Deploy Lock
This is a tool to lock a cluster or service, in order to prevent people from deploying changes during test automation or an infrastructure incident.
Abstract
Deploy Path
The path to a service, starting with the cluster and environment: apps/staging/a/auth-app.
Components:
- cluster
- env (account)
- target
- service (namespace)
- branch (ref)
When locking a path, only the full path is locked, not parents.
When checking a path, each segment is checked recursively, so a lock at apps/staging will prevent all services from being deployed into both the A and B clusters.
- cluster comes first because that is how we structure the git repositories (repo = cluster, branch = env)
- to lock multiple clusters in the same environment, run the command repeatedly with the same lock data
- to lock a specific branch, put it in the path: apps/staging/a/auth-app/main
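The recursive check described above can be sketched as a prefix expansion followed by a lookup. This is only an illustration: pathPrefixes, findLock, and the Set standing in for the lock store are hypothetical names, not part of the tool. The sketch visits every prefix including the bare cluster segment, as the walkthroughs under Example Usage do.

```typescript
// Hypothetical helper: expand a deploy path into every prefix that a
// recursive check would visit, from the cluster down to the full path.
function pathPrefixes(path: string): string[] {
  const segments = path.split('/').filter((s) => s.length > 0);
  return segments.map((_, i) => segments.slice(0, i + 1).join('/'));
}

// Hypothetical helper: return the first locked prefix, or undefined.
// `locked` stands in for whatever storage backend holds the lock records.
function findLock(path: string, locked: Set<string>): string | undefined {
  return pathPrefixes(path).find((prefix) => locked.has(prefix));
}

const locked = new Set(['apps/staging']);
console.log(findLock('apps/staging/a/auth-app', locked)); // apps/staging
console.log(findLock('apps/production/a/auth-app', locked)); // undefined
```

Because only the exact path is stored when locking, a lock on apps/staging/a does not affect a check of apps/staging itself.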
TODO: should the service name (component 3) map to the kubernetes namespace rather than the service?
Lock Data
Each lock must contain the following fields:
interface Lock {
  type: 'automation' | 'deploy' | 'incident';
  author: string;
  links: Map<string, string>;

  // Timestamps, calculated from --duration and --until
  created_at: number;
  updated_at: number;
  expires_at: number;

  // Env fields
  // often duplicates of path, but useful for cross-project locks
  env: {
    cluster: string;
    account: string;
    target?: string; // optional
  };

  // CI fields, optional
  ci?: {
    project: string;
    ref: string;
    commit: string;
    pipeline: string;
    job: string;
  };
}
Friendly strings for type: An automation run, A deploy, An incident.
If $CI is not set, the ci sub-struct will not be present.
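One way the optional ci sub-struct might be populated is sketched below. The function name ciFromEnv and the empty-string fallback for unset variables are assumptions made for illustration; the variable names are the ones listed under User Options. The environment is passed in as a plain object so the sketch does not depend on a particular runtime.

```typescript
interface CiFields {
  project: string;
  ref: string;
  commit: string;
  pipeline: string;
  job: string;
}

type Env = Record<string, string | undefined>;

// Hypothetical helper: build the ci sub-struct, or omit it when $CI is unset.
function ciFromEnv(env: Env): CiFields | undefined {
  if (!env.CI) {
    return undefined;
  }
  return {
    // Falling back to '' is an assumption; the source does not say what
    // happens when $CI is set but an individual variable is missing.
    project: env.CI_PROJECT_PATH ?? '',
    ref: env.CI_COMMIT_REF_SLUG ?? '',
    commit: env.CI_COMMIT_SHA ?? '',
    pipeline: env.CI_PIPELINE_ID ?? '',
    job: env.CI_JOB_ID ?? '',
  };
}

console.log(ciFromEnv({ USER: 'alice' })); // undefined
```

In a Node.js process the argument would typically be process.env.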
TODO: should there be a type for A release freeze (freeze)?
Messaging
- create a new lock: 'locked ${path} for ${type:friendly} until ${expires_at:datetime}'
  - Locked apps/acceptance/a for a deploy until Sat 31 Dec, 12:00
  - Locked gitlab/production for an incident until Sat 31 Dec, 12:00
- error, existing lock: 'error: ${path} is locked until ${expires_at:datetime} by ${type:friendly} in ${cluster}/${env}'
  - Error: apps/acceptance is locked until Sat 31 Dec, 12:00 by an automation run in testing/staging.
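The two message templates could be rendered roughly as follows. This is a sketch under assumptions: timestamps are taken to be epoch milliseconds, dates are printed in UTC to keep the output reproducible, and the function names are hypothetical.

```typescript
type LockType = 'automation' | 'deploy' | 'incident';

// Friendly strings for each lock type, lowercased for use mid-sentence.
const FRIENDLY: Record<LockType, string> = {
  automation: 'an automation run',
  deploy: 'a deploy',
  incident: 'an incident',
};

const DAYS = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'];
const MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'];

function pad(n: number): string {
  return n < 10 ? `0${n}` : String(n);
}

// Render an epoch-millisecond timestamp like "Sat 31 Dec, 12:00" (UTC, assumed).
function formatDatetime(ms: number): string {
  const d = new Date(ms);
  return `${DAYS[d.getUTCDay()]} ${d.getUTCDate()} ${MONTHS[d.getUTCMonth()]}, ` +
    `${pad(d.getUTCHours())}:${pad(d.getUTCMinutes())}`;
}

function lockedMessage(path: string, type: LockType, expiresAt: number): string {
  return `Locked ${path} for ${FRIENDLY[type]} until ${formatDatetime(expiresAt)}`;
}

function errorMessage(
  path: string, type: LockType, expiresAt: number, cluster: string, account: string,
): string {
  return `Error: ${path} is locked until ${formatDatetime(expiresAt)} ` +
    `by ${FRIENDLY[type]} in ${cluster}/${account}`;
}

const expires = Date.UTC(2022, 11, 31, 12, 0);
console.log(lockedMessage('apps/acceptance/a', 'deploy', expires));
// Locked apps/acceptance/a for a deploy until Sat 31 Dec, 12:00
```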
Example Usage
Prevent a deploy during an automation run
This would be used to prevent an application deploy during a test automation run, to make sure the application does not restart or change versions and invalidate the test results.
- QA starts an automation run
  - Automation calls deploy-lock lock apps/acceptance --type automation --duration 90m
- Someone merges code into develop of saas-app
  - The saas-app pipeline runs a deploy job
  - The deploy job calls deploy-lock check apps/acceptance/a/saas-app/develop, which recursively checks:
    - apps
    - apps/acceptance
      - locked by automation, exit with an error
    - apps/acceptance/a
    - apps/acceptance/a/saas-app
    - apps/acceptance/a/saas-app/develop
  - Deploy job exits with an error, does not deploy
- Automation pipeline ends
  - Final job calls deploy-lock unlock apps/acceptance --type automation
    - Specifying the --type during unlock prevents automation/deploy jobs from accidentally removing an incident
  - If the final automation job does not run, the lock will still expire after 90 minutes (--duration)
- Retry saas-app deploy job
  - No lock, runs normally
Prevent a deploy during a production incident
This would be used to prevent an application deploy during an infrastructure outage, to make sure existing pods continue running.
- DevOps receives an alert and declares an incident for the apps/production/a cluster
- The first responder runs deploy-lock lock apps/production --type incident --duration 6h
  - This locks both production clusters while we shift traffic to the working one
- Someone merges code into main of auth-app
  - The auth-app pipeline runs a deploy job
  - The deploy job calls deploy-lock check apps/production/a/auth-app, which recursively checks:
    - apps
    - apps/production
      - locked by incident, exit with an error
    - apps/production/a
    - apps/production/a/auth-app
  - Deploy job exits with an error, does not deploy
- Incident is resolved
  - First responder runs deploy-lock unlock apps/production --type incident
- Retry auth-app deploy job
  - No lock, runs normally
Prevent duplicate deploys of the same service from conflicting
This would be used to prevent multiple simultaneous deploys of the same project from conflicting with one another, in a service without ephemeral environments/branch switching.
- Someone starts a pipeline on feature/foo of chat-app
  - The chat-app pipeline runs a deploy job
  - The deploy job calls deploy-lock lock apps/staging/a/chat-app
- Someone else starts a pipeline on feature/bar of chat-app
  - The chat-app pipeline runs another deploy job
    - The first one has not finished and is still mid-rollout
  - The second deploy job calls deploy-lock lock apps/staging/a/chat-app, which recursively checks:
    - apps
    - apps/staging
    - apps/staging/a
    - apps/staging/a/chat-app
      - locked by deploy, exit with an error
    - lock implies check
  - Second deploy job fails with an error, does not deploy
- First deploy succeeds
  - Deploy job calls deploy-lock unlock apps/staging/a/chat-app
- Second deploy job can be retried
  - No lock, runs normally
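The three scenarios above share the same mechanics: lock implies check, unlock only succeeds when the type matches, and an expired lock no longer blocks anything. A minimal in-memory sketch of those rules follows. The class MemoryLocks, its method signatures, and the use of epoch milliseconds are assumptions for illustration, not the tool's actual implementation; the current time is passed in explicitly so the run is repeatable.

```typescript
type LockType = 'automation' | 'deploy' | 'incident';

interface LockRecord {
  type: LockType;
  expires_at: number; // assumed to be epoch milliseconds
}

// Hypothetical in-memory lock table.
class MemoryLocks {
  private records = new Map<string, LockRecord>();

  // Return the first prefix of `path` holding an unexpired lock, if any.
  check(path: string, now: number): string | undefined {
    const segments = path.split('/');
    for (let i = 1; i <= segments.length; i++) {
      const prefix = segments.slice(0, i).join('/');
      const record = this.records.get(prefix);
      if (record !== undefined && record.expires_at > now) {
        return prefix;
      }
    }
    return undefined;
  }

  // "lock implies check": refuse to write when any prefix is already locked.
  lock(path: string, type: LockType, durationMs: number, now: number): boolean {
    if (this.check(path, now) !== undefined) {
      return false;
    }
    this.records.set(path, { type, expires_at: now + durationMs });
    return true;
  }

  // Only remove the record when the caller's type matches the lock's type.
  unlock(path: string, type: LockType): boolean {
    const record = this.records.get(path);
    if (record === undefined || record.type !== type) {
      return false;
    }
    this.records.delete(path);
    return true;
  }
}

const MINUTE = 60 * 1000;
const locks = new MemoryLocks();
locks.lock('apps/acceptance', 'automation', 90 * MINUTE, 0);
console.log(locks.check('apps/acceptance/a/saas-app/develop', 10 * MINUTE)); // apps/acceptance
console.log(locks.unlock('apps/acceptance', 'deploy')); // false
console.log(locks.check('apps/acceptance/a/saas-app/develop', 91 * MINUTE)); // undefined
```

The last line shows the 90 minute expiry from the first scenario: the record is still stored, but it no longer blocks the check.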
Command-line Interface
> deploy-lock lock --path apps/staging --type automation --duration 60m
> deploy-lock lock --path apps/staging/a/auth-app --type deploy --duration 5m
> deploy-lock lock --path apps/staging --until 2022-12-31T12:00 # local TZ, unless Z specified
> deploy-lock unlock --path apps/staging --type automation # unlock type must match lock type
> deploy-lock check --path apps/staging/a/auth-app # is equivalent to
> deploy-lock check --path apps/staging --path apps/staging/a --path apps/staging/a/auth-app
> deploy-lock check --path apps/staging/a/auth-app --recursive=false # only checks the leaf node
> deploy-lock prune --path apps/staging # prune expired locks within the path
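The prune command is described only as removing expired locks within a path. One possible reading is sketched below; it assumes "within" means the path itself or anything beneath it on a segment boundary, and that expires_at is in epoch milliseconds. The function name and the Map used as the record store are hypothetical.

```typescript
interface StoredLock {
  expires_at: number; // assumed to be epoch milliseconds
}

// Hypothetical prune: delete expired records at or below `root` and
// return the paths that were removed.
function prune(records: Map<string, StoredLock>, root: string, now: number): string[] {
  const expired: string[] = [];
  records.forEach((record, path) => {
    // Match on a segment boundary so apps/staging does not match apps/staging-old.
    const within = path === root || path.startsWith(root + '/');
    if (within && record.expires_at <= now) {
      expired.push(path);
    }
  });
  expired.forEach((path) => records.delete(path));
  return expired;
}

const stored = new Map<string, StoredLock>([
  ['apps/staging', { expires_at: 10 }],
  ['apps/staging/a/auth-app', { expires_at: 1000 }],
  ['apps/production', { expires_at: 10 }],
]);
console.log(prune(stored, 'apps/staging', 100).join(',')); // apps/staging
```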
User Options
--type
- string, enum
- type of lock
- one of automation, deploy, or incident
--path
- array, strings
- record paths
- always lowercase (force in code)
/^[-a-z\/]+$/
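The lowercase rule and the pattern above might be enforced like this. normalizePath is a hypothetical name, and rejecting a non-matching path with an error is an assumption; note that the pattern as written does not admit digits or underscores.

```typescript
// Pattern copied from the --path option: lowercase letters, hyphens, slashes.
const PATH_PATTERN = /^[-a-z\/]+$/;

// Hypothetical helper: force the path to lowercase, then validate it.
function normalizePath(raw: string): string {
  const path = raw.toLowerCase();
  if (!PATH_PATTERN.test(path)) {
    throw new Error(`invalid path: ${raw}`);
  }
  return path;
}

console.log(normalizePath('Apps/Staging/A/Auth-App')); // apps/staging/a/auth-app
```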
--author
- string
- defaults to $GITLAB_USER_EMAIL if $GITLAB_CI is set
- defaults to $USER otherwise
--duration
- number
- duration of lock, relative to now
- mutually exclusive with --until
--until
- string, timestamp
- expiration of lock, absolute timestamp
- mutually exclusive with --duration
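A sketch of how --duration and --until could be turned into expires_at. Assumptions: durations use the 90m / 6h form seen in the examples (only m and h are handled here), timestamps are epoch milliseconds, and the function names are hypothetical. What happens when neither flag is given is not specified, so the sketch returns undefined in that case. Date.parse treats a date-time string without an offset as local time and honours a trailing Z, which matches the comment in the CLI examples.

```typescript
const UNIT_MS: Record<string, number> = {
  m: 60 * 1000,
  h: 60 * 60 * 1000,
};

// Hypothetical parser for durations such as "90m" or "6h".
function parseDuration(text: string): number {
  const match = /^(\d+)([mh])$/.exec(text);
  if (match === null) {
    throw new Error(`invalid duration: ${text}`);
  }
  return Number(match[1]) * UNIT_MS[match[2]];
}

// Hypothetical resolver: --duration is relative to `now`, --until is absolute.
function expiresAt(
  opts: { duration?: string; until?: string },
  now: number,
): number | undefined {
  if (opts.duration !== undefined && opts.until !== undefined) {
    throw new Error('--duration and --until are mutually exclusive');
  }
  if (opts.duration !== undefined) {
    return now + parseDuration(opts.duration);
  }
  if (opts.until !== undefined) {
    const parsed = Date.parse(opts.until); // local TZ unless Z is specified
    if (Number.isNaN(parsed)) {
      throw new Error(`invalid timestamp: ${opts.until}`);
    }
    return parsed;
  }
  return undefined; // behaviour without either flag is left open here
}

console.log(parseDuration('90m')); // 5400000
```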
--recursive
- boolean
- recursively check locks
- defaults to true for check
- defaults to false for lock, unlock
--env-cluster
- string, enum
- defaults to $CLUSTER_NAME if set
- defaults to --path.split.0 otherwise
--env-account
- string, enum
- defaults to $DEPLOY_ENV if set
- defaults to --path.split.1 otherwise
--env-target
- optional string
/^[a-z]$/
- defaults to $DEPLOY_TARGET if set
- defaults to --path.split.2 otherwise
--ci-project
- optional string
- project path
- defaults to $CI_PROJECT_PATH if set
- defaults to --path.split.3 otherwise
--ci-ref
- optional string
- branch or tag
- defaults to $CI_COMMIT_REF_SLUG if set
- defaults to --path.split.4 otherwise
--ci-commit
- optional string
- SHA of ref
- defaults to $CI_COMMIT_SHA if set
--ci-pipeline
- optional string
- pipeline ID
- defaults to $CI_PIPELINE_ID if set
--ci-job
- optional string
- job ID
- defaults to $CI_JOB_ID if set
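The defaulting rules for --author, --env-* and --ci-* can be sketched as "environment variable if set, otherwise the matching path segment". The helper names and the shape of the returned object are hypothetical, and an explicitly passed flag would presumably take precedence over both, which is not modelled here. The environment is passed in as a plain object.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical helper: prefer the environment variable, else the path segment
// at `index` (undefined when the path is too short).
function envOrSegment(env: Env, name: string, path: string, index: number): string | undefined {
  const value = env[name];
  if (value !== undefined && value !== '') {
    return value;
  }
  return path.split('/')[index];
}

function resolveDefaults(path: string, env: Env) {
  return {
    author: env.GITLAB_CI ? env.GITLAB_USER_EMAIL : env.USER,
    envCluster: envOrSegment(env, 'CLUSTER_NAME', path, 0),
    envAccount: envOrSegment(env, 'DEPLOY_ENV', path, 1),
    envTarget: envOrSegment(env, 'DEPLOY_TARGET', path, 2),
    ciProject: envOrSegment(env, 'CI_PROJECT_PATH', path, 3),
    ciRef: envOrSegment(env, 'CI_COMMIT_REF_SLUG', path, 4),
  };
}

const defaults = resolveDefaults('apps/staging/a/auth-app', { USER: 'alice' });
console.log(defaults.envCluster, defaults.envAccount, defaults.envTarget); // apps staging a
```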
TODO: should there be an update or replace command?
TODO: should --recursive be available for lock or only unlock? A recursive lock would write multiple records.
Backend Options
--storage
- string
- one of dynamodb, memory
--table
- string
- DynamoDB table name
REST API
Endpoints
/locks GET
/locks POST
/locks/:path GET
?