split up detail docs
This commit is contained in:
parent
cfd72a3efd
commit
302bd8f4ca
233
README.md
233
README.md
|
@ -3,21 +3,21 @@
|
|||
This is a tool to lock a cluster or service, in order to prevent people from deploying changes during test automation or
|
||||
restarting pods during an infrastructure incident.
|
||||
|
||||
![readme banner top](docs/banner-top.png)
|
||||
## Features
|
||||
|
||||
- in-memory data store, mostly for testing
|
||||
- DynamoDB data store
|
||||
- lock paths and recursive checking
|
||||
- infer lock data from CI variables
|
||||
|
||||
## Contents
|
||||
|
||||
- [Deploy Lock](#deploy-lock)
|
||||
- [Features](#features)
|
||||
- [Contents](#contents)
|
||||
- [Abstract](#abstract)
|
||||
- [Example Usage](#example-usage)
|
||||
- [Prevent a deploy during an automation run](#prevent-a-deploy-during-an-automation-run)
|
||||
- [Prevent a deploy during a production incident](#prevent-a-deploy-during-a-production-incident)
|
||||
- [Prevent duplicate deploys of the same service from conflicting](#prevent-duplicate-deploys-of-the-same-service-from-conflicting)
|
||||
- [Deploy Path](#deploy-path)
|
||||
- [Concepts](#concepts)
|
||||
- [Lock Path](#lock-path)
|
||||
- [Lock Data](#lock-data)
|
||||
- [Messaging](#messaging)
|
||||
- [Friendly Types](#friendly-types)
|
||||
- [Usage](#usage)
|
||||
- [Command-line Interface](#command-line-interface)
|
||||
- [Basic Options](#basic-options)
|
||||
|
@ -26,123 +26,16 @@ restarting pods during an infrastructure incident.
|
|||
- [Admission Controller Options](#admission-controller-options)
|
||||
- [REST API](#rest-api)
|
||||
- [Endpoints](#endpoints)
|
||||
- [Development](#development)
|
||||
- [Features](#features)
|
||||
- [Building](#building)
|
||||
- [Testing](#testing)
|
||||
- [TODOs](#todos)
|
||||
- [Questions](#questions)
|
||||
|
||||
## Abstract
|
||||
## Concepts
|
||||
|
||||
### Example Usage
|
||||
![readme banner top](docs/banner-top.png)
|
||||
|
||||
#### Prevent a deploy during an automation run
|
||||
### Lock Path
|
||||
|
||||
This would be used to prevent an application deploy during a test automation run, to make sure the application does not
|
||||
restart or change versions and invalidate the test results.
|
||||
Briefly describe paths.
|
||||
|
||||
1. QA starts an automation run
|
||||
1. Automation calls `deploy-lock lock apps/acceptance --type automation --duration 90m`
|
||||
2. Someone merges code into `develop` of `saas-app`
|
||||
1. The `saas-app` pipeline runs a deploy job
|
||||
2. The deploy job calls `deploy-lock check apps/acceptance/a/saas-app/develop`, which recursively checks:
|
||||
1. `apps`
|
||||
2. `apps/acceptance`
|
||||
1. locked by automation, exit with an error
|
||||
3. `apps/acceptance/a`
|
||||
4. `apps/acceptance/a/saas-app`
|
||||
5. `apps/acceptance/a/saas-app/develop`
|
||||
3. Deploy job exits with an error, _does not_ deploy
|
||||
3. Automation pipeline ends
|
||||
1. Final job calls `deploy-lock unlock apps/acceptance --type automation`
|
||||
1. Specifying the `--type` during `unlock` prevents automation/deploy jobs from accidentally removing an incident
|
||||
2. If the final automation job does not run, the lock will still expire after 90 minutes (`--duration`)
|
||||
4. Retry `saas-app` deploy job
|
||||
1. No lock, runs normally
|
||||
|
||||
#### Prevent a deploy during a production incident
|
||||
|
||||
This would be used to prevent an application deploy during an infrastructure outage, to make sure existing pods
|
||||
continue running.
|
||||
|
||||
1. DevOps receives an alert and declares an incident for the `apps/production/a` cluster
|
||||
2. The first responder runs `deploy-lock lock apps/production --type incident --duration 6h`
|
||||
1. This locks _both_ production clusters while we shift traffic to the working one
|
||||
3. Someone merges code into `main` of `auth-app`
|
||||
1. The `auth-app` pipeline runs a deploy job
|
||||
2. The deploy job calls `deploy-lock check apps/production/a/auth-app`, which recursively checks:
|
||||
1. `apps`
|
||||
2. `apps/production`
|
||||
1. locked by incident, exit with an error
|
||||
3. `apps/production/a`
|
||||
4. `apps/production/a/auth-app`
|
||||
3. Deploy job exits with an error, _does not_ deploy
|
||||
4. Incident is resolved
|
||||
1. First responder runs `deploy-lock unlock apps/production --type incident`
|
||||
5. Retry `auth-app` deploy job
|
||||
1. No lock, runs normally
|
||||
|
||||
#### Prevent duplicate deploys of the same service from conflicting
|
||||
|
||||
This would be used to prevent multiple simultaneous deploys of the same project from conflicting with one another, in
|
||||
a service without ephemeral environments/branch switching.
|
||||
|
||||
1. Someone starts a pipeline on `feature/foo` of `chat-app`
|
||||
1. The `chat-app` pipeline runs a deploy job
|
||||
2. The deploy job calls `deploy-lock lock apps/staging/a/chat-app`
|
||||
2. Someone else starts a pipeline on `feature/bar` of `chat-app`
|
||||
1. The `chat-app` pipeline runs another deploy job
|
||||
1. The first one has not finished and is still mid-rollout
|
||||
2. The second deploy job calls `deploy-lock lock apps/staging/a/chat-app`, which recursively checks:
|
||||
1. `apps`
|
||||
2. `apps/staging`
|
||||
3. `apps/staging/a`
|
||||
4. `apps/staging/a/chat-app`
|
||||
1. locked by deploy, exit with an error
|
||||
5. `lock` implies `check`
|
||||
3. Second deploy job fails with an error, _does not_ deploy
|
||||
3. First deploy succeeds
|
||||
1. Deploy job calls `deploy-lock unlock apps/staging/a/chat-app`
|
||||
4. Second deploy job can be retried
|
||||
1. No lock, runs normally
|
||||
|
||||
### Deploy Path
|
||||
|
||||
The path to a service, starting with the cluster and environment: `apps/staging/a/auth-app`.
|
||||
|
||||
Path components may include:
|
||||
|
||||
- cluster
|
||||
- env (`account`)
|
||||
- target
|
||||
- service (namespace)
|
||||
- branch (`ref`)
|
||||
|
||||
When _locking_ a path, only the leaf path is locked, not parents.
|
||||
|
||||
When _checking_ a path, each segment is checked recursively, so a lock at `apps/staging` will prevent all services
|
||||
from being deployed into both the `apps/staging/a` and `apps/staging/b` clusters.
|
||||
|
||||
- cluster comes first because that is how we structure the git repositories (repo = cluster, branch = env)
|
||||
- to lock multiple clusters in the same environment, run the command repeatedly with the same lock data
|
||||
- to lock a specific branch, put it in the path: `apps/staging/a/auth-app/main`
|
||||
|
||||
Ultimately, the deploy path's layout should follow the hierarchy of resources that you want to lock. One potential
|
||||
order, for a multi-cloud Kubernetes architecture, is:
|
||||
|
||||
- cloud
|
||||
- account
|
||||
- region
|
||||
- network
|
||||
- cluster
|
||||
- namespace
|
||||
- resource name
|
||||
|
||||
Such as `aws/staging/us-east-1/apps/a/auth-app/api` or `gcp/production/us-east4-a/tools/gitlab/runner/ci`.
|
||||
|
||||
Including the region in the path can be limiting, but also allows locking an entire provider-region in case of serious
|
||||
upstream incidents.
|
||||
[More details here.](./docs/concepts.md#deploy-path)
|
||||
|
||||
### Lock Data
|
||||
|
||||
|
@ -175,24 +68,6 @@ interface Lock {
|
|||
|
||||
If `$CI` is not set, the `ci` sub-struct will not be present.
|
||||
|
||||
### Messaging
|
||||
|
||||
- create a new lock: `locked ${path} for ${type:friendly} until ${expires_at:datetime}`
|
||||
- > Locked `apps/acceptance/a` for a deploy until Sat 31 Dec, 12:00
|
||||
- > Locked `gitlab/production` for an incident until Sat 31 Dec, 12:00
|
||||
- error, existing lock: `error: ${path} is locked until ${expires_at:datetime} by ${type:friendly} in ${source}`
|
||||
- > Error: `apps/acceptance` is locked until Sat 31 Dec, 12:00 by an automation run in `testing/staging`.
|
||||
|
||||
#### Friendly Types
|
||||
|
||||
Friendly strings for `type`:
|
||||
|
||||
- `automation`: `An automation run`
|
||||
- `deploy`: `A deploy`
|
||||
- `freeze`: `A release freeze`
|
||||
- `incident`: `An incident`
|
||||
- `maintenance`: `A maintenance window`
|
||||
|
||||
## Usage
|
||||
|
||||
### Command-line Interface
|
||||
|
@ -360,83 +235,3 @@ Friendly strings for `type`:
|
|||
- health check
|
||||
|
||||
![readme bottom banner](docs/banner-bottom.png)
|
||||
|
||||
## Development
|
||||
|
||||
### Features
|
||||
|
||||
- in-memory data store, mostly for testing
|
||||
- DynamoDB data store
|
||||
- lock paths and recursive checking
|
||||
- infer lock data from CI variables
|
||||
|
||||
### Building
|
||||
|
||||
1. Clone with `git clone git@github.com:ssube/deploy-lock.git`
|
||||
2. Switch into the project directory with `cd deploy-lock`
|
||||
3. Run a full lint, build, and test with `make ci`
|
||||
4. Run the program's help with `make run-help` or `node out/src/index.js --help`
|
||||
|
||||
### Testing
|
||||
|
||||
You can test locally without a real DDB table using https://hub.docker.com/r/amazon/dynamodb-local.
|
||||
|
||||
1. Launch DynamoDB Local with `podman run --rm -p 8000:8000 docker.io/amazon/dynamodb-local`
|
||||
2. Create a profile with `aws configure --profile localddb`
|
||||
1. placeholder tokens
|
||||
2. us-east-1 region
|
||||
3. json output
|
||||
3. Create a `locks` table with `aws dynamodb --endpoint-url http://localhost:8000 --profile localddb create-table --attribute-definitions 'AttributeName=path,AttributeType=S' --table-name locks --key-schema 'AttributeName=path,KeyType=HASH' --billing-mode PAY_PER_REQUEST`
|
||||
4. Run commands using `AWS_PROFILE=localddb deploy-lock --storage dynamo --table locks --endpoint http://localhost:8000 ...`
|
||||
|
||||
### TODOs
|
||||
|
||||
1. Infer lock source from arguments/environment, like `CI_` variables
|
||||
2. SQL data store, with history (don't need to remove old records)
|
||||
3. S3 data store
|
||||
4. Kubernetes admission controller with configurable paths
|
||||
|
||||
Other potential data stores could include: flat file, kubernetes configmap, etcd itself, consul, redis.
|
||||
|
||||
### Questions
|
||||
|
||||
1. In the [deploy path](#deploy-path), should account come before region or region before account?
|
||||
1. `aws/us-east-1/staging` vs `aws/staging/us-east-1`
|
||||
2. This is purely a recommendation in the docs, `lock.path` and `lock.source` will both be slash-delimited or array
|
||||
paths.
|
||||
2. Should there be an `update` or `replace` command?
|
||||
1. Probably not, at least not without lock history or multi-party locks.
|
||||
2. When the data store can keep old locks, `replace` could expire an existing lock and create a new one
|
||||
3. With multi-party locks, `update` could update the `expires_at` and add a new author
|
||||
3. Should `--recursive` be available for `lock` and `unlock`, or only `check`?
|
||||
1. TBD
|
||||
2. A recursive `lock` would write multiple records
|
||||
3. A recursive `unlock` could delete multiple records
|
||||
4. Should locks have multiple authors?
|
||||
1. TBD
|
||||
2. It doesn't make sense to have more than one active lock for the same path
|
||||
1. Or does it?
|
||||
2. Different levels can use `--allow` without creating multiple locks
|
||||
3. But having multiple authors would allow for multi-party locks
|
||||
1. for CI: `[gitlab, $GITLAB_USER_NAME]`
|
||||
2. for an incident: `[first-responder, incident-commander]`
|
||||
4. Each author has to `unlock` before the lock is removed/released
|
||||
5. Should `LockData.env` be a string/array, like `LockData.path`?
|
||||
1. Done
|
||||
2. Very probably yes, because otherwise it will need `env.cloud`, `env.network`, etc, and those
|
||||
are not always predictable/present.
|
||||
6. Should there be an `--allow`/`LockData.allow` field?
|
||||
1. Probably yes
|
||||
2. When running `check --type`, if `LockData.allow` includes `--type`, it will be allowed
|
||||
1. `freeze` should allow `automation`, but not `deploy`
|
||||
2. `incident` could allow `deploy`, but not `automation`
|
||||
7. Wildcards in paths?
|
||||
1. Probably no, it will become confusing pretty quickly, and KV stores do not support them consistently (or at all).
|
||||
8. Authz for API mutations?
|
||||
1. If there is a REST API, it might need authn/authz.
|
||||
2. Keeping the API private _could_ work.
|
||||
3. Authorization should be scoped by path.
|
||||
9. How should the `AdmissionReview` fields be mapped to path?
|
||||
1. This could vary by user and may need to be configurable.
|
||||
2. Probably using an argument, `--admission-path`
|
||||
3. Will eventually need to use jsonpath for access to `userInfo.groups` list or maps
|
||||
|
|
|
@ -12,9 +12,40 @@ Explain the following:
|
|||
|
||||
## Deploy Path
|
||||
|
||||
Anything that can be deployed or locked needs to be identified by a consistent path.
|
||||
The path to a service, starting with the cluster and environment: `apps/staging/a/auth-app`.
|
||||
|
||||
Paths are user-defined and slash-delimited.
|
||||
Path components may include:
|
||||
|
||||
- cluster
|
||||
- env (`account`)
|
||||
- target
|
||||
- service (namespace)
|
||||
- branch (`ref`)
|
||||
|
||||
When _locking_ a path, only the leaf path is locked, not parents.
|
||||
|
||||
When _checking_ a path, each segment is checked recursively, so a lock at `apps/staging` will prevent all services
|
||||
from being deployed into both the `apps/staging/a` and `apps/staging/b` clusters.
|
||||
|
||||
- cluster comes first because that is how we structure the git repositories (repo = cluster, branch = env)
|
||||
- to lock multiple clusters in the same environment, run the command repeatedly with the same lock data
|
||||
- to lock a specific branch, put it in the path: `apps/staging/a/auth-app/main`
|
||||
|
||||
Ultimately, the deploy path's layout should follow the hierarchy of resources that you want to lock. One potential
|
||||
order, for a multi-cloud Kubernetes architecture, is:
|
||||
|
||||
- cloud
|
||||
- account
|
||||
- region
|
||||
- network
|
||||
- cluster
|
||||
- namespace
|
||||
- resource name
|
||||
|
||||
Such as `aws/staging/us-east-1/apps/a/auth-app/api` or `gcp/production/us-east4-a/tools/gitlab/runner/ci`.
|
||||
|
||||
Including the region in the path can be limiting, but also allows locking an entire provider-region in case of serious
|
||||
upstream incidents.
|
||||
|
||||
## Exclusive and Partial Locks
|
||||
|
||||
|
|
|
@ -0,0 +1,104 @@
|
|||
# Developing
|
||||
|
||||
## Contents
|
||||
|
||||
- [Developing](#developing)
|
||||
- [Contents](#contents)
|
||||
- [Building](#building)
|
||||
- [Testing](#testing)
|
||||
- [Messaging](#messaging)
|
||||
- [Friendly Types](#friendly-types)
|
||||
- [TODOs](#todos)
|
||||
- [Features](#features)
|
||||
- [Questions](#questions)
|
||||
|
||||
## Building
|
||||
|
||||
1. Clone with `git clone git@github.com:ssube/deploy-lock.git`
|
||||
2. Switch into the project directory with `cd deploy-lock`
|
||||
3. Run a full lint, build, and test with `make ci`
|
||||
4. Run the program's help with `make run-help` or `node out/src/index.js --help`
|
||||
|
||||
## Testing
|
||||
|
||||
You can test locally without a real DDB table using https://hub.docker.com/r/amazon/dynamodb-local.
|
||||
|
||||
1. Launch DynamoDB Local with `podman run --rm -p 8000:8000 docker.io/amazon/dynamodb-local`
|
||||
2. Create a profile with `aws configure --profile localddb`
|
||||
1. placeholder tokens
|
||||
2. us-east-1 region
|
||||
3. json output
|
||||
3. Create a `locks` table with `aws dynamodb --endpoint-url http://localhost:8000 --profile localddb create-table --attribute-definitions 'AttributeName=path,AttributeType=S' --table-name locks --key-schema 'AttributeName=path,KeyType=HASH' --billing-mode PAY_PER_REQUEST`
|
||||
4. Run commands using `AWS_PROFILE=localddb deploy-lock --storage dynamo --table locks --endpoint http://localhost:8000 ...`
|
||||
|
||||
## Messaging
|
||||
|
||||
- create a new lock: `locked ${path} for ${type:friendly} until ${expires_at:datetime}`
|
||||
- > Locked `apps/acceptance/a` for a deploy until Sat 31 Dec, 12:00
|
||||
- > Locked `gitlab/production` for an incident until Sat 31 Dec, 12:00
|
||||
- error, existing lock: `error: ${path} is locked until ${expires_at:datetime} by ${type:friendly} in ${source}`
|
||||
- > Error: `apps/acceptance` is locked until Sat 31 Dec, 12:00 by an automation run in `testing/staging`.
|
||||
|
||||
### Friendly Types
|
||||
|
||||
Friendly strings for `type`:
|
||||
|
||||
- `automation`: `An automation run`
|
||||
- `deploy`: `A deploy`
|
||||
- `freeze`: `A release freeze`
|
||||
- `incident`: `An incident`
|
||||
- `maintenance`: `A maintenance window`
|
||||
|
||||
## TODOs
|
||||
|
||||
### Features
|
||||
|
||||
1. Infer lock source from arguments/environment, like `CI_` variables
|
||||
2. SQL data store, with history (don't need to remove old records)
|
||||
3. S3 data store
|
||||
4. Kubernetes admission controller with configurable paths
|
||||
|
||||
Other potential data stores could include: flat file, kubernetes configmap, etcd itself, consul, redis.
|
||||
|
||||
### Questions
|
||||
|
||||
1. In the [deploy path](../READMDE.md#deploy-path), should account come before region or region before account?
|
||||
1. `aws/us-east-1/staging` vs `aws/staging/us-east-1`
|
||||
2. This is purely a recommendation in the docs, `lock.path` and `lock.source` will both be slash-delimited or array
|
||||
paths.
|
||||
2. Should there be an `update` or `replace` command?
|
||||
1. Probably not, at least not without lock history or multi-party locks.
|
||||
2. When the data store can keep old locks, `replace` could expire an existing lock and create a new one
|
||||
3. With multi-party locks, `update` could update the `expires_at` and add a new author
|
||||
3. Should `--recursive` be available for `lock` and `unlock`, or only `check`?
|
||||
1. TBD
|
||||
2. A recursive `lock` would write multiple records
|
||||
3. A recursive `unlock` could delete multiple records
|
||||
4. Should locks have multiple authors?
|
||||
1. TBD
|
||||
2. It doesn't make sense to have more than one active lock for the same path
|
||||
1. Or does it?
|
||||
2. Different levels can use `--allow` without creating multiple locks
|
||||
3. But having multiple authors would allow for multi-party locks
|
||||
1. for CI: `[gitlab, $GITLAB_USER_NAME]`
|
||||
2. for an incident: `[first-responder, incident-commander]`
|
||||
4. Each author has to `unlock` before the lock is removed/released
|
||||
5. Should `LockData.env` be a string/array, like `LockData.path`?
|
||||
1. Done
|
||||
2. Very probably yes, because otherwise it will need `env.cloud`, `env.network`, etc, and those
|
||||
are not always predictable/present.
|
||||
6. Should there be an `--allow`/`LockData.allow` field?
|
||||
1. Probably yes
|
||||
2. When running `check --type`, if `LockData.allow` includes `--type`, it will be allowed
|
||||
1. `freeze` should allow `automation`, but not `deploy`
|
||||
2. `incident` could allow `deploy`, but not `automation`
|
||||
7. Wildcards in paths?
|
||||
1. Probably no, it will become confusing pretty quickly, and KV stores do not support them consistently (or at all).
|
||||
8. Authz for API mutations?
|
||||
1. If there is a REST API, it might need authn/authz.
|
||||
2. Keeping the API private _could_ work.
|
||||
3. Authorization should be scoped by path.
|
||||
9. How should the `AdmissionReview` fields be mapped to path?
|
||||
1. This could vary by user and may need to be configurable.
|
||||
2. Probably using an argument, `--admission-path`
|
||||
3. Will eventually need to use jsonpath for access to `userInfo.groups` list or maps
|
|
@ -4,38 +4,105 @@
|
|||
|
||||
- [Getting Started](#getting-started)
|
||||
- [Contents](#contents)
|
||||
- [Why use this tool?](#why-use-this-tool)
|
||||
- [How to use this tool](#how-to-use-this-tool)
|
||||
- [Why Use This Tool?](#why-use-this-tool)
|
||||
- [How to Use This Tool](#how-to-use-this-tool)
|
||||
- [Setup and Prerequisites](#setup-and-prerequisites)
|
||||
- [Install and first run](#install-and-first-run)
|
||||
- [Lock something](#lock-something)
|
||||
- [Check and unlock](#check-and-unlock)
|
||||
- [Daily workflow](#daily-workflow)
|
||||
- [As a Devops](#as-a-devops)
|
||||
- [As a Release Manager](#as-a-release-manager)
|
||||
- [Configure DynamoDB storage](#configure-dynamodb-storage)
|
||||
- [Lock Something](#lock-something)
|
||||
- [Check and Unlock](#check-and-unlock)
|
||||
- [Common Workflows](#common-workflows)
|
||||
- [Prevent a deploy during an automation run](#prevent-a-deploy-during-an-automation-run)
|
||||
- [Prevent a deploy during a production incident](#prevent-a-deploy-during-a-production-incident)
|
||||
- [Prevent duplicate deploys of the same service from conflicting](#prevent-duplicate-deploys-of-the-same-service-from-conflicting)
|
||||
- [Using in CI](#using-in-ci)
|
||||
- [Automation](#automation)
|
||||
- [Deploy](#deploy)
|
||||
- [Using in Kubernetes](#using-in-kubernetes)
|
||||
- [Validating Admission Controller](#validating-admission-controller)
|
||||
|
||||
## Why use this tool?
|
||||
## Why Use This Tool?
|
||||
|
||||
## How to use this tool
|
||||
## How to Use This Tool
|
||||
|
||||
### Setup and Prerequisites
|
||||
|
||||
### Install and first run
|
||||
### Configure DynamoDB storage
|
||||
|
||||
### Lock something
|
||||
### Lock Something
|
||||
|
||||
### Check and unlock
|
||||
### Check and Unlock
|
||||
|
||||
## Daily workflow
|
||||
## Common Workflows
|
||||
|
||||
### As a Devops
|
||||
### Prevent a deploy during an automation run
|
||||
|
||||
### As a Release Manager
|
||||
This would be used to prevent an application deploy during a test automation run, to make sure the application does not
|
||||
restart or change versions and invalidate the test results.
|
||||
|
||||
1. QA starts an automation run
|
||||
1. Automation calls `deploy-lock lock apps/acceptance --type automation --duration 90m`
|
||||
2. Someone merges code into `develop` of `saas-app`
|
||||
1. The `saas-app` pipeline runs a deploy job
|
||||
2. The deploy job calls `deploy-lock check apps/acceptance/a/saas-app/develop`, which recursively checks:
|
||||
1. `apps`
|
||||
2. `apps/acceptance`
|
||||
1. locked by automation, exit with an error
|
||||
3. `apps/acceptance/a`
|
||||
4. `apps/acceptance/a/saas-app`
|
||||
5. `apps/acceptance/a/saas-app/develop`
|
||||
3. Deploy job exits with an error, _does not_ deploy
|
||||
3. Automation pipeline ends
|
||||
1. Final job calls `deploy-lock unlock apps/acceptance --type automation`
|
||||
1. Specifying the `--type` during `unlock` prevents automation/deploy jobs from accidentally removing an incident
|
||||
2. If the final automation job does not run, the lock will still expire after 90 minutes (`--duration`)
|
||||
4. Retry `saas-app` deploy job
|
||||
1. No lock, runs normally
|
||||
|
||||
### Prevent a deploy during a production incident
|
||||
|
||||
This would be used to prevent an application deploy during an infrastructure outage, to make sure existing pods
|
||||
continue running.
|
||||
|
||||
1. DevOps receives an alert and declares an incident for the `apps/production/a` cluster
|
||||
2. The first responder runs `deploy-lock lock apps/production --type incident --duration 6h`
|
||||
1. This locks _both_ production clusters while we shift traffic to the working one
|
||||
3. Someone merges code into `main` of `auth-app`
|
||||
1. The `auth-app` pipeline runs a deploy job
|
||||
2. The deploy job calls `deploy-lock check apps/production/a/auth-app`, which recursively checks:
|
||||
1. `apps`
|
||||
2. `apps/production`
|
||||
1. locked by incident, exit with an error
|
||||
3. `apps/production/a`
|
||||
4. `apps/production/a/auth-app`
|
||||
3. Deploy job exits with an error, _does not_ deploy
|
||||
4. Incident is resolved
|
||||
1. First responder runs `deploy-lock unlock apps/production --type incident`
|
||||
5. Retry `auth-app` deploy job
|
||||
1. No lock, runs normally
|
||||
|
||||
### Prevent duplicate deploys of the same service from conflicting
|
||||
|
||||
This would be used to prevent multiple simultaneous deploys of the same project from conflicting with one another, in
|
||||
a service without ephemeral environments/branch switching.
|
||||
|
||||
1. Someone starts a pipeline on `feature/foo` of `chat-app`
|
||||
1. The `chat-app` pipeline runs a deploy job
|
||||
2. The deploy job calls `deploy-lock lock apps/staging/a/chat-app`
|
||||
2. Someone else starts a pipeline on `feature/bar` of `chat-app`
|
||||
1. The `chat-app` pipeline runs another deploy job
|
||||
1. The first one has not finished and is still mid-rollout
|
||||
2. The second deploy job calls `deploy-lock lock apps/staging/a/chat-app`, which recursively checks:
|
||||
1. `apps`
|
||||
2. `apps/staging`
|
||||
3. `apps/staging/a`
|
||||
4. `apps/staging/a/chat-app`
|
||||
1. locked by deploy, exit with an error
|
||||
5. `lock` implies `check`
|
||||
3. Second deploy job fails with an error, _does not_ deploy
|
||||
3. First deploy succeeds
|
||||
1. Deploy job calls `deploy-lock unlock apps/staging/a/chat-app`
|
||||
4. Second deploy job can be retried
|
||||
1. No lock, runs normally
|
||||
|
||||
## Using in CI
|
||||
|
||||
|
|
Loading…
Reference in New Issue