
split up detail docs

Sean Sube 2023-01-04 10:03:12 -06:00
parent cfd72a3efd
commit 302bd8f4ca
4 changed files with 234 additions and 237 deletions

README.md

@@ -3,21 +3,21 @@
This is a tool to lock a cluster or service, in order to prevent people from deploying changes during test automation or
restarting pods during an infrastructure incident.
![readme banner top](docs/banner-top.png)
## Features
- in-memory data store, mostly for testing
- DynamoDB data store
- lock paths and recursive checking
- infer lock data from CI variables
## Contents
- [Deploy Lock](#deploy-lock)
- [Features](#features)
- [Contents](#contents)
- [Abstract](#abstract)
- [Example Usage](#example-usage)
- [Prevent a deploy during an automation run](#prevent-a-deploy-during-an-automation-run)
- [Prevent a deploy during a production incident](#prevent-a-deploy-during-a-production-incident)
- [Prevent duplicate deploys of the same service from conflicting](#prevent-duplicate-deploys-of-the-same-service-from-conflicting)
- [Deploy Path](#deploy-path)
- [Concepts](#concepts)
- [Lock Path](#lock-path)
- [Lock Data](#lock-data)
- [Messaging](#messaging)
- [Friendly Types](#friendly-types)
- [Usage](#usage)
- [Command-line Interface](#command-line-interface)
- [Basic Options](#basic-options)
@@ -26,123 +26,16 @@ restarting pods during an infrastructure incident.
- [Admission Controller Options](#admission-controller-options)
- [REST API](#rest-api)
- [Endpoints](#endpoints)
- [Development](#development)
- [Features](#features)
- [Building](#building)
- [Testing](#testing)
- [TODOs](#todos)
- [Questions](#questions)
## Abstract
## Concepts
### Example Usage
![readme banner top](docs/banner-top.png)
#### Prevent a deploy during an automation run
### Lock Path
This would be used to prevent an application deploy during a test automation run, to make sure the application does not
restart or change versions and invalidate the test results.
Briefly describe paths.
1. QA starts an automation run
1. Automation calls `deploy-lock lock apps/acceptance --type automation --duration 90m`
2. Someone merges code into `develop` of `saas-app`
1. The `saas-app` pipeline runs a deploy job
2. The deploy job calls `deploy-lock check apps/acceptance/a/saas-app/develop`, which recursively checks:
1. `apps`
2. `apps/acceptance`
1. locked by automation, exit with an error
3. `apps/acceptance/a`
4. `apps/acceptance/a/saas-app`
5. `apps/acceptance/a/saas-app/develop`
3. Deploy job exits with an error, _does not_ deploy
3. Automation pipeline ends
1. Final job calls `deploy-lock unlock apps/acceptance --type automation`
1. Specifying the `--type` during `unlock` prevents automation/deploy jobs from accidentally removing an incident
2. If the final automation job does not run, the lock will still expire after 90 minutes (`--duration`)
4. Retry `saas-app` deploy job
1. No lock, runs normally
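The recursive check walked through above can be sketched in a few lines of TypeScript. This is an illustrative sketch against an in-memory set of locked paths, not the real deploy-lock internals; the function names are hypothetical.

```typescript
// Expand a slash-delimited path into every prefix, root first, so a lock
// anywhere above the leaf blocks the whole subtree.
function pathPrefixes(path: string): Array<string> {
  const segments = path.split('/');
  return segments.map((_, i) => segments.slice(0, i + 1).join('/'));
}

// A check passes only when no prefix of the path is locked.
function checkPath(path: string, locks: Set<string>): boolean {
  return pathPrefixes(path).every((prefix) => locks.has(prefix) === false);
}
```

For example, `pathPrefixes('apps/acceptance/a/saas-app/develop')` yields exactly the five paths listed in step 2.2, and a lock on `apps/acceptance` fails the check at the second prefix.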
#### Prevent a deploy during a production incident
This would be used to prevent an application deploy during an infrastructure outage, to make sure existing pods
continue running.
1. DevOps receives an alert and declares an incident for the `apps/production/a` cluster
2. The first responder runs `deploy-lock lock apps/production --type incident --duration 6h`
1. This locks _both_ production clusters while we shift traffic to the working one
3. Someone merges code into `main` of `auth-app`
1. The `auth-app` pipeline runs a deploy job
2. The deploy job calls `deploy-lock check apps/production/a/auth-app`, which recursively checks:
1. `apps`
2. `apps/production`
1. locked by incident, exit with an error
3. `apps/production/a`
4. `apps/production/a/auth-app`
3. Deploy job exits with an error, _does not_ deploy
4. Incident is resolved
1. First responder runs `deploy-lock unlock apps/production --type incident`
5. Retry `auth-app` deploy job
1. No lock, runs normally
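Both walkthroughs pass short durations like `90m` and `6h` to `--duration`. A minimal parser for that style of duration might look like the following sketch; the accepted units and the mapping to an `expires_at` epoch timestamp are assumptions, not the CLI's documented flag grammar.

```typescript
// Hypothetical duration grammar: an integer followed by s/m/h/d.
const UNIT_SECONDS: Record<string, number> = {
  s: 1,
  m: 60,
  h: 60 * 60,
  d: 24 * 60 * 60,
};

function parseDuration(duration: string): number {
  const match = /^(\d+)([smhd])$/.exec(duration);
  if (match === null) {
    throw new Error(`invalid duration: ${duration}`);
  }
  return parseInt(match[1], 10) * UNIT_SECONDS[match[2]];
}

// Epoch seconds at which a lock taken at `now` would expire.
function expiresAt(now: number, duration: string): number {
  return now + parseDuration(duration);
}
```

With this grammar, `90m` is 5400 seconds and `6h` is 21600, which is how an unattended lock still expires even when the final unlock job never runs.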
#### Prevent duplicate deploys of the same service from conflicting
This would be used to prevent multiple simultaneous deploys of the same project from conflicting with one another, in
a service without ephemeral environments/branch switching.
1. Someone starts a pipeline on `feature/foo` of `chat-app`
1. The `chat-app` pipeline runs a deploy job
2. The deploy job calls `deploy-lock lock apps/staging/a/chat-app`
2. Someone else starts a pipeline on `feature/bar` of `chat-app`
1. The `chat-app` pipeline runs another deploy job
1. The first one has not finished and is still mid-rollout
2. The second deploy job calls `deploy-lock lock apps/staging/a/chat-app`, which recursively checks:
1. `apps`
2. `apps/staging`
3. `apps/staging/a`
4. `apps/staging/a/chat-app`
1. locked by deploy, exit with an error
5. `lock` implies `check`
3. Second deploy job fails with an error, _does not_ deploy
3. First deploy succeeds
1. Deploy job calls `deploy-lock unlock apps/staging/a/chat-app`
4. Second deploy job can be retried
1. No lock, runs normally
### Deploy Path
The path to a service, starting with the cluster and environment: `apps/staging/a/auth-app`.
Path components may include:
- cluster
- env (`account`)
- target
- service (namespace)
- branch (`ref`)
When _locking_ a path, only the leaf path is locked, not parents.
When _checking_ a path, each segment is checked recursively, so a lock at `apps/staging` will prevent all services
from being deployed into both the `apps/staging/a` and `apps/staging/b` clusters.
- cluster comes first because that is how we structure the git repositories (repo = cluster, branch = env)
- to lock multiple clusters in the same environment, run the command repeatedly with the same lock data
- to lock a specific branch, put it in the path: `apps/staging/a/auth-app/main`
Ultimately, the deploy path's layout should follow the hierarchy of resources that you want to lock. One potential
order, for a multi-cloud Kubernetes architecture, is:
- cloud
- account
- region
- network
- cluster
- namespace
- resource name
Such as `aws/staging/us-east-1/apps/a/auth-app/api` or `gcp/production/us-east4-a/tools/gitlab/runner/ci`.
Including the region in the path can be limiting, but also allows locking an entire provider-region in case of serious
upstream incidents.
[More details here.](./docs/concepts.md#deploy-path)
### Lock Data
@@ -175,24 +68,6 @@ interface Lock {
If `$CI` is not set, the `ci` sub-struct will not be present.
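Inferring that optional `ci` sub-struct could look something like this TypeScript sketch. The GitLab-style variable names (`CI_PROJECT_PATH`, `CI_COMMIT_SHA`, etc.) and the field names are assumptions for illustration, not the project's exact schema.

```typescript
// Hypothetical shape of the optional `ci` sub-struct.
interface LockCI {
  project: string;
  ref: string;
  commit: string;
  pipeline: string;
  job: string;
}

function inferCI(env: Record<string, string | undefined>): LockCI | undefined {
  if (env.CI === undefined) {
    // not running in CI: omit the sub-struct entirely
    return undefined;
  }
  return {
    project: env.CI_PROJECT_PATH ?? '',
    ref: env.CI_COMMIT_REF_SLUG ?? '',
    commit: env.CI_COMMIT_SHA ?? '',
    pipeline: env.CI_PIPELINE_ID ?? '',
    job: env.CI_JOB_ID ?? '',
  };
}
```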
### Messaging
- create a new lock: `locked ${path} for ${type:friendly} until ${expires_at:datetime}`
- > Locked `apps/acceptance/a` for a deploy until Sat 31 Dec, 12:00
- > Locked `gitlab/production` for an incident until Sat 31 Dec, 12:00
- error, existing lock: `error: ${path} is locked until ${expires_at:datetime} by ${type:friendly} in ${source}`
- > Error: `apps/acceptance` is locked until Sat 31 Dec, 12:00 by an automation run in `testing/staging`.
#### Friendly Types
Friendly strings for `type`:
- `automation`: `An automation run`
- `deploy`: `A deploy`
- `freeze`: `A release freeze`
- `incident`: `An incident`
- `maintenance`: `A maintenance window`
## Usage
### Command-line Interface
@@ -360,83 +235,3 @@ Friendly strings for `type`:
- health check
![readme bottom banner](docs/banner-bottom.png)
## Development
### Features
- in-memory data store, mostly for testing
- DynamoDB data store
- lock paths and recursive checking
- infer lock data from CI variables
### Building
1. Clone with `git clone git@github.com:ssube/deploy-lock.git`
2. Switch into the project directory with `cd deploy-lock`
3. Run a full lint, build, and test with `make ci`
4. Run the program's help with `make run-help` or `node out/src/index.js --help`
### Testing
You can test locally without a real DDB table using https://hub.docker.com/r/amazon/dynamodb-local.
1. Launch DynamoDB Local with `podman run --rm -p 8000:8000 docker.io/amazon/dynamodb-local`
2. Create a profile with `aws configure --profile localddb`
1. placeholder tokens
2. us-east-1 region
3. json output
3. Create a `locks` table with `aws dynamodb --endpoint-url http://localhost:8000 --profile localddb create-table --attribute-definitions 'AttributeName=path,AttributeType=S' --table-name locks --key-schema 'AttributeName=path,KeyType=HASH' --billing-mode PAY_PER_REQUEST`
4. Run commands using `AWS_PROFILE=localddb deploy-lock --storage dynamo --table locks --endpoint http://localhost:8000 ...`
### TODOs
1. Infer lock source from arguments/environment, like `CI_` variables
2. SQL data store, with history (don't need to remove old records)
3. S3 data store
4. Kubernetes admission controller with configurable paths
Other potential data stores could include: flat files, Kubernetes ConfigMaps, etcd itself, Consul, or Redis.
### Questions
1. In the [deploy path](#deploy-path), should account come before region or region before account?
1. `aws/us-east-1/staging` vs `aws/staging/us-east-1`
2. This is purely a recommendation in the docs, `lock.path` and `lock.source` will both be slash-delimited or array
paths.
2. Should there be an `update` or `replace` command?
1. Probably not, at least not without lock history or multi-party locks.
2. When the data store can keep old locks, `replace` could expire an existing lock and create a new one
3. With multi-party locks, `update` could update the `expires_at` and add a new author
3. Should `--recursive` be available for `lock` and `unlock`, or only `check`?
1. TBD
2. A recursive `lock` would write multiple records
3. A recursive `unlock` could delete multiple records
4. Should locks have multiple authors?
1. TBD
2. It doesn't make sense to have more than one active lock for the same path
1. Or does it?
2. Different levels can use `--allow` without creating multiple locks
3. But having multiple authors would allow for multi-party locks
1. for CI: `[gitlab, $GITLAB_USER_NAME]`
2. for an incident: `[first-responder, incident-commander]`
4. Each author has to `unlock` before the lock is removed/released
5. Should `LockData.env` be a string/array, like `LockData.path`?
1. Done
2. Very probably yes, because otherwise it will need `env.cloud`, `env.network`, etc, and those
are not always predictable/present.
6. Should there be an `--allow`/`LockData.allow` field?
1. Probably yes
2. When running `check --type`, if `LockData.allow` includes `--type`, it will be allowed
1. `freeze` should allow `automation`, but not `deploy`
2. `incident` could allow `deploy`, but not `automation`
7. Wildcards in paths?
1. Probably no, it will become confusing pretty quickly, and KV stores do not support them consistently (or at all).
8. Authz for API mutations?
1. If there is a REST API, it might need authn/authz.
2. Keeping the API private _could_ work.
3. Authorization should be scoped by path.
9. How should the `AdmissionReview` fields be mapped to path?
1. This could vary by user and may need to be configurable.
2. Probably using an argument, `--admission-path`
   3. Will eventually need to use JSONPath to access the `userInfo.groups` list or maps

docs/concepts.md

@@ -12,9 +12,40 @@ Explain the following:
## Deploy Path
Anything that can be deployed or locked needs to be identified by a consistent path.
The path to a service, starting with the cluster and environment: `apps/staging/a/auth-app`.
Paths are user-defined and slash-delimited.
Path components may include:
- cluster
- env (`account`)
- target
- service (namespace)
- branch (`ref`)
When _locking_ a path, only the leaf path is locked, not parents.
When _checking_ a path, each segment is checked recursively, so a lock at `apps/staging` will prevent all services
from being deployed into both the `apps/staging/a` and `apps/staging/b` clusters.
- cluster comes first because that is how we structure the git repositories (repo = cluster, branch = env)
- to lock multiple clusters in the same environment, run the command repeatedly with the same lock data
- to lock a specific branch, put it in the path: `apps/staging/a/auth-app/main`
Ultimately, the deploy path's layout should follow the hierarchy of resources that you want to lock. One potential
order, for a multi-cloud Kubernetes architecture, is:
- cloud
- account
- region
- network
- cluster
- namespace
- resource name
Such as `aws/staging/us-east-1/apps/a/auth-app/api` or `gcp/production/us-east4-a/tools/gitlab/runner/ci`.
Including the region in the path can be limiting, but also allows locking an entire provider-region in case of serious
upstream incidents.
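Assembling a deploy path from that hierarchy can be sketched as below; the `PathParts` shape is a hypothetical illustration of the recommended component order, with everything after the account optional since not every component is always present.

```typescript
// Components in the recommended multi-cloud order:
// cloud / account / region / network / cluster / namespace / name.
interface PathParts {
  cloud: string;
  account: string;
  region?: string;
  network?: string;
  cluster?: string;
  namespace?: string;
  name?: string;
}

function buildPath(parts: PathParts): string {
  return [
    parts.cloud, parts.account, parts.region, parts.network,
    parts.cluster, parts.namespace, parts.name,
  ]
    .filter((part): part is string => part !== undefined)
    .join('/');
}
```

For example, the AWS path above comes out of `buildPath({ cloud: 'aws', account: 'staging', region: 'us-east-1', network: 'apps', cluster: 'a', namespace: 'auth-app', name: 'api' })`.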
## Exclusive and Partial Locks

docs/developing.md (new file)

@@ -0,0 +1,104 @@
# Developing
## Contents
- [Developing](#developing)
- [Contents](#contents)
- [Building](#building)
- [Testing](#testing)
- [Messaging](#messaging)
- [Friendly Types](#friendly-types)
- [TODOs](#todos)
- [Features](#features)
- [Questions](#questions)
## Building
1. Clone with `git clone git@github.com:ssube/deploy-lock.git`
2. Switch into the project directory with `cd deploy-lock`
3. Run a full lint, build, and test with `make ci`
4. Run the program's help with `make run-help` or `node out/src/index.js --help`
## Testing
You can test locally without a real DDB table using https://hub.docker.com/r/amazon/dynamodb-local.
1. Launch DynamoDB Local with `podman run --rm -p 8000:8000 docker.io/amazon/dynamodb-local`
2. Create a profile with `aws configure --profile localddb`
1. placeholder tokens
2. us-east-1 region
3. json output
3. Create a `locks` table with `aws dynamodb --endpoint-url http://localhost:8000 --profile localddb create-table --attribute-definitions 'AttributeName=path,AttributeType=S' --table-name locks --key-schema 'AttributeName=path,KeyType=HASH' --billing-mode PAY_PER_REQUEST`
4. Run commands using `AWS_PROFILE=localddb deploy-lock --storage dynamo --table locks --endpoint http://localhost:8000 ...`
## Messaging
- create a new lock: `locked ${path} for ${type:friendly} until ${expires_at:datetime}`
- > Locked `apps/acceptance/a` for a deploy until Sat 31 Dec, 12:00
- > Locked `gitlab/production` for an incident until Sat 31 Dec, 12:00
- error, existing lock: `error: ${path} is locked until ${expires_at:datetime} by ${type:friendly} in ${source}`
- > Error: `apps/acceptance` is locked until Sat 31 Dec, 12:00 by an automation run in `testing/staging`.
### Friendly Types
Friendly strings for `type`:
- `automation`: `An automation run`
- `deploy`: `A deploy`
- `freeze`: `A release freeze`
- `incident`: `An incident`
- `maintenance`: `A maintenance window`
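The friendly-type mapping and the lock message template above can be sketched together in TypeScript; the function name and the lowercasing of the friendly string are illustrative assumptions.

```typescript
// Friendly strings for each lock type, as listed above.
const FRIENDLY_TYPES: Record<string, string> = {
  automation: 'An automation run',
  deploy: 'A deploy',
  freeze: 'A release freeze',
  incident: 'An incident',
  maintenance: 'A maintenance window',
};

// Renders: `locked ${path} for ${type:friendly} until ${expires_at:datetime}`
function lockedMessage(path: string, type: string, expiresAt: string): string {
  const friendly = FRIENDLY_TYPES[type] ?? type;
  return `Locked \`${path}\` for ${friendly.toLowerCase()} until ${expiresAt}`;
}
```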
## TODOs
### Features
1. Infer lock source from arguments/environment, like `CI_` variables
2. SQL data store, with history (don't need to remove old records)
3. S3 data store
4. Kubernetes admission controller with configurable paths
Other potential data stores could include: flat files, Kubernetes ConfigMaps, etcd itself, Consul, or Redis.
### Questions
1. In the [deploy path](../README.md#deploy-path), should account come before region or region before account?
1. `aws/us-east-1/staging` vs `aws/staging/us-east-1`
2. This is purely a recommendation in the docs, `lock.path` and `lock.source` will both be slash-delimited or array
paths.
2. Should there be an `update` or `replace` command?
1. Probably not, at least not without lock history or multi-party locks.
2. When the data store can keep old locks, `replace` could expire an existing lock and create a new one
3. With multi-party locks, `update` could update the `expires_at` and add a new author
3. Should `--recursive` be available for `lock` and `unlock`, or only `check`?
1. TBD
2. A recursive `lock` would write multiple records
3. A recursive `unlock` could delete multiple records
4. Should locks have multiple authors?
1. TBD
2. It doesn't make sense to have more than one active lock for the same path
1. Or does it?
2. Different levels can use `--allow` without creating multiple locks
3. But having multiple authors would allow for multi-party locks
1. for CI: `[gitlab, $GITLAB_USER_NAME]`
2. for an incident: `[first-responder, incident-commander]`
4. Each author has to `unlock` before the lock is removed/released
5. Should `LockData.env` be a string/array, like `LockData.path`?
1. Done
2. Very probably yes, because otherwise it will need `env.cloud`, `env.network`, etc, and those
are not always predictable/present.
6. Should there be an `--allow`/`LockData.allow` field?
1. Probably yes
2. When running `check --type`, if `LockData.allow` includes `--type`, it will be allowed
1. `freeze` should allow `automation`, but not `deploy`
2. `incident` could allow `deploy`, but not `automation`
7. Wildcards in paths?
1. Probably no, it will become confusing pretty quickly, and KV stores do not support them consistently (or at all).
8. Authz for API mutations?
1. If there is a REST API, it might need authn/authz.
2. Keeping the API private _could_ work.
3. Authorization should be scoped by path.
9. How should the `AdmissionReview` fields be mapped to path?
1. This could vary by user and may need to be configurable.
2. Probably using an argument, `--admission-path`
   3. Will eventually need to use JSONPath to access the `userInfo.groups` list or maps

docs/getting-started.md

@@ -4,38 +4,105 @@
- [Getting Started](#getting-started)
- [Contents](#contents)
- [Why use this tool?](#why-use-this-tool)
- [How to use this tool](#how-to-use-this-tool)
- [Why Use This Tool?](#why-use-this-tool)
- [How to Use This Tool](#how-to-use-this-tool)
- [Setup and Prerequisites](#setup-and-prerequisites)
- [Install and first run](#install-and-first-run)
- [Lock something](#lock-something)
- [Check and unlock](#check-and-unlock)
- [Daily workflow](#daily-workflow)
- [As a Devops](#as-a-devops)
- [As a Release Manager](#as-a-release-manager)
- [Configure DynamoDB storage](#configure-dynamodb-storage)
- [Lock Something](#lock-something)
- [Check and Unlock](#check-and-unlock)
- [Common Workflows](#common-workflows)
- [Prevent a deploy during an automation run](#prevent-a-deploy-during-an-automation-run)
- [Prevent a deploy during a production incident](#prevent-a-deploy-during-a-production-incident)
- [Prevent duplicate deploys of the same service from conflicting](#prevent-duplicate-deploys-of-the-same-service-from-conflicting)
- [Using in CI](#using-in-ci)
- [Automation](#automation)
- [Deploy](#deploy)
- [Using in Kubernetes](#using-in-kubernetes)
- [Validating Admission Controller](#validating-admission-controller)
## Why use this tool?
## Why Use This Tool?
## How to use this tool
## How to Use This Tool
### Setup and Prerequisites
### Install and first run
### Configure DynamoDB storage
### Lock something
### Lock Something
### Check and unlock
### Check and Unlock
## Daily workflow
## Common Workflows
### As a Devops
### Prevent a deploy during an automation run
### As a Release Manager
This would be used to prevent an application deploy during a test automation run, to make sure the application does not
restart or change versions and invalidate the test results.
1. QA starts an automation run
1. Automation calls `deploy-lock lock apps/acceptance --type automation --duration 90m`
2. Someone merges code into `develop` of `saas-app`
1. The `saas-app` pipeline runs a deploy job
2. The deploy job calls `deploy-lock check apps/acceptance/a/saas-app/develop`, which recursively checks:
1. `apps`
2. `apps/acceptance`
1. locked by automation, exit with an error
3. `apps/acceptance/a`
4. `apps/acceptance/a/saas-app`
5. `apps/acceptance/a/saas-app/develop`
3. Deploy job exits with an error, _does not_ deploy
3. Automation pipeline ends
1. Final job calls `deploy-lock unlock apps/acceptance --type automation`
1. Specifying the `--type` during `unlock` prevents automation/deploy jobs from accidentally removing an incident
2. If the final automation job does not run, the lock will still expire after 90 minutes (`--duration`)
4. Retry `saas-app` deploy job
1. No lock, runs normally
### Prevent a deploy during a production incident
This would be used to prevent an application deploy during an infrastructure outage, to make sure existing pods
continue running.
1. DevOps receives an alert and declares an incident for the `apps/production/a` cluster
2. The first responder runs `deploy-lock lock apps/production --type incident --duration 6h`
1. This locks _both_ production clusters while we shift traffic to the working one
3. Someone merges code into `main` of `auth-app`
1. The `auth-app` pipeline runs a deploy job
2. The deploy job calls `deploy-lock check apps/production/a/auth-app`, which recursively checks:
1. `apps`
2. `apps/production`
1. locked by incident, exit with an error
3. `apps/production/a`
4. `apps/production/a/auth-app`
3. Deploy job exits with an error, _does not_ deploy
4. Incident is resolved
1. First responder runs `deploy-lock unlock apps/production --type incident`
5. Retry `auth-app` deploy job
1. No lock, runs normally
### Prevent duplicate deploys of the same service from conflicting
This would be used to prevent multiple simultaneous deploys of the same project from conflicting with one another, in
a service without ephemeral environments/branch switching.
1. Someone starts a pipeline on `feature/foo` of `chat-app`
1. The `chat-app` pipeline runs a deploy job
2. The deploy job calls `deploy-lock lock apps/staging/a/chat-app`
2. Someone else starts a pipeline on `feature/bar` of `chat-app`
1. The `chat-app` pipeline runs another deploy job
1. The first one has not finished and is still mid-rollout
2. The second deploy job calls `deploy-lock lock apps/staging/a/chat-app`, which recursively checks:
1. `apps`
2. `apps/staging`
3. `apps/staging/a`
4. `apps/staging/a/chat-app`
1. locked by deploy, exit with an error
5. `lock` implies `check`
3. Second deploy job fails with an error, _does not_ deploy
3. First deploy succeeds
1. Deploy job calls `deploy-lock unlock apps/staging/a/chat-app`
4. Second deploy job can be retried
1. No lock, runs normally
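The `lock` implies `check` behavior in step 2.2 can be sketched as a try-lock against an in-memory set of paths. This is an illustrative sketch only, not the real storage backend: taking a lock first checks every prefix of the path, so the second deploy job fails instead of clobbering the first, and on success only the leaf path is recorded.

```typescript
function tryLock(path: string, locks: Set<string>): boolean {
  const segments = path.split('/');
  for (let i = 1; i <= segments.length; i += 1) {
    const prefix = segments.slice(0, i).join('/');
    if (locks.has(prefix)) {
      return false; // a prefix is already locked: fail like the second job
    }
  }
  locks.add(path); // only the leaf path is locked, not its parents
  return true;
}
```

The first call for `apps/staging/a/chat-app` succeeds; a second call for the same path (or anything beneath it) fails until the first deploy unlocks.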
## Using in CI