
Ops Guides

A collection of guides made for Ops.

1 - Deployment

Learn to deploy Chall-Manager, either for production or development purposes.

You can deploy the Chall-Manager in many ways. The following table summarizes the properties of each one.

| Name       | Maintained | Isolation | Scalable | Janitor |
|------------|------------|-----------|----------|---------|
| Kubernetes | ✅         | ✅        | ✅       | ✅      |
| Binary     | ⛏️         | ❌¹       | ❌       | ✅      |
| Docker     | ✅         | ✅        | ✅²      | ✅      |

  • ✅ Supported
  • ❌ Unsupported
  • ⛏️ Work In Progress…

¹ We do not harden the configuration in the installation script, but recommend you dig into it further as your security model requires (especially for production purposes).

² Autoscaling is possible with an orchestrator (e.g. Docker Swarm).

Kubernetes

This deployment strategy guarantees you a valid infrastructure regarding our functionalities and security guidelines. Moreover, if Pulumi scares you, you will have trouble creating scenarios, so it is a good place to start!

The requirements are:

# Get the repository and its own Pulumi factory
git clone git@github.com:ctfer-io/chall-manager.git
cd chall-manager/deploy

# Use it directly!
# Don't forget to configure your stack if necessary.
# Refer to Pulumi's documentation if needed.
pulumi up
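
If your stack does need configuration, a minimal sketch could look as follows (the stack name is illustrative, and the available configuration keys depend on the Pulumi program in deploy/):

# Create a stack for this environment (name is illustrative)
pulumi stack init production

# List the current configuration, then set the keys your deployment needs
pulumi config
pulumi config set <key> <value>

# Deploy
pulumi up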

Now, you’re done!

Micro Services Architecture of chall-manager deployed in a Kubernetes cluster.

Binary

To install it on a host machine as systemd services and timers, you can run the following script.

curl -fsSL https://raw.githubusercontent.com/ctfer-io/chall-manager/main/hack/setup.sh | sh

It requires:

Don’t forget that chall-manager requires Pulumi to be installed.
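
Once the script has run, you can check the resulting units (the unit names below are assumptions; refer to hack/setup.sh for the actual ones):

# Unit names are illustrative, the setup script defines the actual ones
systemctl status chall-manager.service
systemctl list-timers | grep chall-manager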

Docker

If you are unsatisfied with how the binary installation works, its lack of an update mechanism, or its isolation, the Docker install may fit your needs.

To deploy it using Docker, you can use the official images:

  • ctferio/chall-manager
  • ctferio/chall-manager-janitor

You can verify their integrity using the following commands.

slsa-verifier verify-image "ctferio/chall-manager:<tag>@sha256:<digest>" \
    --source-uri "github.com/ctfer-io/chall-manager" \
    --source-tag "<tag>"

slsa-verifier verify-image "ctferio/chall-manager-janitor:<tag>@sha256:<digest>" \
    --source-uri "github.com/ctfer-io/chall-manager" \
    --source-tag "<tag>"
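
If you do not have the image digest at hand, one way to retrieve it is from the pulled image metadata (the tag remains a placeholder, as above):

# Pull the image, then read its digest from the local metadata
docker pull ctferio/chall-manager:<tag>
docker inspect --format='{{index .RepoDigests 0}}' ctferio/chall-manager:<tag>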

We let the reader deploy it as they need, but recommend taking a look at how we use systemd services and timers in the binary setup.sh script.

Additionally, we recommend you create a dedicated Docker network to isolate the containers from other adjacent services.
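
A minimal sketch of such a deployment follows (the network and container names, the tags and the scheduling of the janitor are illustrative; any runtime configuration the images expect is not shown):

# Dedicated network so chall-manager is not reachable from unrelated services
docker network create chall-manager

# Run the API server on that network (add the configuration your setup requires)
docker run -d --name chall-manager --network chall-manager ctferio/chall-manager:<tag>

# The janitor is meant to run periodically (see the systemd timer in setup.sh),
# for instance from a host cron job
docker run --rm --network chall-manager ctferio/chall-manager-janitor:<tag>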

2 - Monitoring

What are the signals to capture once in production, and how to deal with them?

Once in production, the chall-manager provides its functionalities to the end-users.

But production can suffer from a lot of disruptions: network latencies, service interruptions, an unexpected bug, chaos engineering going a bit too far… How can we monitor chall-manager to make sure everything goes fine? What should we monitor to quickly understand what is going on?

Metrics

A first approach to monitor what is going on inside the chall-manager is through its metrics.

| Name       | Type  | Description                          |
|------------|-------|--------------------------------------|
| challenges | int64 | The number of registered challenges. |
| instances  | int64 | The number of registered instances.  |

You can use them to build dashboards, KPIs or anything else. They can help you better understand the usage trends of chall-manager throughout an event.
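
For instance, once these metrics reach a Prometheus-compatible backend, you can check them quickly (the endpoints, ports and the exact stored metric names depend on your OpenTelemetry export pipeline, so treat the values below as placeholders):

# Raw exposition format, e.g. on a collector's Prometheus exporter
curl -s http://<metrics-endpoint>:<port>/metrics | grep -E '^(challenges|instances)'

# Or query a Prometheus server that scrapes it
curl -s 'http://<prometheus>:9090/api/v1/query?query=instances'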

Tracing

A way to go deeper in understanding what is going on inside chall-manager is through tracing.

First of all, it will provide you with information about latencies in the distributed lock system and in Pulumi manipulations. Secondly, it will also provide you with Service Performance Monitoring (SPM).

Using the OpenTelemetry Collector, you can configure it to produce RED metrics from the spans through the spanmetrics connector. When Jaeger is bound to both the OpenTelemetry Collector and the Prometheus server containing these metrics, you can monitor performances AND visualize what happens.

An example view of the Service Performance Monitoring in Jaeger, using the OpenTelemetry Collector and Prometheus server.
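
A minimal collector configuration sketch for this setup follows (the endpoints, the Jaeger address and the exporter port are illustrative; adjust them to your deployment):

# Write a minimal OpenTelemetry Collector configuration (illustrative values)
cat <<'EOF' > otelcol.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

connectors:
  # Derives RED metrics from the incoming spans
  spanmetrics: {}

exporters:
  # Exposes the derived metrics for Prometheus to scrape
  prometheus:
    endpoint: 0.0.0.0:8889
  # Forwards the spans to Jaeger (OTLP ingest)
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics, otlp/jaeger]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]
EOF

# Run it with a distribution that ships the spanmetrics connector
otelcol-contrib --config otelcol.yaml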

Through the use of those metrics and tracing capabilities, you could define alerting thresholds and automate responses or on-call notifications with Alertmanager.
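
As an illustration, a Prometheus alerting rule on the instances metric could look like this (the threshold is arbitrary, and the metric name as stored in Prometheus may carry a prefix or suffix depending on your export pipeline):

# Illustrative alerting rule; adapt the expression to the actual stored metric name
cat <<'EOF' > chall-manager-alerts.yaml
groups:
  - name: chall-manager
    rules:
      - alert: TooManyInstances
        # 500 is an arbitrary capacity threshold
        expr: instances > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: chall-manager is running an unusually high number of instances
EOF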

A reference architecture to achieve this description follows.

graph TD
    subgraph Monitoring
        COLLECTOR["OpenTelemetry Collector"]
        PROM["Prometheus"]
        JAEGER["Jaeger"]
        ALERTMANAGER["AlertManager"]
        GRAFANA["Grafana"]

        COLLECTOR --> PROM
        JAEGER --> COLLECTOR
        JAEGER --> PROM
        ALERTMANAGER --> PROM
        GRAFANA --> PROM
    end

    subgraph Chall-Manager
        CM["Chall-Manager"]
        CMJ["Chall-Manager-Janitor"]
        ETCD["etcd cluster"]

        CMJ --> CM

        CM --> |OTLP| COLLECTOR
        CM --> ETCD
    end