OpenTelemetry in SLATE
In order to get better observability of the SLATE components, we added OpenTelemetry reporting into the SLATE infrastructure.
OpenTelemetry is a collection of tools and SDKs that let developers collect information about their applications' runtime behavior, such as the time required to handle user requests and the error rates seen while processing them. OpenTelemetry can correlate and combine this information across multiple services to generate a unified view of user interactions.
We instrumented the SLATE services with OpenTelemetry in order to better track user interactions and to help with debugging errors that may occur.
As an observability framework, OpenTelemetry combines these tools and SDKs so that a user's end-to-end interaction can be recorded even when it spans multiple services such as web portals, API services, and database calls.
Generally, OpenTelemetry is used by instrumenting applications to collect telemetry data (traces, metrics, logs). This information is then sent to a collector that stores the data in a time series database. Finally, there is a frontend that presents information to administrators.
Traces and spans
Traces are the primary signal that OpenTelemetry is concerned with. A trace is intended to record all activity that occurs in a single user interaction. Traces are usually broken down into atomic units of work called spans. A span might consist of something like a SQL query run against a database, a call to a microservice, or an operation on a storage device.
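As a rough illustration of how spans relate to a trace, the sketch below models a span as a small record that carries the ID of the trace it belongs to and, optionally, the ID of its parent span. The class, field names, and the example span names are hypothetical and are not the OpenTelemetry SDK types.

```python
from dataclasses import dataclass, field
from typing import Optional
import secrets
import time


@dataclass
class Span:
    """One unit of work, e.g. a SQL query or a call to another service."""
    name: str
    trace_id: str                      # shared by every span in the same trace
    parent_id: Optional[str] = None    # None marks the root span of the trace
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None

    def finish(self) -> None:
        self.end = time.monotonic()


def new_trace_id() -> str:
    """Return a random 128-bit trace ID, hex encoded."""
    return secrets.token_hex(16)


# One user interaction: a root span with a child span for the database query.
trace_id = new_trace_id()
root = Span("GET /clusters", trace_id)
query = Span("SELECT clusters", trace_id, parent_id=root.span_id)
query.finish()
root.finish()
```

Every span in the interaction shares one trace ID, and the parent links let a frontend rebuild the call tree.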
Trace providers will:
- Aggregate spans created in an application
- Bundle the aggregate spans into traces
- Send them to a collector for further processing
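The three steps above can be sketched as a small class. This is a simplified model under stated assumptions: `SketchTraceProvider` and its methods are hypothetical names, spans are plain dictionaries, and the `export` callable stands in for the network call to a collector.

```python
from collections import defaultdict


class SketchTraceProvider:
    """Toy model of a trace provider; not the OpenTelemetry SDK class."""

    def __init__(self, export):
        self._export = export   # callable standing in for "send to collector"
        self._finished = []     # spans that have ended but are not yet sent

    def on_span_end(self, span):
        # Step 1: aggregate spans created in the application.
        self._finished.append(span)

    def flush(self):
        # Step 2: bundle the aggregated spans into traces by trace ID.
        traces = defaultdict(list)
        for span in self._finished:
            traces[span["trace_id"]].append(span)
        self._finished.clear()
        # Step 3: send each bundle on for further processing.
        for trace_id, spans in traces.items():
            self._export(trace_id, spans)


received = []
provider = SketchTraceProvider(lambda tid, spans: received.append((tid, len(spans))))
provider.on_span_end({"trace_id": "t1", "name": "handle request"})
provider.on_span_end({"trace_id": "t1", "name": "sql query"})
provider.on_span_end({"trace_id": "t2", "name": "handle request"})
provider.flush()
```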
OpenTelemetry uses collectors to receive and process traces from trace providers. A collector provides a centralized location for collecting traces from multiple trace providers and combines related traces so that interactions across different services and sources can be associated with a single user interaction.
Collectors can also apply additional processing to the traces they receive and store the processed traces in persistent storage (usually a time series database like Clickhouse).
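To make the correlation step concrete, here is a minimal sketch of how spans arriving from different services could be grouped into one timeline per user interaction. The function name, the dictionary layout, and the sample spans are all assumptions made for illustration; a real collector works on OTLP data rather than dictionaries.

```python
from collections import defaultdict


def merge_by_trace(*batches):
    """Group spans from several services by the trace ID they share."""
    merged = defaultdict(list)
    for batch in batches:
        for span in batch:
            merged[span["trace_id"]].append(span)
    # Order each trace by start time so it reads as a single timeline.
    return {tid: sorted(spans, key=lambda s: s["start"]) for tid, spans in merged.items()}


portal_spans = [
    {"trace_id": "abc", "service": "portal", "name": "GET /clusters", "start": 0.00},
]
api_spans = [
    {"trace_id": "abc", "service": "api", "name": "list clusters", "start": 0.02},
    {"trace_id": "def", "service": "api", "name": "health check", "start": 0.50},
]

traces = merge_by_trace(portal_spans, api_spans)
services_in_abc = [span["service"] for span in traces["abc"]]
```

Here the Portal span and the API span end up in the same trace because they carry the same trace ID, which is what lets one user interaction be followed across services.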
We chose to use Signoz to handle the duties of storing and presenting traces. Signoz is an open source platform for presenting OpenTelemetry data and provides:
- A Helm chart for deploying an OpenTelemetry collector and Clickhouse database pair to store traces, metrics, and logs.
- Alerting and monitoring of services based on traces received.
The SLATE API server is written in C++. In order to instrument this component, we incorporated the OpenTelemetry C++ client to generate and send traces. Although the process was a bit tedious, it was relatively straightforward.
The core of the OpenTelemetry code is located in Telemetry.cpp. The initializeTracer function is called when the server starts up. It takes the configuration settings for the server and initializes a trace provider for the API server with the appropriate collector, sampling parameters, and other settings.
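The post does not say which sampler or ratio the API server is configured with. As an illustration of what a sampling parameter can control, the sketch below makes a deterministic keep-or-drop decision from the trace ID, similar in spirit to OpenTelemetry's ratio-based sampler. The function name and the ratios used are assumptions.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministically keep roughly `ratio` of traces, keyed on the trace ID.

    Hypothetical helper for illustration; not a function in Telemetry.cpp.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Compare the low 64 bits of the trace ID with a bound derived from the ratio.
    low_64_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_64_bits < round(ratio * (1 << 64))


always = should_sample("f" * 32, 1.0)          # ratio 1.0 keeps every trace
never = should_sample("0" * 32, 0.0)           # ratio 0.0 keeps none
low_id = should_sample("0" * 31 + "1", 0.5)    # small ID falls under the bound
high_id = should_sample("f" * 32, 0.5)         # large ID falls above the bound
```

Because the decision depends only on the trace ID, every service that sees the same trace reaches the same verdict, so a trace is kept or dropped as a whole.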
Within each function involved in handling incoming API calls, the code obtains the trace provider using getTracer. This gets a shared_ptr to a tracer object that can generate spans associated with handling an incoming API call.
- If the function is directly handling an incoming API request (e.g. the web framework routes incoming HTTP requests to this function), it will then use setWebSpanAttributes and getWebSpanOptions to get attributes and options for the span.
- These options and attributes are then passed to the StartSpan method of the tracer in order to create a new span that will cover the work done by this function.
- populateSpan is called right after the span is generated to add various information (client IP, HTTP method, etc.) about the incoming HTTP request to the span.
- If an error occurs within the function, setWebSpanError is used to populate the span with error information to aid in debugging.
- Finally, the span’s End() method is used to close the span and send it to the OpenTelemetry collector.
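The control flow described in the list above can be sketched as follows. This is a Python stand-in written for illustration only, not the C++ code from the SLATE API server: `SketchSpan`, `handle_request`, the attribute keys, and the sample request are all hypothetical, and the comments note which step each part mirrors.

```python
import time


class SketchSpan:
    """Stand-in for a tracer span; not the OpenTelemetry C++ API."""

    def __init__(self, name, attributes):
        self.name = name
        self.attributes = dict(attributes)
        self.status = "unset"
        self.start = time.monotonic()
        self.end_time = None

    def end(self):
        self.end_time = time.monotonic()


def handle_request(request, work):
    # 1. Build attributes and start the span (the role StartSpan plays).
    span = SketchSpan(request["path"], {"span.kind": "server"})
    # 2. Record request details right away (the role populateSpan plays).
    span.attributes.update(
        {"http.method": request["method"], "client.ip": request["client_ip"]}
    )
    try:
        return work(request), span
    except Exception as exc:
        # 3. On failure, attach error details (the role setWebSpanError plays).
        span.status = "error"
        span.attributes["error.message"] = str(exc)
        return None, span
    finally:
        # 4. Close the span on every exit path (the role End() plays).
        span.end()


def failing(_request):
    raise RuntimeError("cluster unreachable")


request = {"method": "GET", "path": "/clusters", "client_ip": "192.0.2.10"}
ok_result, ok_span = handle_request(request, lambda r: "listing")
bad_result, bad_span = handle_request(request, failing)
```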
If the function that is being run is not directly processing an incoming API call, it does something slightly different to generate a span.
- The getTracer function is still called to get a shared_ptr to a tracer object.
- setInternalSpanAttributes and getInternalSpanOptions are used to get the options and attributes for the span.
- These are then used when creating a new span using the StartSpan method of the tracer object.
- If an error occurs during the function call, setSpanError is used to set the appropriate fields in the span to aid in debugging.
- Finally, the span’s End() method is called just before the function exits.
The SLATE Portal uses Python and Flask to provide a web interface for SLATE. Unlike with C++, OpenTelemetry provides a method to auto-instrument Python and Flask code so traces are automatically generated. This is achieved by deploying an Instrumentation custom resource alongside the SLATE Portal Kubernetes pods.
We apply a custom resource that sets the trace endpoint as well as the auto-instrumentation that should be used.
    apiVersion: opentelemetry.io/v1alpha1
    kind: Instrumentation
    metadata:
      name: slate-instrumentation
    spec:
      exporter:
        endpoint: http://injection-collector-collector.development.svc.cluster.local:4318
      propagators:
        - tracecontext
        - baggage
        - b3
      python:
        image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
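The tracecontext propagator listed above is what lets a span in the Portal and a span in the API server share one trace: the trace ID travels between services in a traceparent HTTP header defined by the W3C Trace Context specification. The sketch below parses and formats the version 00 layout of that header. The function names are hypothetical, and the auto-instrumentation does this for you, so the code is only meant to show what is on the wire.

```python
import re

# version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the fields of a traceparent header, or None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are reserved as invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def format_traceparent(trace_id, span_id, sampled):
    """Build the header an outgoing request would carry."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(header)
rebuilt = format_traceparent(parsed["trace_id"], parsed["parent_id"], parsed["sampled"])
```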
We add annotations to the Kubernetes pod that runs the Portal to indicate that an OpenTelemetry collector sidecar should be deployed and that Python auto-instrumentation should be injected.
    metadata:
      annotations:
        sidecar.opentelemetry.io/inject: "injection-collector"
        instrumentation.opentelemetry.io/inject-python: "true"
A custom resource is used to automatically deploy a collector in the same Kubernetes namespace as the Portal pods. This collector gathers traces from the Portal and forwards them to a central collector.
    apiVersion: opentelemetry.io/v1alpha1
    kind: OpenTelemetryCollector
    metadata:
      name: injection-collector
    spec:
      config: |
        receivers:
          otlp:
            protocols:
              grpc:
              http:
        processors:
          memory_limiter:
            check_interval: 1s
            limit_percentage: 75
            spike_limit_percentage: 15
          batch:
            send_batch_size: 10000
            timeout: 10s
        exporters:
          logging:
          otlphttp:
            endpoint: opentel.collector.dns
        service:
          pipelines:
            traces:
              receivers: [otlp]
              processors:
              exporters: [logging, otlphttp]
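The batch settings in the collector configuration (send_batch_size and timeout) describe a common pattern: forward a batch once it is full, or once it has waited long enough, whichever comes first. The sketch below is a simplified model of that behavior, not the collector's implementation. `SketchBatcher` is a hypothetical name, time is passed in explicitly to keep the example deterministic, and the timeout is measured from the first item in the batch, which is an assumption.

```python
class SketchBatcher:
    """Flush when the batch is full or when its oldest item has waited too long."""

    def __init__(self, send, send_batch_size=10000, timeout_s=10.0):
        self._send = send            # callable standing in for the exporter
        self._size = send_batch_size
        self._timeout = timeout_s
        self._items = []
        self._first_at = None        # arrival time of the oldest queued item

    def add(self, item, now):
        if not self._items:
            self._first_at = now
        self._items.append(item)
        if len(self._items) >= self._size:
            self._flush()

    def tick(self, now):
        """Call periodically; flushes a partial batch once the timeout passes."""
        if self._items and now - self._first_at >= self._timeout:
            self._flush()

    def _flush(self):
        batch, self._items = self._items, []
        self._first_at = None
        self._send(batch)


sent = []
batcher = SketchBatcher(sent.append, send_batch_size=3, timeout_s=10.0)
for i in range(4):
    batcher.add(f"span-{i}", now=0.0)   # the third add fills and sends a batch
batcher.tick(now=5.0)                   # too early, nothing is sent
batcher.tick(now=10.0)                  # timeout reached, the partial batch goes out
```

A small batch size is used here only to keep the example short; the configuration above uses 10000 spans and a 10 second timeout.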
Using Signoz, we can:
- Examine operations on the SLATE Portal or within the SLATE API server.
- Search for interactions based on criteria like username, the cluster being worked on, errors, HTTP codes (e.g. 403), as well as the time taken to handle API calls.
In short, Signoz allows us to find anomalous calls taking more time than usual or to find API calls that result in elevated error rates (e.g. due to a problem with a SLATE cluster).
Signoz also allows us to automatically send alerts to team Slack channels when incoming traces indicate that certain API calls result in elevated error rates or require significantly more time than usual to process a request. This allows us to proactively investigate potential issues.
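The kind of rule behind such an alert can be sketched in a few lines. This is an illustration only: the thresholds, the span fields, and the function names are assumptions, and Signoz evaluates its alert rules with its own query engine rather than with code like this.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)   # ceiling division
    return ordered[rank - 1]


def evaluate(spans, max_error_rate, max_p95_ms):
    """Return the names of the alert conditions that a window of spans trips."""
    if not spans:
        return []
    errors = sum(1 for span in spans if span["status_code"] >= 500)
    alerts = []
    if errors / len(spans) > max_error_rate:
        alerts.append("error_rate")
    if percentile([span["duration_ms"] for span in spans], 95) > max_p95_ms:
        alerts.append("latency")
    return alerts


# A window of 20 calls to one endpoint: three server errors and two slow calls.
window = [{"status_code": 200, "duration_ms": 100} for _ in range(15)]
window += [{"status_code": 500, "duration_ms": 100} for _ in range(3)]
window += [{"status_code": 200, "duration_ms": 900} for _ in range(2)]

alerts = evaluate(window, max_error_rate=0.05, max_p95_ms=500)
healthy = evaluate(window[:15], max_error_rate=0.05, max_p95_ms=500)
```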
Although adding OpenTelemetry to the SLATE infrastructure required large changes to our codebase and to our infrastructure, the resulting improvements in observability and debugging drastically improved our ability to monitor and respond to problems within the SLATE infrastructure.