OpenTelemetry in SLATE

To get better observability of the SLATE components, we added OpenTelemetry reporting to the SLATE infrastructure.

Background

OpenTelemetry is a collection of tools and SDKs that allow developers to collect information about their applications’ runtime behavior, such as the time required to handle user requests and the error rates encountered while processing them. It can correlate and combine this information across multiple services to generate a unified view of each user interaction.

We instrumented the SLATE services with OpenTelemetry to better track user interactions and to help debug any errors that occur.

OpenTelemetry

OpenTelemetry is an observability framework: a set of tools and SDKs that together allow a user’s end-to-end interactions to be recorded even when an interaction spans multiple services such as web portals, API services, and databases.

Generally, OpenTelemetry is used by instrumenting applications to collect telemetry data (traces, metrics, and logs). This data is sent to a collector, which processes it and stores it in a time series database, and a frontend then presents the stored information to administrators.

Traces and spans

Traces are the primary OpenTelemetry signal that we are concerned with. A trace records all of the activity that occurs during a single user interaction and is broken down into atomic units of work called spans. A span might represent a SQL query run against a database, a call to a microservice, or an operation on a storage device.
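
To make this concrete, here is a minimal sketch (not SLATE code) of wrapping a single unit of work in a span with the opentelemetry-cpp API; the function name, tracer name, and attribute key are illustrative:

    #include <opentelemetry/trace/provider.h>
    #include <string>

    namespace trace_api = opentelemetry::trace;

    // Illustrative only: wrap one database query in a span.
    void runQuery(const std::string &sql) {
        auto provider = trace_api::Provider::GetTracerProvider();
        auto tracer = provider->GetTracer("example-service");
        auto span = tracer->StartSpan("db.query");     // one atomic unit of work
        auto scope = tracer->WithActiveSpan(span);     // make it the active span
        span->SetAttribute("db.statement", sql);
        // ... execute the query here ...
        span->End();                                   // close the span
    }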

Traces are usually generated by a trace provider that is integrated into a service through an SDK. Providers exist for many languages, such as Python, C++, JavaScript, and Java. In SLATE, we use the C++ and Python providers.

Trace providers will:

  1. Aggregate spans created in an application
  2. Bundle the aggregated spans into traces
  3. Send the resulting traces to a collector for further processing (sketched below)
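
As a rough illustration of those three responsibilities, the sketch below is adapted from the upstream opentelemetry-cpp examples rather than taken from SLATE; it wires a batching span processor and an OTLP/HTTP exporter into a globally registered trace provider. The endpoint URL is a placeholder, and factory and header names vary between SDK releases:

    #include <opentelemetry/exporters/otlp/otlp_http_exporter_factory.h>
    #include <opentelemetry/sdk/trace/batch_span_processor_factory.h>
    #include <opentelemetry/sdk/trace/batch_span_processor_options.h>
    #include <opentelemetry/sdk/trace/tracer_provider_factory.h>
    #include <opentelemetry/trace/provider.h>
    #include <memory>
    #include <utility>

    namespace otlp = opentelemetry::exporter::otlp;
    namespace trace_sdk = opentelemetry::sdk::trace;
    namespace trace_api = opentelemetry::trace;

    // Illustrative only: batch spans and ship them to an OTLP/HTTP collector.
    void initTracing() {
        otlp::OtlpHttpExporterOptions exporterOptions;
        exporterOptions.url = "http://collector.example:4318/v1/traces";  // placeholder

        auto exporter = otlp::OtlpHttpExporterFactory::Create(exporterOptions);
        trace_sdk::BatchSpanProcessorOptions batchOptions;  // bundles spans before export
        auto processor =
            trace_sdk::BatchSpanProcessorFactory::Create(std::move(exporter), batchOptions);

        std::shared_ptr<trace_api::TracerProvider> provider =
            trace_sdk::TracerProviderFactory::Create(std::move(processor));
        trace_api::Provider::SetTracerProvider(provider);   // register globally
    }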

Collectors

OpenTelemetry uses collectors to receive and process traces from trace providers. A collector provides a centralized location for collecting traces from multiple trace providers and combines related traces so that interactions across different services and sources can be associated with a single user interaction.

A collector can also apply additional processing to the traces it receives and store the processed traces in persistent storage (usually a time series database such as ClickHouse).

SigNoz

We chose SigNoz to handle the duties of storing and presenting traces. SigNoz is an open-source platform for presenting OpenTelemetry data and provides:

  • A Helm chart for deploying an OpenTelemetry collector and a ClickHouse database to store traces, metrics, and logs.
  • Alerting and monitoring of services based on traces received.

Instrumenting SLATE

C++

The SLATE API server is written in C++. To instrument this component, we incorporated the OpenTelemetry C++ client to generate and send traces. The process was somewhat tedious but relatively straightforward.

The core of the OpenTelemetry code is located in Telemetry.cpp. The initializeTracer function is called when the server starts up. It takes the server’s configuration settings and initializes a trace provider for the API server with the appropriate collector endpoint, sampling parameters, and other settings.
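
The body of initializeTracer is not reproduced here, but a hedged sketch of how the sampling and service-identity pieces might be wired with the opentelemetry-cpp SDK is shown below. The ServerConfig struct and makeProvider function are hypothetical stand-ins for the real configuration handling, and the factory names vary between SDK releases:

    #include <opentelemetry/sdk/resource/resource.h>
    #include <opentelemetry/sdk/trace/processor.h>
    #include <opentelemetry/sdk/trace/sampler.h>
    #include <opentelemetry/sdk/trace/samplers/parent_factory.h>
    #include <opentelemetry/sdk/trace/samplers/trace_id_ratio_factory.h>
    #include <opentelemetry/sdk/trace/tracer_provider_factory.h>
    #include <memory>
    #include <string>
    #include <utility>

    namespace trace_sdk = opentelemetry::sdk::trace;
    namespace resource = opentelemetry::sdk::resource;

    // Hypothetical stand-in for the API server's configuration settings.
    struct ServerConfig {
        std::string collectorEndpoint;
        double samplingRatio = 1.0;
    };

    // Illustrative only: build a tracer provider whose sampling behavior and
    // service identity are driven by server settings.
    std::unique_ptr<trace_sdk::TracerProvider> makeProvider(
        const ServerConfig &config, std::unique_ptr<trace_sdk::SpanProcessor> processor) {
        // Sample a configurable fraction of new traces, but always follow the
        // sampling decision of a parent span when one exists.
        std::shared_ptr<trace_sdk::Sampler> ratio =
            trace_sdk::TraceIdRatioBasedSamplerFactory::Create(config.samplingRatio);
        auto sampler = trace_sdk::ParentBasedSamplerFactory::Create(ratio);

        // Identify the service so its spans can be grouped together in SigNoz.
        auto serviceResource =
            resource::Resource::Create({{"service.name", "slate-api-server"}});

        return trace_sdk::TracerProviderFactory::Create(
            std::move(processor), serviceResource, std::move(sampler));
    }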

Within each function involved in handling incoming API calls, the code obtains a tracer using getTracer, which returns a shared_ptr to a tracer object that can generate spans associated with the API call being handled (a sketch of this flow follows the list below).

  1. If the function is directly handling an incoming API request (e.g. the web framework routes incoming HTTP requests to this function), it will then use setWebSpanAttributes and getWebSpanOptions to get attributes and options for the span.
  2. These options and attributes are then passed to the StartSpan method of the tracer in order to create a new span that will cover the work done by this function.
  3. populateSpan is called right after the span is generated to add various information (client IP, HTTP method, etc.) about the incoming HTTP request to the span.
  4. If an error occurs within the function, setWebSpanError is used to populate the span with error information to aid in debugging.
  5. Finally, the span’s End() method is used to close the span and send it to the OpenTelemetry collector.
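
Taken together, a web-facing handler instrumented along these lines might look roughly like the sketch below. The handler name and the request/response types are hypothetical, and plain opentelemetry-cpp calls stand in for the Telemetry.cpp helpers (getTracer, getWebSpanOptions, setWebSpanAttributes, populateSpan, setWebSpanError), whose exact signatures are not reproduced here:

    #include <opentelemetry/trace/provider.h>
    #include <opentelemetry/trace/tracer.h>
    #include <exception>
    #include <string>

    namespace trace_api = opentelemetry::trace;

    // Hypothetical request/response types standing in for the web framework's own.
    struct HttpRequest { std::string method, target, clientIp; };
    struct HttpResponse { int code; std::string body; };

    HttpResponse handleListClusters(const HttpRequest &req) {
        // cf. getTracer: obtain the API server's tracer
        auto tracer = trace_api::Provider::GetTracerProvider()->GetTracer("slate-api-server");

        // cf. getWebSpanOptions / setWebSpanAttributes: mark the span as server-facing
        trace_api::StartSpanOptions options;
        options.kind = trace_api::SpanKind::kServer;
        auto span = tracer->StartSpan("listClusters", options);
        auto scope = tracer->WithActiveSpan(span);

        // cf. populateSpan: record request details for later searching
        span->SetAttribute("http.method", req.method);
        span->SetAttribute("http.target", req.target);
        span->SetAttribute("client.address", req.clientIp);

        try {
            // ... authenticate the caller and assemble the response ...
        } catch (const std::exception &err) {
            // cf. setWebSpanError: attach the failure to the span for debugging
            span->SetStatus(trace_api::StatusCode::kError, err.what());
            span->End();
            return {500, err.what()};
        }

        span->End();  // close the span; it is then exported to the collector
        return {200, "{}"};
    }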

If the function being run is not directly processing an incoming API call, the span is generated slightly differently (a compact sketch follows the list).

  1. The getTracer function is still called to get a shared_ptr to a tracer object.
  2. setInternalSpanAttributes and getInternalSpanOptions are used to get the options and attributes for the span.
  3. These are then used when creating a new span using the StartSpan method of the tracer object.
  4. If an error occurs during the function call, setSpanError is used to set the appropriate fields in the span to aid in debugging.
  5. Finally, the span’s End() method is called just before the function exits.
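
A compact sketch of the same pattern for an internal operation follows, reusing the includes and the trace_api alias from the previous example; the function name and the slate.cluster_id attribute are hypothetical, and plain SDK calls again stand in for getInternalSpanOptions, setInternalSpanAttributes, and setSpanError:

    // Illustrative only: an internal, non-HTTP-facing unit of work.
    void syncClusterRecord(const std::string &clusterID) {
        auto tracer = trace_api::Provider::GetTracerProvider()->GetTracer("slate-api-server");

        // cf. getInternalSpanOptions / setInternalSpanAttributes
        trace_api::StartSpanOptions options;
        options.kind = trace_api::SpanKind::kInternal;
        auto span = tracer->StartSpan("syncClusterRecord", options);
        auto scope = tracer->WithActiveSpan(span);
        span->SetAttribute("slate.cluster_id", clusterID);  // hypothetical attribute name

        try {
            // ... perform the internal work (database updates, etc.) ...
        } catch (const std::exception &err) {
            span->SetStatus(trace_api::StatusCode::kError, err.what());  // cf. setSpanError
        }

        span->End();  // close the span just before the function returns
    }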

Python

The SLATE Portal uses Python and Flask to provide a web interface for SLATE. Unlike with C++, OpenTelemetry provides a way to auto-instrument Python and Flask code so that traces are generated automatically. This is achieved by applying OpenTelemetry Operator custom resources alongside the SLATE Portal’s Kubernetes pods.

  1. We apply an Instrumentation resource that sets the trace endpoint as well as the auto-instrumentation image that should be used.

    apiVersion: opentelemetry.io/v1alpha1
    kind: Instrumentation
    metadata:
      name: slate-instrumentation
    spec:
      exporter:
        endpoint: http://injection-collector-collector.development.svc.cluster.local:4318
      propagators:
        - tracecontext
        - baggage
        - b3
      python:
        image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
    
  2. We annotate the Kubernetes pod that runs the Portal to indicate that an OpenTelemetry collector sidecar should be injected and that Python auto-instrumentation should be applied.

    metadata:
      annotations:
        sidecar.opentelemetry.io/inject: "injection-collector"
        instrumentation.opentelemetry.io/inject-python: "true"
    
  3. An OpenTelemetryCollector resource is used to automatically deploy a collector in the same Kubernetes namespace as the Portal pods. This collector receives traces from the Portal and forwards them to a central collector.

    apiVersion: opentelemetry.io/v1alpha1
    kind: OpenTelemetryCollector
    metadata:
      name: injection-collector
    spec:
      config: |
        receivers:
          otlp:
            protocols:
              grpc:
              http:
        processors:
          memory_limiter:
            check_interval: 1s
            limit_percentage: 75
            spike_limit_percentage: 15
          batch:
            send_batch_size: 10000
            timeout: 10s
        exporters:
          logging:
          otlphttp:
            endpoint: opentel.collector.dns
        service:
          pipelines:
            traces:
              receivers: [otlp]
              processors: [memory_limiter, batch]
              exporters: [logging, otlphttp]
    

Observability

Using SigNoz, we can:

  • Examine operations on the SLATE Portal or within the SLATE API server.
  • Search for interactions by criteria such as username, the cluster being worked on, errors, HTTP status codes (e.g. 200, 403, 500), and the time taken to handle API calls.

In short, SigNoz allows us to find anomalous calls that take longer than usual, or API calls that result in elevated error rates (e.g. due to a problem with a SLATE cluster).

Adding monitoring

SigNoz also allows us to automatically send alerts to team Slack channels when incoming traces indicate that certain API calls are producing elevated error rates or taking significantly longer than usual to process. This lets us proactively investigate potential issues.

Conclusion

Although adding OpenTelemetry required substantial changes to our codebase and deployment, the resulting gains in observability and debugging have drastically improved our ability to monitor and respond to problems within the SLATE infrastructure.

Suchandra Thapa