Build Telemetry for Distributed Services: An Introduction to OpenTracing

Official site: https://opentracing.io/

What is Distributed Tracing?

Distributed tracing, also called distributed request tracing, is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.

Who Uses Distributed Tracing?

IT and DevOps teams can use distributed tracing to monitor applications. Distributed tracing is particularly well-suited to debugging and monitoring modern distributed software architectures, such as microservices.

Developers can use distributed tracing to help debug and optimize their code.

What is OpenTracing?

It is probably easier to start with what OpenTracing is NOT.

  • OpenTracing is not a download or a program. Distributed tracing requires that software developers add instrumentation to the code of an application, or to the frameworks used in the application.

  • OpenTracing is not a standard. The Cloud Native Computing Foundation (CNCF) is not an official standards body. The OpenTracing API project is working towards creating more standardized APIs and instrumentation for distributed tracing.

OpenTracing consists of an API specification, frameworks and libraries that have implemented the specification, and documentation for the project. OpenTracing allows developers to add instrumentation to their application code using APIs that do not lock them into any one particular product or vendor.

For more information about where OpenTracing has already been implemented, see the list of languages and the list of tracers that support the OpenTracing specification.

Concepts and Terminology

All language-specific OpenTracing APIs share some core concepts and terminology. These concepts are so central and important to the project that they have their own repository (github.com/opentracing/specification) and semver scheme.

  1. The OpenTracing Semantic Specification is a versioned description of the current pan-language OpenTracing standard
  2. The Semantic Conventions spec describes conventional Span tags and log keys for common semantic scenarios

Both files are versioned and the GitHub repository is tagged according to the rules described by the versioning policy.

Spans

What is a Span?

The “span” is the primary building block of a distributed trace, representing an individual unit of work done in a distributed system.

Each component of the distributed system contributes a span - a named, timed operation representing a piece of the workflow.

Spans can (and generally do) contain “References” to other spans, which allows multiple Spans to be assembled into one complete Trace - a visualization of the life of a request as it moves through a distributed system.

Each span encapsulates the following state according to the OpenTracing specification:

  • An operation name
  • A start timestamp and finish timestamp
  • A set of key:value span Tags
  • A set of key:value span Logs
  • A SpanContext
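As a rough sketch of where each of these pieces of state comes from, here is a minimal Python example using the opentracing reference API; the no-op tracer stands in for a real implementation, and the tag, log, and baggage values are purely illustrative:

import time
import opentracing

tracer = opentracing.Tracer()  # no-op tracer; a real tracer would record and export spans

# Operation name and start timestamp are fixed when the span starts.
span = tracer.start_span(operation_name='db_query', start_time=time.time())
span.set_tag('db.instance', 'customers')           # key:value span Tags
span.log_kv({'event': 'query_started'})            # key:value span Logs
span.set_baggage_item('special_id', 'vsid1738')    # Baggage lives on the SpanContext
span_context = span.context                        # the SpanContext itself
span.finish()                                      # records the finish timestamp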

Tags

Tags are key:value pairs that enable user-defined annotation of spans in order to query, filter, and comprehend trace data.

Span tags should apply to the whole span. The semantic_conventions.md file lists conventional span tags for common scenarios. Examples include tag keys like db.instance to identify a database host, http.status_code to represent the HTTP response code, or error, which can be set to true if the operation represented by the Span fails.
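As a small illustration, a Python sketch using the opentracing API with the no-op tracer (the tag values are hypothetical):

import opentracing

tracer = opentracing.Tracer()  # no-op tracer, used here only so the snippet runs

with tracer.start_active_span('handle_request') as scope:
    scope.span.set_tag('http.status_code', 200)     # conventional key from semantic_conventions.md
    scope.span.set_tag('db.instance', 'customers')  # identifies the database host
    scope.span.set_tag('error', False)               # set to True if the operation fails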

Logs

Logs are key:value pairs that are useful for capturing span-specific logging messages and other debugging or informational output from the application itself. Logs may be useful for documenting a specific moment or event within the span (in contrast to tags which should apply to the span as a whole).
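A minimal Python sketch of logging an event at a specific moment inside a span (no-op tracer, illustrative values):

import time
import opentracing

tracer = opentracing.Tracer()  # no-op tracer for illustration

with tracer.start_active_span('db_query') as scope:
    # log_kv records a timestamped event within this span.
    scope.span.log_kv(
        {'event': 'retry', 'message': "Can't connect to mysql server on '127.0.0.1'(10061)"},
        timestamp=time.time(),
    )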

SpanContext

The SpanContext carries data across process boundaries. Specifically, it has two major components:

  • An implementation-dependent state to refer to the distinct span within a trace
    • i.e., the implementing Tracer’s definition of spanID and traceID
  • Any Baggage Items
    • These are key:value pairs that cross process-boundaries.
    • These may be useful to have some data available for access throughout the trace.
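A sketch of reading the SpanContext and attaching a baggage item in Python; with the no-op tracer the baggage read returns None, while a real tracer propagates it across process boundaries:

import opentracing

tracer = opentracing.Tracer()  # no-op tracer for illustration

with tracer.start_active_span('checkout') as scope:
    # Baggage items ride on the SpanContext and cross process boundaries.
    scope.span.set_baggage_item('special_id', 'vsid1738')
    ctx = scope.span.context  # holds the tracer-defined span/trace identifiers plus baggage
    value = scope.span.get_baggage_item('special_id')  # 'vsid1738' with a real tracer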

Example Span:

    t=0            operation name: db_query               t=x

     +-----------------------------------------------------+
     | · · · · · · · · · ·    Span     · · · · · · · · · · |
     +-----------------------------------------------------+

Tags:
- db.instance: "jdbc:mysql://127.0.0.1:3306/customers"
- db.statement: "SELECT * FROM mytable WHERE foo='bar';"

Logs:
- message:"Can't connect to mysql server on '127.0.0.1'(10061)"

SpanContext:
- trace_id:"abc123"
- span_id:"xyz789"
- Baggage Items:
  - special_id:"vsid1738"


Scopes and Threading

Introduction

In any given thread there is an “active” span primarily responsible for the work accomplished by the surrounding application code, called the ActiveSpan. The OpenTracing API allows for only one span in a thread to be active at any point in time. This is managed using a Scope, which formalizes the activation and deactivation of a Span.

Other spans that are involved with the same thread will satisfy all of the following conditions:

  • Started
  • Not finished
  • Not “active”

For example, there can be multiple spans on the same thread, if the spans are:

  • Waiting for I/O
  • Blocked on a child Span
  • Off of the critical path

Note that if a Scope exists when the developer creates a new Span, the active Span will act as the new Span's parent, unless the programmer invokes ignoreActiveSpan() at buildSpan() time or specifies the parent context explicitly.
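The Java-flavored names above (buildSpan(), ignoreActiveSpan()) map onto start_active_span() and the ignore_active_span flag in the Python API; a minimal sketch of the implicit parenting behaviour, using the no-op tracer:

import opentracing

tracer = opentracing.Tracer()  # no-op tracer; real tracers follow the same activation rules

with tracer.start_active_span('parent_operation') as parent_scope:
    # While parent_scope is active, new spans implicitly become its children.
    with tracer.start_active_span('child_operation'):
        pass  # ChildOf parent_operation

    # Opt out of implicit parenting with ignore_active_span.
    with tracer.start_active_span('detached_operation', ignore_active_span=True):
        pass  # starts a fresh trace, unrelated to parent_operation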

Accessing the Current Active Span

Because it is inconvenient to pass an active Span from function to function manually, OpenTracing requires that every Tracer contain a ScopeManager. The ScopeManager API grants access to the active Span through a Scope. This means that a developer can access any active Span through a Scope.
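In the Python API this looks roughly like the following (a sketch with the no-op tracer; tracer.active_span is a convenience equivalent to scope_manager.active.span):

import opentracing

tracer = opentracing.Tracer()  # no-op tracer for illustration

with tracer.start_active_span('handle_request'):
    scope = tracer.scope_manager.active   # the currently active Scope (or None)
    if scope is not None:
        scope.span.set_tag('component', 'example-service')

    span = tracer.active_span             # shortcut to the active Span (or None)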

Moving a span between threads

Using the ScopeManager API, a developer can transfer a Span between threads. A Span's lifetime might start in one thread and end in another. The ScopeManager API allows the Span to be transferred to another thread or callback; passing Scopes to another thread or callback is not supported. For more details, refer to the language-specific documentation.
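A rough Python sketch of the pattern: the Span object is handed to the worker thread and re-activated there with ScopeManager.activate(), while the original thread's Scope is never shared (the operation name and threading setup are illustrative):

import threading
import opentracing

tracer = opentracing.Tracer()  # no-op tracer for illustration

def worker(span):
    # Re-activate the span on this thread; finish it when the scope closes.
    with tracer.scope_manager.activate(span, finish_on_close=True):
        pass  # ... do the actual work for this span here ...

span = tracer.start_span('async_operation')  # lifetime starts on the main thread
t = threading.Thread(target=worker, args=(span,))
t.start()
t.join()  # the span is finished on the worker thread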

Tags, logs and baggage

Tags

Tags are key:value pairs that enable user-defined annotation of spans in order to query, filter, and comprehend trace data.

Span tags should apply to the whole span. The semantic_conventions.md file lists conventional span tags for common scenarios. Examples include tag keys like db.instance to identify a database host, http.status_code to represent the HTTP response code, or error, which can be set to true if the operation represented by the Span fails.

Logs

Logs are key:value pairs that are useful for capturing timed log messages and other debugging or informational output from the application itself. Logs may be useful for documenting a specific moment or event within the span (in contrast to tags which should apply to the span regardless of time).

Baggage Items

The SpanContext carries data across process boundaries. Specifically, it has two major components:

  • An implementation-dependent state to refer to the distinct span within a trace
    • i.e., the implementing Tracer’s definition of spanID and traceID
  • Any Baggage Items
    • These are key:value pairs that cross process-boundaries.
    • These may be useful to have some data available for access throughout the trace.

Tracers

Introduction

OpenTracing provides an open, vendor-neutral standard API for describing distributed transactions, specifically causality, semantics and timing. It provides a general purpose distributed context propagation framework, consisting of API primitives for:

  • passing the metadata context in-process
  • encoding and decoding the metadata context for transmitting it over the network for inter-process communications
  • causality tracking: parent-child, forks, joins

OpenTracing abstracts away the differences among numerous tracer implementations. This means that instrumentation remains the same irrespective of the tracer system the developer uses. In order to instrument an application using the OpenTracing specification, a compatible OpenTracing tracer must be deployed. A list of all the supported tracers is available here.

Tracer Interface

The Tracer interface creates Spans and understands how to Inject (serialize) and Extract (deserialize) their metadata across process boundaries. It has the following capabilities:

  • Start a new Span
  • Inject a SpanContext into a carrier
  • Extract a SpanContext from a carrier

Each of these will be discussed in more detail below. For implementation purposes, check out the specific language guide.

Setting up a Tracer

Tracer is the actual implementation that will record the Spans and publish them somewhere. How an application handles the actual Tracer is up to the developer: either consume it directly throughout the application or store it in the GlobalTracer for easier usage with instrumented frameworks.

Different Tracer implementations vary in how and what parameters they receive at initialization time, such as:

  • Component name for this application’s traces.
  • Tracing endpoint.
  • Tracing credentials.
  • Sampling strategy.

Once a Tracer instance is obtained, it can be used to manually create Spans, or passed to existing instrumentation for frameworks and libraries.

In order to not force the user to keep a Tracer around, the io.opentracing.util artifact includes a helper GlobalTracer class implementing the io.opentracing.Tracer interface, which, as the name implies, acts as a global instance that can be used from anywhere. It works by forwarding all operations to another underlying Tracer that will get registered at some future point.

By default, the underlying Tracer is a no-op implementation.
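In the Python API the analogous helpers are opentracing.set_global_tracer() and opentracing.global_tracer(). The sketch below registers a concrete tracer; Jaeger's Python client is used purely as an example of one implementation, and its configuration values are illustrative:

import opentracing
from jaeger_client import Config  # any OpenTracing-compatible tracer can be used instead

# Initialization parameters differ per implementation: component name,
# endpoint, credentials, sampling strategy, and so on.
config = Config(
    config={'sampler': {'type': 'const', 'param': 1}},
    service_name='my-service',
)
tracer = config.initialize_tracer()

# Register it once; instrumented frameworks and libraries can then find it anywhere.
opentracing.set_global_tracer(tracer)

span = opentracing.global_tracer().start_span('startup')
span.finish()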

Starting a new Trace

A new trace is started whenever a new Span is created without references to a parent Span. When creating a new Span, you need to specify an “operation name”, which is a free-format string that you can use to help identify the code this Span relates to. The next Span in our new trace will probably be a child Span and can be seen as a representation of a sub-routine executed “within” the main Span. This child Span therefore has a ChildOf relationship with the parent. Another type of relationship is FollowsFrom, used in special cases where the new Span is independent of the parent Span, such as in asynchronous processes.
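A Python sketch of the three cases described above (no-op tracer; operation names are illustrative):

import opentracing

tracer = opentracing.Tracer()  # no-op tracer for illustration

# No parent reference: this span starts a brand-new trace.
root = tracer.start_span('main_request')

# ChildOf: a sub-routine executed "within" the parent.
child = tracer.start_span('db_query', child_of=root)

# FollowsFrom: independent of the parent, e.g. asynchronous follow-up work.
follower = tracer.start_span(
    'send_confirmation_email',
    references=[opentracing.follows_from(root.context)],
)

follower.finish()
child.finish()
root.finish()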

Accessing the Active Span

The Tracer can be used to access the ActiveSpan. ActiveSpans can also be accessed through a ScopeManager in some languages. Refer to the specific language guide for more implementation details.

Propagating a Trace with Inject/Extract

In order to trace across process boundaries in distributed systems, services need to be able to continue the trace injected by the client that sent each request. OpenTracing allows this to happen by providing inject and extract methods that encode a span’s context into a carrier. The inject method writes the SpanContext into a carrier, for example by adding the trace information to the client’s request so that the server it is sent to can continue the trace. The extract method does the exact opposite: it extracts the SpanContext from the carrier. For example, when handling a request sent by a client, the developer extracts the SpanContext using the io.opentracing.Tracer.extract method.

Tracing Systems

The following table lists all currently known OpenTracing Tracers:

Tracing system    Supported languages
CNCF Jaeger       Java, Go, Python, Node.js, C++, C#
Datadog           Go
inspectIT         Java
Instana           Crystal, Go, Java, Node.js, Python, Ruby
LightStep         Go, Python, JavaScript, Objective-C, Java, PHP, Ruby, C++
stagemonitor      Java

Inject and extract

Programmers adding tracing support across process boundaries must understand the Tracer.Inject(...) and Tracer.Extract(...) capabilities of the OpenTracing specification. They are conceptually powerful, allowing the programmer to write correct and general cross-process propagation code without being bound to a particular OpenTracing implementation; that said, with great power comes great opportunity for confusion. :)

This document provides a concise summary of the design and proper use of Inject and Extract, regardless of the particular OpenTracing language or OpenTracing implementation.

“The Big Picture” for explicit trace propagation

The hardest thing about distributed tracing is the distributed part. Any tracing system needs a way of understanding the causal relationship between activities in many distinct processes, whether they be connected via formal RPC frameworks, pub-sub systems, generic message queues, direct HTTP calls, best-effort UDP packets, or something else entirely.

Some distributed tracing systems (e.g., Project5 from 2003, or WAP5 from 2006 or The Mystery Machine from 2014) infer causal relationships across process boundaries. Of course there is a tradeoff between the apparent convenience of these black-box inference-based approaches and the freshness and quality of the assembled traces. Per the concern about quality, OpenTracing is an explicit distributed tracing instrumentation standard, and as such it is much better-aligned with approaches like X-Trace from 2007, Dapper from 2010, or numerous open-source tracing systems like Zipkin or Jaeger (among others).

Together, Inject and Extract allow for inter-process trace propagation without tightly coupling the programmer to a particular OpenTracing implementation.

Requirements for the OpenTracing propagation scheme

For Inject and Extract to be useful, all of the following must be true:

  • Per the above, the OpenTracing user handling cross-process trace propagation must not need to write OpenTracing-implementation-specific code
  • OpenTracing implementations must not need special handlers for every known inter-process communication mechanism: that’s far too much work, and it’s not even well-defined
  • That said, the propagation mechanism should be extensible for optimizations

The basic approach: Inject, Extract, and Carriers

Any SpanContext in a trace may be Injected into what OpenTracing refers to as a Carrier. A Carrier is an interface or data structure that’s useful for inter-process communication (IPC); that is, the Carrier is something that “carries” the tracing state from one process to another. The OpenTracing specification includes two required Carrier formats, though custom Carrier formats are possible as well.

Similarly, given a Carrier, an injected trace may be Extracted, yielding a SpanContext instance which is semantically identical to the one Injected into the Carrier.

Inject pseudocode example

span_context = ...
outbound_request = ...

# We'll use the (builtin) HTTP_HEADERS carrier format. We
# start by using an empty map as the carrier prior to the
# call to `tracer.inject`.
carrier = {}
tracer.inject(span_context, opentracing.Format.HTTP_HEADERS, carrier)

# `carrier` now contains (opaque) key:value pairs which we pass
# along over whatever wire protocol we already use.
for key, value in carrier.items():
    outbound_request.headers[key] = escape(value)

Extract pseudocode example

inbound_request = ...

# We'll again use the (builtin) HTTP_HEADERS carrier format. Per the
# HTTP_HEADERS documentation, we can use a map that has extraneous data
# in it and let the OpenTracing implementation look for the subset
# of key:value pairs it needs.
#
# As such, we directly use the key:value `inbound_request.headers`
# map as the carrier.
carrier = inbound_request.headers
span_context = tracer.extract(opentracing.Format.HTTP_HEADERS, carrier)
# Continue the trace given span_context. E.g.,
span = tracer.start_span("...", child_of=span_context)

# (If `carrier` held trace data, `span` will now be ready to use.)

Carriers have formats

All Carriers have a format. In some OpenTracing languages, the format must be specified explicitly as a constant or string; in others, the format is inferred from the Carrier’s static type information.

Required Inject/Extract Carrier formats

At a minimum, all platforms require OpenTracing implementations to support two Carrier formats: the “text map” format and the “binary” format.

  • The text map Carrier format is a platform-idiomatic map from (unicode) string to string
  • The binary Carrier format is an opaque byte array (and presumably more compact and efficient)

What the OpenTracing implementations choose to store in these Carriers is not formally defined by the OpenTracing specification, but the presumption is that they find a way to encode “tracer state” about the propagated SpanContext (e.g., in Dapper this would include a trace_id, a span_id, and a bitmask representing the sampling status for the given trace) as well as any key:value Baggage items.
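A Python sketch of injecting the same SpanContext into both required formats (plus HTTP_HEADERS, which the reference API also defines); the carriers are a plain dict and a bytearray, and whatever the tracer actually writes into them is implementation-defined:

import opentracing

tracer = opentracing.Tracer()  # swap in a real tracer to see actual encoded state

with tracer.start_active_span('outbound_call') as scope:
    text_carrier = {}             # "text map" format: unicode string -> string
    tracer.inject(scope.span.context, opentracing.Format.TEXT_MAP, text_carrier)

    binary_carrier = bytearray()  # "binary" format: opaque byte array
    tracer.inject(scope.span.context, opentracing.Format.BINARY, binary_carrier)

    http_carrier = {}             # HTTP_HEADERS: like a text map, but header-safe keys/values
    tracer.inject(scope.span.context, opentracing.Format.HTTP_HEADERS, http_carrier)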

Interoperability of OpenTracing implementations across process boundaries

There is no expectation that different OpenTracing implementations Inject and Extract SpanContexts in compatible ways. Though OpenTracing is agnostic about the tracing implementation across an entire distributed system, for successful inter-process handoff it’s essential that the processes on both sides of a propagation use the same tracing implementation.

Custom Inject/Extract Carrier formats

Any propagation subsystem (an RPC library, a message queue, etc.) may choose to introduce its own custom Inject/Extract Carrier format; by preferring its custom format but falling back to a required OpenTracing format as needed, it allows OpenTracing implementations to optimize for the custom format without requiring every implementation to support it.

Some pseudocode will make this less abstract. Imagine that we’re the author of the (sadly fictitious) ArrrPC pirate RPC subsystem, and we want to add OpenTracing support to our outbound RPC requests. Minus some error handling, our pseudocode might look like this:

span_context = ...
outbound_request = ...

# First we try our custom Carrier, the outbound_request itself.
# If the underlying OpenTracing implementation cares to support
# it, this call is presumably more efficient in this process
# and over the wire. But, since this is a non-required format,
# we must also account for the possibility that the OpenTracing
# implementation does not support arrrpc.ARRRPC_OT_CARRIER.
try:
    tracer.inject(span_context, arrrpc.ARRRPC_OT_CARRIER, outbound_request)

except opentracing.UnsupportedFormatException:
    # If unsupported, fall back on a required OpenTracing format.
    carrier = {}
    tracer.inject(span_context, opentracing.Format.HTTP_HEADERS, carrier)
    # `carrier` now contains (opaque) key:value pairs which we
    # pass along over whatever wire protocol we already use.
    for key, value in carrier.items():
        outbound_request.headers[key] = escape(value)

More about custom Carrier formats

The precise representation of the “Carrier formats” may vary from platform to platform, but in all cases they should be drawn from a global namespace. Support for a new custom carrier format must not necessitate changes to the core OpenTracing platform APIs, though each OpenTracing platform API must define the required OpenTracing carrier formats (e.g., string maps and binary blobs). For example, if the maintainer of ArrrPC RPC framework wanted to define an “ArrrPC” Inject/Extract format, she or he must be able to do so without sending a PR to OpenTracing maintainers (though of course OpenTracing implementations are not required to support the “ArrrPC” format). There is an end-to-end injector and extractor example below to make this more concrete.

An end-to-end Inject and Extract propagation example

To make the above more concrete, consider the following sequence:

  1. client process has a SpanContext instance and is about to make an RPC over a home-grown HTTP protocol
  2. That client process calls Tracer.Inject(...), passing the active SpanContext instance, a format identifier for a text map, and a text map Carrier as parameters
  3. Inject has populated the text map in the Carrier; the client application encodes that map within its homegrown HTTP protocol (e.g., as headers)
  4. The HTTP request happens and the data crosses process boundaries…
  5. Now in the server process, the application code decodes the text map from the homegrown HTTP protocol and uses it to initialize a text map Carrier
  6. The server process calls Tracer.Extract(...), passing in a format identifier for a text map and the Carrier from above
  7. In the absence of data corruption or other errors, the server now has a SpanContext instance that belongs to the same trace as the one in the client
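A condensed Python sketch of that sequence, playing both roles in one process; the homegrown HTTP transport is simulated by handing the headers dict straight across, and with the no-op tracer the extracted context carries no real state, but the call pattern is identical with a real implementation:

import opentracing

tracer = opentracing.Tracer()  # in reality the client and server each configure their own tracer

# --- client process ---
with tracer.start_active_span('client_request') as scope:
    headers = {}  # text map Carrier
    tracer.inject(scope.span.context, opentracing.Format.HTTP_HEADERS, headers)
    # ... encode `headers` into the homegrown HTTP request and send it ...

# --- server process ---
incoming_headers = headers  # decoded from the homegrown HTTP protocol on the server
extracted_ctx = tracer.extract(opentracing.Format.HTTP_HEADERS, incoming_headers)
server_span = tracer.start_span('server_handler', child_of=extracted_ctx)
server_span.finish()  # belongs to the same trace as the client span when a real tracer is used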

Other examples can be found in the OpenTracing use cases doc.


Original article: https://www.cnblogs.com/panpanwelcome/p/11592410.html