Build Telemetry for Distributed Services: Elastic APM

Official documentation: https://www.elastic.co/guide/en/apm/get-started/current/index.html

Overview

Elastic APM is an application performance monitoring system built on the Elastic Stack. It allows you to monitor software services and applications in real time — collect detailed performance information on response time for incoming requests, database queries, calls to caches, external HTTP requests, and more. This makes it easy to pinpoint and fix performance problems quickly.

Elastic APM also automatically collects unhandled errors and exceptions. Errors are grouped based primarily on the stacktrace, so you can identify new errors as they appear and keep an eye on how many times specific errors happen.
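
As a rough illustration of grouping errors by stack trace, the sketch below derives a grouping key by hashing the frames of each trace and counts occurrences per key. The frame representation and the hashing scheme are assumptions made for this example, not Elastic APM's actual algorithm.

```python
import hashlib
from collections import Counter

def grouping_key(frames):
    """Hash the (module, function) pairs of a stack trace into a stable key.

    Hypothetical scheme: line numbers are left out so that small code edits
    do not split one logical error into several groups.
    """
    digest = hashlib.sha256()
    for module, function in frames:
        digest.update(f"{module}:{function}\n".encode("utf-8"))
    return digest.hexdigest()

# Three captured errors; the first two share the same stack trace.
errors = [
    [("app.views", "checkout"), ("app.db", "save_order")],
    [("app.views", "checkout"), ("app.db", "save_order")],
    [("app.views", "login"), ("app.auth", "verify")],
]

# Number of occurrences per error group.
counts = Counter(grouping_key(frames) for frames in errors)
```

A key that appears in `counts` for the first time corresponds to a new error group, and the count shows how often a known error recurs.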

Metrics are another important source of information when debugging production systems. Elastic APM agents automatically pick up basic host-level metrics and agent-specific metrics, like JVM metrics in the Java Agent, and Go runtime metrics in the Go Agent.

Components and documentation

Elastic APM consists of four components: APM Agents, APM Server, Elasticsearch, and Kibana.

APM Agents

APM agents are open source libraries written in the same language as your service. You may only need one, or you might use all of them. You install them into your service as you would install any other library. They instrument your code and collect performance data and errors at runtime. This data is buffered for a short period and sent on to APM Server.
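
The buffering behavior described above can be pictured with a minimal sketch. The class name, the size and age limits, and the `send` callback are all hypothetical; a real agent would typically also flush from a background timer and send over HTTP to APM Server.

```python
import time

class EventBuffer:
    """Collect events and hand them to `send` in batches."""

    def __init__(self, send, max_events=100, max_age_s=10.0, clock=time.monotonic):
        self._send = send
        self._max_events = max_events
        self._max_age_s = max_age_s
        self._clock = clock
        self._events = []
        self._oldest = None  # time the oldest buffered event was added

    def add(self, event):
        if not self._events:
            self._oldest = self._clock()
        self._events.append(event)
        too_many = len(self._events) >= self._max_events
        too_old = self._clock() - self._oldest >= self._max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self):
        # Send whatever is buffered and start a fresh batch.
        if self._events:
            batch, self._events = self._events, []
            self._send(batch)

# Usage: here "sending" just appends the batch to a list.
sent_batches = []
buffer = EventBuffer(send=sent_batches.append, max_events=2)
for event in ["transaction-1", "span-1", "error-1"]:
    buffer.add(event)
buffer.flush()  # for example on shutdown
```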

Each agent has its own documentation.

APM Server

APM Server is an open source application that receives performance data from your APM agents. It’s a separate component by design, which helps keep the agents light, prevents certain security risks, and improves compatibility across the Elastic Stack.

After the APM Server has validated and processed events from the APM agents, the server transforms the data into Elasticsearch documents and stores them in corresponding Elasticsearch indices. In a matter of seconds you can start viewing your application performance data in the Kibana APM UI.

The APM Server reference provides everything you need when it comes to working with the server. Here you can learn about installation, configuration, security, monitoring, and more.

Elasticsearch

Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze large volumes of data quickly and in near real time. Elasticsearch is used to store APM performance metrics and make use of its aggregations.

APM Kibana UI

Kibana is an open source analytics and visualization platform designed to work with Elasticsearch. You use Kibana to search, view, and interact with data stored in Elasticsearch.

Since application performance monitoring is all about visualizing data and detecting bottlenecks, it’s crucial you understand how to use the Kibana APM UI. The sections under Using APM will help you get started.

APM also has built-in integrations with Machine Learning. To learn more about this feature, refer to the Kibana UI documentation for Machine learning integration.

Visualizing Application Bottlenecks

Elastic APM captures different types of information from within instrumented applications:

Spans contain information about a specific code path that has been executed. They measure from the start to end of an activity, and they can have a parent/child relationship with other spans.
Transactions are a special kind of span that have extra metadata associated with them. You can think of transactions as the highest level of work you’re measuring within a service. As an example, a transaction could be a request to your server, a batch job, or a custom transaction type.
Errors contain information about the original exception that occurred or about a log created when the exception occurred.

Each of these information types has a specific page associated with it in the APM UI. These pages display the captured data in curated charts and tables that allow you to easily compare and debug your applications.

For example, you can see information about response times, requests per minute, and status codes per endpoint. You can even dive into a specific request sample and get a complete waterfall view of what your application is spending its time on. You might see that your bottlenecks are in database queries, cache calls, or external requests. For each incoming request and each application error, you can also see contextual information such as the request header, user information, system values, or custom data that you manually attached to the request.

Having access to application-level insights with just a few clicks can drastically decrease the time you spend debugging errors, slow response times, and crashes.

Using APM

APM is designed to be as intuitive as possible, but you might come across certain terms or concepts that don’t feel native to you. Not to worry, we’ve created this guide to help you get the most out of Elastic APM.

APM is available via the navigation sidebar in Kibana.

Services overview

The Services overview gives you quick insights into the health and general performance of each service.

You can add services by setting the service.name configuration in each of the APM agents you’re instrumenting.

Traces overview

The Traces overview displays the entry transaction for all traces in your application. If you’re using Distributed tracing, this view is key to finding the critical paths within your application. Transactions with the same name are grouped together and only shown once in this table.

By default, transactions are sorted by Impact. Impact helps show the most used and slowest endpoints in your service - in other words, it’s the collective amount of pain a specific endpoint is causing your users. If there’s a particular endpoint you’re worried about, you can click on it to view the transaction details.
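
To make the idea of Impact concrete, here is a minimal sketch that ranks transaction groups by the total time spent in them, so that a group scores high when it is called often, is slow, or both. Treating impact as proportional to summed duration is an assumption for illustration; the exact formula used by the APM UI is not given here.

```python
from collections import defaultdict

def rank_by_impact(transactions):
    """Return (name, total_ms, share) tuples sorted by total time, descending."""
    totals = defaultdict(float)
    for name, duration_ms in transactions:
        totals[name] += duration_ms
    grand_total = sum(totals.values()) or 1.0  # avoid dividing by zero
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, total, total / grand_total) for name, total in ranked]

# (transaction name, duration in milliseconds)
transactions = [
    ("GET /products", 40.0),
    ("GET /products", 60.0),
    ("GET /products", 50.0),
    ("POST /checkout", 400.0),
    ("GET /health", 2.0),
]

ranking = rank_by_impact(transactions)
```

In this sample, `POST /checkout` ranks first even though it was called only once, because a single slow request outweighs three fast ones.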

Distributed tracing

Elastic APM supports distributed tracing. Distributed tracing is a key feature of modern application performance monitoring as application architectures are shifting from monolithic to more distributed, service-based architectures.

Distributed tracing allows APM users to automatically trace requests all the way through the service architecture, and visualize those traces in one single view in the APM UI. This is accomplished by tracing all of the requests, from the initial web request to your front-end service, to queries made to your back-end services. This makes finding possible bottlenecks throughout your application much easier and faster.

By definition, a distributed trace includes more than one transaction. You can use the span timeline visualization to view a waterfall display of all of the transactions from individual services that are connected in a trace.

Distributed tracing is supported by all APM agents and there’s no additional configuration needed.

Transaction overview

A transaction describes an event captured by an Elastic APM agent instrumenting a service. The APM agents automatically collect performance metrics on HTTP requests, database queries, and much more.

Selecting a service brings you to the transactions overview. The time spent by span type, transaction duration, and requests per minute charts display information on all transactions associated with the selected service. The Transactions table, however, provides only a list of transaction groups for the selected service. In other words, this view groups all transactions of the same name together, and only displays one transaction for each group.

Time spent by span type (beta). This functionality is in beta and is subject to change. The design and code are less mature than official GA features and are being provided as-is with no warranties. Beta features are not subject to the support SLA of official GA features.

Certain agents support breakdown graphs in the APM UI. This graph is an easy way to visualize where your application is spending most of its time. For example, is your app spending time in external calls, database processing, or application code execution?

The time a transaction took to complete is also recorded and displayed on the chart under the "app" label. "app" indicates that something was happening within the application, but we’re not sure exactly what. This could be a sign that the agent does not have auto-instrumentation for whatever was happening during that time.

It’s important to note that if you have asynchronous spans, the sum of all span times may exceed the duration of the transaction.
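
A small worked example shows why overlapping (asynchronous) spans can add up to more than the transaction itself. The sketch below also estimates the uninstrumented time as the transaction duration minus the time covered by at least one span; equating that remainder with the "app" label is an assumption for illustration.

```python
def summed_span_time(spans):
    """Naive sum of span durations; overlapping spans are counted twice."""
    return sum(end - start for start, end in spans)

def covered_time(spans):
    """Length of the union of span intervals (overlaps counted once)."""
    covered = 0.0
    current_start = current_end = None
    for start, end in sorted(spans):
        if current_end is None or start > current_end:
            # Close the previous merged interval and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered

transaction_duration_ms = 100.0
# (start, end) offsets in ms: two concurrent HTTP calls, then a database query.
spans = [(10.0, 70.0), (20.0, 80.0), (85.0, 95.0)]

total = summed_span_time(spans)
uninstrumented = transaction_duration_ms - covered_time(spans)
```

Here the spans sum to 130 ms inside a 100 ms transaction, while only 80 ms of the transaction is actually covered by spans, leaving 20 ms unaccounted for.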

If the Time spent by span type chart is missing in the APM UI, it means your agent does not support this feature yet.

Transaction duration shows the response times for this service and is broken down into average, 95th, and 99th percentile. If there’s a weird spike that you’d like to investigate, you can simply zoom in on the graph - this will adjust the specific time range, and all of the data on the page will update accordingly.
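
The difference between the average and the percentiles is easiest to see on sample data. The following sketch computes all three with a nearest-rank percentile; the sample durations are made up, and the UI may use a different percentile estimator.

```python
from statistics import mean

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Ceiling division with integers avoids floating-point rounding issues.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]

# Twenty response times in milliseconds, with one slow outlier.
durations_ms = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13,
                12, 14, 13, 15, 11, 16, 14, 13, 12, 900]

summary = {
    "avg": mean(durations_ms),
    "p95": percentile(durations_ms, 95),
    "p99": percentile(durations_ms, 99),
}
```

With this data the average is 57.75 ms, the 95th percentile is 16 ms, and the 99th percentile is 900 ms. A single outlier inflates the average, whereas the percentiles show that almost all requests are fast and a few are very slow.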

Requests per minute is divided into response codes: 2xx, 3xx, 4xx, etc., and is useful for determining if you’re serving more of one code than you typically do. Like in the Transaction duration graph, you can zoom in on anomalies to further investigate them.
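
A minimal sketch of how such a chart could be derived from raw requests: each request is assigned to a one-minute bucket and a status class. The input format (a timestamp in seconds plus an HTTP status code) is an assumption for this example.

```python
from collections import Counter

def requests_per_minute(requests):
    """Count requests per (minute, status class), for example (0, "2xx")."""
    counts = Counter()
    for timestamp_s, status_code in requests:
        minute = int(timestamp_s // 60)
        status_class = f"{status_code // 100}xx"
        counts[(minute, status_class)] += 1
    return counts

# (seconds since start of the window, HTTP status code)
requests = [(5, 200), (20, 200), (42, 404), (61, 200), (75, 500), (118, 503)]

rpm = requests_per_minute(requests)
```

In the second minute of this sample the 5xx count exceeds the 2xx count, which is the kind of shift the chart is meant to make visible.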

The Transactions table is similar to the traces overview and shows the name of each transaction occurring in the selected service. Transactions with the same name are grouped together and only shown once in this table. By default, transaction groups are sorted by Impact. Impact helps show the most used and slowest endpoints in your service - in other words, it’s the collective amount of pain a specific endpoint is causing your users. If there’s a particular endpoint you’re worried about, you can click on it to view the transaction details.

The transaction overview will only display helpful information when the transactions in your service are named correctly.

Elastic APM Agents come with built-in support for popular frameworks out-of-the-box. However, if you only see one route in the Transaction overview page, or if you have transactions named "unknown route", it could be a symptom that the agent either wasn’t installed correctly or doesn’t support your framework.

For further details, including troubleshooting and custom implementation instructions, refer to the documentation for each APM Agent you’ve implemented.

Transaction details

Selecting a transaction group will bring you to the transaction details. Transaction details include a high-level overview of the time spent by span type, transaction group duration, requests per minute, and transaction group duration distribution. It’s important to note that all of these graphs show data from every transaction within the selected transaction group.

A single sampled transaction is also displayed. This sampled transaction is based on your selection in the transaction duration distribution graph. You can update the sampled transaction by selecting a new bucket in that graph. The number of requests per bucket is displayed when hovering over the graph, and the selected bucket is highlighted.

For a particular transaction sample, we can get even more information in the metadata tab:

Labels - Custom labels added by agents
HTTP request/response information
Host information
Container information
Service - The service/application runtime, agent, name, etc.
Process - The process id that served up the request.
Agent information
URL
User - Requires additional configuration, but allows you to see which user experienced the current transaction.
Custom - You can configure your agent to add custom contextual information on transactions.

All of this data is stored in documents in Elasticsearch. This means you can select "Actions - View sample document" to see the actual Elasticsearch document under the discover tab.

Span timeline

A span is defined as the duration of a single event. Spans are automatically captured by APM agents, and you can also define custom spans. Each span has a type and is defined by a different color in the timeline/waterfall visualization.

The span timeline visualization is a bird’s-eye view of what your application was doing while it was trying to respond to the request that came in. This makes it useful for visualizing where the selected transaction spent most of its time.

View a span in detail by clicking on it in the timeline waterfall. For example, clicking on an SQL Select database query displays the actual SQL that was executed, how long it took, and the percentage of the trace’s total time. You also get a stack trace, which shows where the SQL query was issued in your code. Finally, APM knows which files are your code and which are just modules or libraries that you’ve installed. These library frames are minimized by default in order to show you the most relevant stack trace.

If your span timeline is colorful, it’s indicative of a distributed trace. Services in a distributed trace are separated by color and listed in the order they occur.

Don’t forget, a distributed trace includes more than one transaction. When viewing these distributed traces in the timeline waterfall, you’ll see an icon that indicates the next transaction in the trace. These transactions can be expanded and viewed in detail by clicking on them.

After exploring these traces, you can return to the full trace by clicking View full trace in the upper right-hand corner of the page.

Metrics overview

The Metrics overview provides agent-specific metrics, which lets you perform more in-depth root cause analysis investigations within the APM UI.

If you’re experiencing a problem with your service, you can use this page to attempt to find the underlying cause. For example, you might be able to correlate a high number of errors with a long transaction duration, high CPU usage, or a memory leak.

Machine Learning integration

The Machine Learning integration will initiate a new job predefined to calculate anomaly scores on transaction response times. The response time graph will show the expected bounds and annotate the graph when the anomaly score is 75 or above.

Jobs can be created per transaction type and based on the average response time. You can manage jobs in the Machine Learning jobs management. It might take some time for results to appear on the graph.

Machine learning is a platinum feature. For a comparison of the Elastic license levels, see the subscription page.

Data Model

Elastic APM agents capture different types of information from within their instrumented applications. These are known as events, and can be spans, transactions, errors, or metrics.

Events can contain additional metadata which further enriches your data.

Spans

Spans contain information about a specific code path that has been executed. They measure from the start to end of an activity, and they can have a parent/child relationship with other spans.

Agents automatically instrument a variety of libraries to capture these spans from within your application. In addition, you can use the Agent API for ad hoc instrumentation of specific code paths.

A span contains:

A transaction.id attribute that refers to its parent transaction.
A parent.id attribute that refers to its parent span, or its transaction.
start time and duration
name
type
stack trace (optional)
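
The attributes listed above are enough to rebuild the parent/child hierarchy shown in the span timeline. The sketch below models a span as a small data class and renders an indented waterfall; the Python field names are hypothetical stand-ins for the documented attributes, not the stored document schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    id: str
    transaction_id: str   # the transaction this span belongs to
    parent_id: str        # id of the parent span, or of the transaction
    name: str
    type: str
    start_us: int         # offset from transaction start, in microseconds
    duration_us: int
    stacktrace: Optional[list] = None

def render_waterfall(transaction_id, spans):
    """Return one indented line per span, children nested under parents."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    lines = []

    def visit(parent_id, depth):
        for span in sorted(children[parent_id], key=lambda s: s.start_us):
            lines.append(f"{'  ' * depth}{span.name} [{span.type}] {span.duration_us}us")
            visit(span.id, depth + 1)

    visit(transaction_id, 0)
    return lines

spans = [
    Span("s2", "t1", "s1", "SELECT FROM users", "db", 1500, 800),
    Span("s1", "t1", "t1", "UsersController#show", "app", 1000, 3000),
    Span("s3", "t1", "t1", "GET cache", "cache", 200, 300),
]

waterfall = render_waterfall("t1", spans)
```

Spans whose `parent_id` equals the transaction id appear at the top level, and the database query is nested under the controller span that issued it.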

Most agents limit keyword fields (e.g. span.id) to 1024 characters, and non-keyword fields (e.g. span.start.us) to 10,000 characters.

Metrics

APM agents automatically pick up basic host-level metrics, including system and process-level CPU and memory metrics. Agent-specific metrics are also available, like JVM metrics in the Java Agent, and Go runtime metrics in the Go Agent.

Infrastructure and application metrics are important sources of information when debugging production systems, which is why we’ve made it easy to filter metrics for specific hosts or containers in the Kibana metrics overview.

Metrics have the processor.event property set to metric.

Metrics are stored in metric indices.

For a full list of tracked metrics, see the relevant agent documentation.

Transactions

Transactions are a special kind of span that have additional attributes associated with them. They describe an event captured by an Elastic APM agent instrumenting a service. You can think of transactions as the highest level of work you’re measuring within a service. As an example, a transaction might be a:

Request to your server
Batch job
Background job
Custom transaction type

Agents decide whether to sample transactions or not, and provide settings to control sampling behavior. If sampled, the spans of a transaction are sent and stored as separate documents. Within one transaction there can be 0, 1, or many spans captured.
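
The sampling decision can be sketched as a single random draw against a configured rate. The function names and document shapes below are hypothetical, and the sketch assumes that an unsampled transaction is still recorded, only without its spans.

```python
import random

def is_sampled(sample_rate, rng=random.random):
    """Decide once, at transaction start, whether to keep span detail."""
    return rng() < sample_rate

def record_transaction(name, span_names, sample_rate, rng=random.random):
    """Return the documents that would be sent for one transaction."""
    sampled = is_sampled(sample_rate, rng)
    documents = [{"type": "transaction", "name": name, "sampled": sampled}]
    if sampled:
        # Spans are stored as separate documents, one per span.
        documents += [{"type": "span", "name": span} for span in span_names]
    return documents

# With a 10% sample rate, a draw of 0.05 is sampled and a draw of 0.5 is not.
kept = record_transaction("GET /users/:id", ["SELECT", "GET cache"], 0.1, rng=lambda: 0.05)
dropped = record_transaction("GET /users/:id", ["SELECT", "GET cache"], 0.1, rng=lambda: 0.5)
```

Passing `rng` explicitly keeps the example deterministic; in normal use the default random generator makes the decision.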

A transaction contains:

The timestamp of the event
A unique id, type, and name
Data about the environment in which the event is recorded:
- Service - environment, framework, language, etc.
- Host - architecture, hostname, IP, etc.
- Process - args, PID, PPID, etc.
- URL - full, domain, port, query, etc.
- User - (if supplied) email, ID, username, etc.
Other relevant information depending on the agent. Example: The JavaScript RUM agent captures transaction marks, which are points in time relative to the start of the transaction with some label.

In addition, agents provide options for users to capture custom metadata. Metadata can be indexed - labels, or not-indexed - custom.

Transactions are grouped by their type and name in the APM UI’s Transaction overview. If you’re using a supported framework, APM agents will automatically handle the naming for you. If you’re not, or if you wish to override the default, all agents have API methods to manually set the type and name.

type should be a keyword of specific relevance in the service’s domain, e.g. request, backgroundjob, etc.
name should be a generic designation of a transaction in the scope of a single service, e.g. GET /users/:id, UsersController#show, etc.
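
To illustrate why a generic name matters, the sketch below maps concrete request paths onto route templates so that requests for different users fall into the same transaction group. The route table and function are hypothetical; supported frameworks do this automatically through the agent.

```python
import re

# (HTTP method, pattern for the concrete path, generic transaction name)
ROUTES = [
    ("GET", re.compile(r"^/users/\d+$"), "GET /users/:id"),
    ("GET", re.compile(r"^/users$"), "GET /users"),
]

def transaction_name(method, path):
    """Return the generic name for a request, or a fallback if no route matches."""
    for route_method, pattern, name in ROUTES:
        if method == route_method and pattern.match(path):
            return name
    return "unknown route"

names = [transaction_name("GET", path) for path in ("/users/42", "/users/7", "/users")]
```

Naming each transaction after its raw URL would instead create one group per user id, which makes the Transactions table hard to read.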

Most agents limit keyword fields (e.g. labels) to 1024 characters, and non-keyword fields (e.g. span.db.statement) to 10,000 characters.
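
The field limits amount to a simple truncation step before an event is sent. This is a minimal sketch using the two limits stated above; the function name is hypothetical, and real agents may mark truncated values differently, for example with a trailing ellipsis.

```python
KEYWORD_MAX_CHARS = 1024      # limit for keyword fields such as labels
NON_KEYWORD_MAX_CHARS = 10_000  # limit for fields such as span.db.statement

def truncate(value, keyword=True):
    """Cut a string field down to the limit for its field kind."""
    limit = KEYWORD_MAX_CHARS if keyword else NON_KEYWORD_MAX_CHARS
    return value if len(value) <= limit else value[:limit]

label = truncate("x" * 2000)
statement = truncate("SELECT " + "a, " * 5000, keyword=False)
```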

Transactions are stored in transaction indices.

Errors

An error event contains at least information about the original exception that occurred or about a log created when the exception occurred. For simplicity, errors are represented by a unique ID.

An Error contains:

Both the captured exception and the captured log of an error can contain a stack trace, which is helpful for debugging.
The culprit of an error indicates where it originated.
An error might relate to the transaction during which it happened, via the transaction.id.
Data about the environment in which the event is recorded:

Service - environment, framework, language, etc.
Host - architecture, hostname, IP, etc.
Process - args, PID, PPID, etc.
URL - full, domain, port, query, etc.
User - (if supplied) email, ID, username, etc.

In addition, agents provide options for users to capture custom metadata. Metadata can be indexed - labels, or not-indexed - custom.

Errors are stored in error indices.

Distributed tracing

Together, Transactions and Spans form a Trace. Traces are not events, but group together events that have a common root.
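
Put differently, a trace is a grouping over events rather than an event of its own. The sketch below groups events by a shared trace identifier and flags a trace as distributed when it contains more than one transaction. The dictionary keys `trace_id`, `kind`, and `service` are hypothetical names chosen for this example.

```python
from collections import defaultdict

def group_by_trace(events):
    """Collect transactions and spans that share the same trace id."""
    traces = defaultdict(list)
    for event in events:
        traces[event["trace_id"]].append(event)
    return dict(traces)

def is_distributed(trace_events):
    """A distributed trace includes more than one transaction."""
    transactions = [e for e in trace_events if e["kind"] == "transaction"]
    return len(transactions) > 1

events = [
    {"trace_id": "a", "kind": "transaction", "service": "frontend"},
    {"trace_id": "a", "kind": "span", "service": "frontend"},
    {"trace_id": "a", "kind": "transaction", "service": "backend"},
    {"trace_id": "b", "kind": "transaction", "service": "batch"},
]

traces = group_by_trace(events)
distributed = {trace_id: is_distributed(evts) for trace_id, evts in traces.items()}
```

Trace `a` spans two services and counts as distributed, while trace `b` consists of a single transaction.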

Elastic APM supports distributed tracing. Distributed tracing enables you to analyze performance throughout your microservices architecture all in one view. This is accomplished by tracing all of the requests - from the initial web request to your front-end service - to queries made to your back-end services. This makes finding possible bottlenecks throughout your application much easier and faster. Best of all, there’s no additional configuration needed for distributed tracing, just ensure you’re using the latest version of the applicable agent.

The APM UI in Kibana also supports distributed tracing. The Timeline visualization has been redesigned to show all of the transactions from individual services that are connected in a trace.

Real User Monitoring (RUM)

Real User Monitoring captures user interaction with clients such as web browsers. The JavaScript Agent is Elastic’s RUM Agent. To use it you need to enable RUM support in the APM Server.

Unlike Elastic APM backend agents which monitor requests and responses, the RUM JavaScript agent monitors the real user experience and interaction within your client-side application. The RUM JavaScript agent is also framework-agnostic, which means it can be used with any frontend JavaScript application.

You will be able to measure metrics such as "Time to First Byte", domInteractive, and domComplete, which help you discover performance issues within your client-side application as well as issues that relate to the latency of your server-side application.

OpenTracing bridge

All Elastic APM agents have OpenTracing compatible bridges.

The OpenTracing bridge allows you to create Elastic APM transactions and spans using the OpenTracing API. This means you can reuse your existing OpenTracing instrumentation to quickly and easily begin using Elastic APM.

Agent specific details

Not all features of the OpenTracing API are supported. In addition, there are some Elastic APM specific tags you should be aware of. Please see the relevant Agent documentation for more detailed information.