Introduction to DeepFlow

Created: 2022-07-25  Last Modified: 2024-06-24

This document was translated by ChatGPT

# 1. What is DeepFlow

DeepFlow is an observability product developed by Yunshan Networks, designed to provide deep observability for complex cloud infrastructures and cloud-native applications. DeepFlow leverages eBPF to achieve zero-intrusion (Zero Code) collection of observability signals such as application performance metrics, distributed tracing, and continuous profiling. It also uses SmartEncoding technology to achieve full-stack (Full Stack) correlation and efficient storage of all observability signals. With DeepFlow, cloud-native applications can automatically gain deep observability, eliminating the heavy burden of constant instrumentation for developers and providing DevOps/SRE teams with monitoring and diagnostic capabilities from code to infrastructure.

To encourage global developers and researchers in the observability field to innovate and contribute more, DeepFlow's core modules have been open-sourced under the Apache 2.0 License. Additionally, the academic paper "Network-Centric Distributed Tracing with DeepFlow: Troubleshooting Your Microservices in Zero Code" was published at ACM SIGCOMM 2023, a top international conference in the field of network communication.

# 2. Core Features

  • Universal service map for any service: Utilizing the leading AutoMetrics mechanism and eBPF technology, DeepFlow can zero-intrusively draw a universal service map of the production environment, including services developed in any language, third-party services with unknown code, and all cloud-native infrastructure services. It has built-in capabilities to parse numerous application protocols and provides a Wasm plugin mechanism to extend the parsing of any proprietary protocol. It zero-intrusively calculates full-stack golden metrics for each call within the application and infrastructure, quickly identifying performance bottlenecks.
  • Distributed tracing for any request: Using the leading AutoTracing mechanism and leveraging eBPF and Wasm technology, DeepFlow achieves zero-intrusion distributed tracing, supporting applications in any language and fully covering gateways, service meshes, databases, message queues, DNS, network cards, and other infrastructure, leaving no tracing blind spots. It also automatically collects the full-stack network performance metrics and file read/write events associated with each Span. This marks the beginning of a new era of zero-instrumentation distributed tracing.
  • Continuous profiling for any function: Based on the leading AutoProfiling mechanism, DeepFlow uses eBPF technology to zero-intrusively collect performance profiling data from production processes with less than 1% overhead, drawing function-level On-CPU and Off-CPU flame graphs. It quickly identifies full-stack performance bottlenecks in application, library, and kernel functions and automatically correlates them with distributed tracing data. Even on kernel versions as old as 2.6+, it can still provide network performance profiling capabilities, offering insight into code performance bottlenecks.
  • Seamless integration with popular observability tech stacks: DeepFlow can serve as a storage backend for Prometheus, OpenTelemetry, SkyWalking, and Pyroscope, and it provides SQL, PromQL, and OTLP data interfaces so that popular tech stacks can use DeepFlow as a data source (see the integration sketch after this list). Using the leading AutoTagging mechanism, it automatically injects unified tags into all observability signals, including cloud resources, K8s container resources, K8s Labels/Annotations, and business attributes from your CMDB, eliminating data silos.
  • Storage performance 10x that of ClickHouse: Utilizing the leading SmartEncoding mechanism, DeepFlow injects standardized, pre-encoded meta tags into all observability signals, reducing storage overhead by 10x compared with ClickHouse's String or LowCard solutions. Custom tags are stored separately from the observability data, so you can confidently inject tags of nearly unlimited dimension and cardinality while still enjoying a BigTable-like query experience.
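
As a concrete illustration of the OTLP integration path mentioned above, the following Go sketch configures a standard OpenTelemetry SDK exporter to send spans over OTLP/gRPC. The endpoint address is a placeholder assumption for illustration only; point it at whatever address your deepflow-agent (or an OpenTelemetry Collector forwarding to it) actually exposes for OTLP traffic.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export spans over OTLP/gRPC. The endpoint below is a hypothetical
	// address; replace it with the OTLP endpoint exposed in your cluster.
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("deepflow-agent.deepflow.svc:4317"), // assumption
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatalf("create OTLP exporter: %v", err)
	}

	// Standard OpenTelemetry SDK setup: batch spans and tag the service name.
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewSchemaless(
			attribute.String("service.name", "demo-service"),
		)),
	)
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	// Emit one span as an example workload.
	_, span := tp.Tracer("demo").Start(ctx, "checkout")
	time.Sleep(10 * time.Millisecond)
	span.End()
}
```

Nothing DeepFlow-specific is needed on the application side: the spans travel over standard OTLP, and the AutoTagging mechanism described above attaches the unified resource tags when they are stored.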

# 3. Addressing Two Major Pain Points

Traditional solutions, such as APM, aim to achieve application observability through code instrumentation. Instrumentation allows applications to expose a rich set of observability signals, including metrics, tracing, logs, and function performance profiling. However, the act of instrumentation actually changes the internal state of the original program, logically contradicting the requirement of observability to "determine internal states from external data." In core business systems of important industries like finance and telecommunications, deploying APM Agents is very challenging. In the cloud-native era, this traditional method faces even more severe challenges. Overall, the problems with APM are mainly reflected in two aspects: the intrusiveness of Agents makes deployment difficult, and observability blind spots make triage impossible.

First, the intrusiveness of probes makes deployment difficult. The process of instrumentation requires modifying the source code of the application and redeploying it. Even bytecode enhancement technologies like Java Agent require modifying the application's startup parameters and redeploying. However, modifying application code is just the first hurdle; many other issues often arise during deployment:

  1. Code conflicts: Have you ever encountered runtime conflicts between different Agents when injecting multiple Java Agents for purposes like distributed tracing, performance profiling, logging, or service mesh? Have you faced dependency library version conflicts that prevent successful compilation when introducing an observability SDK? The more business teams there are, the more prominent these compatibility issues become.
  2. Maintenance difficulties: If you are responsible for maintaining your company's Java Agent or SDK, how frequently can you update it? How many versions of probe programs are currently running in your company's production environment? How long would it take to update them all to the same version? How many languages' probe programs do you need to maintain simultaneously? When the microservice framework or RPC framework of the enterprise is not unified, these maintenance issues become even more severe.
  3. Blurred boundaries: Instrumentation code blends into the runtime logic of the business code, where it can be neither distinguished nor controlled. As a result, the instrumentation code is often the first thing blamed when performance degrades or runtime errors occur. Even a probe that has been battle-tested for a long time will inevitably fall under suspicion when issues arise.

In fact, this is why intrusive instrumentation solutions are rarely seen in successful commercial products and are more common in active open-source communities. The activity of communities like OpenTelemetry and SkyWalking is evidence of this. In large enterprises with clear departmental divisions, overcoming collaboration difficulties is an unavoidable hurdle for a technical solution to be successfully implemented. This is especially true in critical industries like finance, telecommunications, and power, where the division of departmental responsibilities and conflicts of interest often make instrumentation-based solutions "impossible" to implement. Even in open and collaborative internet companies, there are still issues such as developers' reluctance to instrument their code and operations personnel being blamed for performance failures. After long-term efforts, people have come to realize that intrusive solutions only work when each business development team introduces and maintains its own Agents and SDKs and takes responsibility for the associated performance risks and runtime failures.

Second, observability blind spots make triage impossible. Even when APM has been implemented in the enterprise, we still find it difficult to delineate the boundaries of troubleshooting, especially in cloud-native infrastructures. This is because developers and operations often speak different languages. For example, when call latency is too high, developers may suspect a slow network, a slow gateway, a slow database, or a slow server. However, due to the lack of full-stack observability, the responses from the network, gateway, and database teams are usually a pile of unrelated metrics: no packet loss on the network card, low CPU usage on the process, no slow logs in the database, low server-side latency. None of this actually solves the problem. Triage is the most critical step in the entire fault handling process, and its efficiency is crucial.

If you are a business development engineer, you should care about system calls and network transmission in addition to the business logic itself; if you are a Serverless tenant, you may also need to pay attention to the service mesh sidecar and its network transmission; if you use virtual machines directly or run a self-built K8s cluster, then the container network is a key area of focus, especially core K8s services such as CoreDNS and the Ingress Gateway; if you are a private cloud compute service administrator, you should care about network performance on the KVM hosts; if you are on a private cloud gateway, storage, or security team, you also need to pay attention to system call and network transmission performance on your service nodes. More importantly, the data used for fault triage should be expressed in a common language: how much time each hop of an application call consumed along the entire full-stack path. We have found that the observability data developers provide through instrumentation may cover only about a quarter of that full-stack path. In the cloud-native era, relying solely on APM to solve fault triage is wishful thinking.

# 4. Using eBPF Technology

We assume you already have a basic understanding of eBPF: it is a secure and efficient technology for extending kernel functionality by running programs in a sandbox, a revolutionary alternative to the traditional approach of modifying kernel source code or writing kernel modules. eBPF programs are event-driven: when the kernel or a user program passes an eBPF Hook point, the eBPF program loaded at that Hook point is executed. The Linux kernel predefines a series of commonly used Hook points, and you can also dynamically add custom Hook points in the kernel and in applications using kprobe and uprobe technology. Thanks to Just-in-Time (JIT) compilation, eBPF code can run as efficiently as native kernel code and kernel modules. Thanks to the Verification mechanism, eBPF code runs safely and will not cause kernel crashes or enter infinite loops.
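
To make the Hook mechanism concrete, here is a minimal sketch in Go using the cilium/ebpf library (this is illustrative, not DeepFlow's collector code): it builds a trivial eBPF program from raw instructions and attaches it to a kprobe Hook point on the kernel function tcp_sendmsg, so the sandboxed program runs every time the kernel reaches that function. The symbol name and the do-nothing program body are assumptions for demonstration; running it requires root privileges (or CAP_BPF) and a reasonably recent kernel.

```go
package main

import (
	"log"
	"time"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/asm"
	"github.com/cilium/ebpf/link"
)

func main() {
	// A trivial eBPF program built from raw instructions: it just returns 0.
	// A real collector would read function arguments and push events to
	// user space through eBPF maps.
	prog, err := ebpf.NewProgram(&ebpf.ProgramSpec{
		Type:    ebpf.Kprobe,
		License: "GPL",
		Instructions: asm.Instructions{
			asm.Mov.Imm(asm.R0, 0),
			asm.Return(),
		},
	})
	if err != nil {
		log.Fatalf("load eBPF program: %v", err)
	}
	defer prog.Close()

	// Attach the program to a kprobe Hook point on tcp_sendmsg (an
	// illustrative choice of kernel symbol).
	kp, err := link.Kprobe("tcp_sendmsg", prog, nil)
	if err != nil {
		log.Fatalf("attach kprobe: %v", err)
	}
	defer kp.Close()

	log.Println("eBPF program attached to tcp_sendmsg for 10 seconds")
	time.Sleep(10 * time.Second)
}
```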

Figure: eBPF Hook points overview (source: https://ebpf.io/what-is-ebpf/#hook-overview)

The sandbox mechanism is the core difference between eBPF and APM instrumentation mechanisms. "Sandbox" draws a clear boundary between eBPF code and application code, allowing us to determine the internal state of the application by obtaining external data without making any modifications to the application. Let's analyze why eBPF is an excellent solution to the defects of APM code instrumentation:

First, zero intrusion solves the deployment difficulty. Since eBPF programs require no changes to application code, there are no runtime conflicts like those caused by Java Agents and no compilation conflicts with SDKs, which solves the code conflict problem. Since running eBPF programs does not require changing or restarting application processes, applications never need to be redeployed, which avoids the version-maintenance pain of Java Agents and SDKs and solves the maintenance difficulty. Since eBPF programs run efficiently and safely thanks to JIT compilation and the Verification mechanism, there is no need to worry about unexpected performance degradation or runtime errors in application processes, which solves the blurred boundary problem. Additionally, from a management perspective, only a single independent eBPF Agent process needs to run on each host, so its CPU and other resource consumption can be controlled precisely.

Second, full-stack capability solves the fault triage difficulty. eBPF's capabilities cover every layer from the kernel to user programs, allowing us to track a request from the application, through system calls, network transmission, gateway services, security services, to the database service or peer microservice in the full-stack path, providing sufficient neutral observability data to quickly complete fault triage.

For a more detailed analysis of this topic, please refer to our article "eBPF: The Key Technology for Observability" (《eBPF 是实现可观测性的关键技术》).

It is important to emphasize that this does not mean DeepFlow uses only eBPF technology. On the contrary, DeepFlow integrates seamlessly with popular observability tech stacks; for example, it can serve as the storage backend for observability signals from Prometheus, OpenTelemetry, SkyWalking, Pyroscope, and others.

# 5. Mission and Vision

  • Mission: Make observability simpler.
  • Vision: Become the preferred choice for achieving observability in cloud-native applications.