v6.5 EE Release Notes
Created: 2024-11-05 Last Modified: 2024-11-05
This document was translated by ChatGPT
#1. Zero Intrusion
#1.1 Tracing
- AutoTracing
- ⭐ Enhanced the ability to extract TraceID and SpanID from SQL statement comments, support parsing variable values in precompiled SQL statements, and support collecting login usernames and current database names. See documentation here.
- ⭐ Added parsing capability for the bRPC protocol. See documentation here.
- ⭐ Added parsing capabilities for RabbitMQ AMQP, ActiveMQ OpenWire, NATS, ZeroMQ, and Pulsar protocols. See documentation here.
- ⭐ Enhanced Kafka protocol parsing: added the ability to parse the Partition, Offset, and GroupID fields and the JoinGroup, LeaveGroup, and SyncGroup messages; support extracting `correlation_id` from Kafka protocol headers as `x_request_id_0/1`, automatically tracing Kafka call chains in Request-Response mode; support extracting SpanID from `traceparent` and `sw8` in protocol headers, enhancing tracing capabilities. See documentation here.
- ⭐ Support using Wasm Plugin to enhance the parsing of Dubbo, NATS, and ZeroMQ protocols. See demo here.
- Support parsing Kryo serialization format for Dubbo protocol. See documentation here.
- Mark the log type of MySQL unidirectional messages (`CLOSE`, `QUIT`) directly as session.
- Added `captured_request_byte` and `captured_response_byte` metrics to call logs. See documentation here.
- Enhanced the parsing capability of `X-Tingyun` TraceID.
- AutoTagging
- ⭐ Added `biz_type` tag to application metrics and call logs, which can be used with Wasm Plugin to identify business types.
- ⭐ Kafka protocol supports extracting `topic_name` as `endpoint`. See documentation here.
- ⭐ Aggregated metrics no longer aggregate the WAN server side as `0.0.0.0`, and private IP addresses without any resource tags (192.168, 172.16, 10, 169.254) are no longer marked as WAN.
- ⭐ Support custom collection of HTTP/HTTP2/gRPC header fields and store them in the `attribute.$field_name` field of call logs. See detailed documentation.
- For A/AAAA type DNS requests, extract `QNAME` as `request_domain`. See documentation here.
- FastCGI, MQTT, and DNS protocols support extracting the `endpoint` field. See documentation here.
- Added main IP (`pod_node_ip`, `chost_ip`) and hostname (`pod_node_hostname`, `chost_hostname`) tags for container nodes and cloud servers to all data.
- The `auto_service` tag automatically aggregates container nodes (`pod_node`) into container clusters (`pod_cluster`), but `auto_instance` does not perform this aggregation.
- When a K8s workload (`pod_group`) is associated with multiple container services, the service name that sorts first in lexicographic order is used for the container service (`pod_service`) tag.
- Search Capabilities
- Added syntax sugar field XX, which can be used to match either of the two original fields `XX_0` or `XX_1`. Supported fields include: `x_request_id`, `syscall_thread`, `syscall_coroutine`, `syscall_cap_seq`, `syscall_trace_id`, `tcp_seq`.
- Added role grouping capability on the resource analysis page to distinguish statistics when resources act as clients or servers.
- Optimized the loading speed when switching the search box to container search or process search mode.
- The client and server columns in the aggregated data table support copy-pasting to the search bar.
- When entering resource filter conditions, candidate options support hovering to prompt resource information.
- Usability Enhancements
- ⭐ Linked call chain tracing with flow logs to view network performance metrics of Spans.
- ⭐ Call chain tracing and topology analysis pages support using DeepFlow Stella agent for intelligent analysis and interpretation, supporting the GPT4 model.
- Optimized the user experience of the search bar in "click search button to trigger" mode.
- Support remembering the activated state of the Tab below the call chain tracing flame graph and stabilizing the Tab layout.
- Linked highlighting of Spans in the call chain tracing flame graph and call logs in the table below.
- Optimized the presentation of Span tracing in call chain tracing.
- Optimized the parent-child logic of NET Spans in the call chain tracing flame graph.
- Improved the zoom in and zoom out experience of the topology graph.
- Enhanced the usability of copying knowledge graph tags.
- Displayed delay 0 as N/A in tables.
- Optimized the display of the "query area" in data tags.
- Support displaying resource icons by application protocol.
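One way to read the syntax sugar field from the Search Capabilities notes in this section: a filter on XX behaves like an OR across `XX_0` and `XX_1`. The sketch below only illustrates that reading. The helper name `expand_sugar`, the `SUGAR_FIELDS` set, and the SQL-like output format are assumptions made for this example and are not DeepFlow APIs.

```python
# Hypothetical helper: expand_sugar and SUGAR_FIELDS are illustrative names,
# not part of any DeepFlow API. The field list is copied from the notes.
SUGAR_FIELDS = {
    "x_request_id",
    "syscall_thread",
    "syscall_coroutine",
    "syscall_cap_seq",
    "syscall_trace_id",
    "tcp_seq",
}


def expand_sugar(field: str, value: str) -> str:
    """Return a filter that matches `value` in either underlying field."""
    if field not in SUGAR_FIELDS:
        # Ordinary fields stay a single equality condition.
        return f"{field} = '{value}'"
    return f"({field}_0 = '{value}' OR {field}_1 = '{value}')"


print(expand_sugar("x_request_id", "abc123"))
# (x_request_id_0 = 'abc123' OR x_request_id_1 = 'abc123')
```

The point of the sugar field is that the user no longer has to know on which side (`_0` or `_1`) a value was recorded.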
#1.2 Profiling
- AutoProfiling
- ⭐ Support Off-CPU Profiling with low overhead and continuous operation, which can be used to quickly locate bottleneck functions when application performance is low but CPU usage is not high.
- Usability Enhancements
- ⭐ Performance profiling flame graph supports using DeepFlow Stella agent for intelligent analysis and interpretation, supporting the GPT4 model.
- Changed the first line name in the flame graph from root to `$app_service`, which is the process name collected by eBPF or the service name set internally by the application.
- Optimized the loading speed when switching the search box to container search or process search mode.
- Differentiated the types of function names displayed in the eBPF flame graph: kernel functions, dynamic library functions, application functions.
- Optimized the Tip display of the eBPF flame graph.
#1.3 Network
- AutoMetrics
- Exposed traffic distribution metrics, supporting monitoring of traffic rates matching specific traffic distribution strategies.
- Renamed anomaly metrics: Connection-Client SYN End (`client_syn_repeat`) renamed to Connection-Server SYN Missing (`server_syn_miss`) and included in `server anomalies`.
- Renamed anomaly metrics: Connection-Server SYN End (`server_syn_repeat`) renamed to Connection-Client ACK Missing (`client_ack_miss`) and included in `client anomalies`.
- Set the status of flow logs with TCP disconnection anomalies to normal.
- AutoTagging
- ⭐ Added `request_domain` field to network flow logs, automatically associating with application metrics and call logs.
- ⭐ Aggregated metrics no longer aggregate the WAN server side as `0.0.0.0`, and private IP addresses without any resource tags (192.168, 172.16, 10, 169.254) are no longer marked as WAN.
- Added main IP (`pod_node_ip`, `chost_ip`) and hostname (`pod_node_hostname`, `chost_hostname`) tags for container nodes and cloud servers to all data.
- The `auto_service` tag automatically aggregates container nodes (`pod_node`) into container clusters (`pod_cluster`), but `auto_instance` does not perform this aggregation.
- When a K8s workload (`pod_group`) is associated with multiple container services, the service name that sorts first in lexicographic order is used for the container service (`pod_service`) tag.
- Search Capabilities
- Added syntax sugar field XX, which can be used to match either of the two original fields `XX_0` or `XX_1`. Supported fields include: `tunnel_tx_ip`, `tunnel_rx_ip`, `tunnel_tx_mac`, `tunnel_rx_mac`, `tcp_seq`.
- Added role grouping capability on the resource analysis page to distinguish statistics when resources act as clients or servers.
- Optimized the loading speed when switching the search box to container search or process search mode.
- The client and server columns in the aggregated data table support copy-pasting to the search bar.
- When entering resource filter conditions, candidate options support hovering to prompt resource information.
- Usability Enhancements
- ⭐ Topology analysis page supports using DeepFlow Stella agent for intelligent analysis and interpretation, supporting the GPT4 model.
- Optimized the user experience of the search bar in "click search button to trigger" mode.
- Improved the zoom in and zoom out experience of the topology graph.
- Enhanced the usability of copying knowledge graph tags.
- Displayed delay 0 as N/A in tables.
- Optimized the display of the "query area" in data tags.
- Optimized the display of the access relationship right slide-out panel.
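The AutoTagging note in this section says that untagged private addresses (192.168, 172.16, 10, 169.254) are no longer marked as WAN. A minimal sketch of such a check follows. The prefix lengths are an assumption taken from RFC 1918 and RFC 3927 rather than from these notes, and `is_wan` is a hypothetical function written for this example, not DeepFlow code.

```python
import ipaddress

# Assumed CIDR interpretation of the prefixes listed in the notes.
# The prefix lengths follow RFC 1918 (private) and RFC 3927 (link-local).
PRIVATE_NETS = [
    ipaddress.ip_network("192.168.0.0/16"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("169.254.0.0/16"),
]


def is_wan(ip: str, has_resource_tag: bool) -> bool:
    """Hypothetical classifier: an untagged address is WAN unless it is private."""
    if has_resource_tag:
        # Addresses that map to a known resource are never WAN.
        return False
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in PRIVATE_NETS)


print(is_wan("10.1.2.3", has_resource_tag=False))  # False
print(is_wan("8.8.8.8", has_resource_tag=False))   # True
```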
#2. Customization
#2.1 Dashboard
- Panel Enhancements
- ⭐ Added text-type Panels, supporting Markdown and Mermaid syntax.
- ⭐ Support adding Markdown descriptions to Panels.
- ⭐ Support customizing the right slide-out panel Tab page for all Panels, automatically associating the data displayed in the Tab page.
- Panels with multiple query conditions support waking up the right slide-out panel, automatically associating all observability data.
- The background curve of the overview chart supports hiding the coordinate axis.
- Optimized the color selection box on the Panel editing page.
- Optimized the style, metric settings, and advanced settings of Panels.
- Usability Enhancements
- Support copying and cloning Panels.
- Added metric setting function to the Panel editing box.
- The detail table supports sorting by start time and end time columns.
- The list page supports sorting by name, creator, and modification time.
- Optimized the ability to set icon information in the Panel editing box.
- Optimized the interaction of the new Panel creation box.
- Optimized the legend display of line charts, bar charts, and pie charts.
- Moved the modify metric button of Panels into the right slide-out panel for editing.
- Optimized the layout and style of the Panel page, and optimized the layout and style of the right slide-out panel for editing Panels.
- Split the Dashboard list into two pages: custom Dashboards and built-in Dashboards.
- Support switching the chart type of Panels.
- Optimized the search module on the Panel editing page.
#2.2 Universal Map
- Usability Enhancements
- ⭐ Optimized data display in the physical network section, enhancing the usability of "cloud and on-premises integrated monitoring".
- Support batch (multi-select client services, server services) definition of paths in the business.
- Improved the zoom in and zoom out experience of the topology graph.
- Optimized the operation experience in the topology graph editing mode, and optimized the operation experience of arranging services and service groups in the topology graph.
- Optimized the operation experience of the right slide-out panel in the universal map.
#3. Integration
#3.1 Metrics
- Metric Templates
- ⭐ Added metric template management capabilities, facilitating quick selection of metric sets on tracing, network, and Dashboard pages.
#3.2 Logs
- ⭐ Support integration with application logs collected by Datadog Vector.
#4. Operations
#4.1 Alerts
- Alert Policies
- ⭐ Enhanced granularity: added configuration capabilities for monitoring frequency and monitoring intervals.
- ⭐ Refined event types: added configuration capabilities for recovery events and information events.
- Push Endpoints
- Added Kafka push endpoint, supporting Plain type SASL authentication.
- System Alerts
- When the disk space where ClickHouse is located is insufficient, deepflow-server will perform a forced cleanup, triggering a built-in system alert to notify the user.
- Added more comprehensive metrics to the alert for collector data loss.
- Usability Enhancements
- Optimized the display of the alert policy list and alert event list.
#4.2 Reports
N/A
#5. Management
#5.1 Resources
- AutoTagging
- ⭐ Significantly improved the real-time performance of K8s tag injection. The previous code path involved 5 independent 1-minute timers, while the optimized path involves only one 10-second timer and one 1-minute timer. The worst-case delay is reduced from 5 minutes to 1 minute and 20 seconds (the agent's list/watch of K8s resources may span two 10-second cycles, so that step can take up to 20 seconds in the worst case).
- Enhanced the ability to synchronize resource information with Ping An Cloud, supporting the acquisition of CIDR for tenant Pods in Serverless clusters.
- By default, the enterprise edition disables the Agent from automatically triggering the generation of Kubernetes-type cloud platforms, simplifying the deployment steps in On-Prem mode.
- Support synchronizing custom tags of cloud servers in Alibaba Cloud and automatically injecting `cloud.tag.$key` tag fields into all observability data.
- Decoupled the synchronization of cloud resources and container resources, so that errors in the public cloud API do not affect the synchronization of container resource tags.
- Usability Improvements
- Excluded deleted resources from the resource count displayed in the knowledge graph.
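The worst-case delays quoted in the K8s tag injection note can be checked with simple arithmetic. The sketch below assumes that each timer contributes at most one full period and that the list/watch step is the one driven by the 10-second timer. That mapping is inferred from the numbers in the note, not stated by it.

```python
# Timer periods in seconds, taken from the release note.
old_timers = [60] * 5   # five independent 1-minute timers
new_timers = [10, 60]   # one 10-second timer and one 1-minute timer

# Old path: each timer may add a full period in the worst case.
old_worst = sum(old_timers)

# New path: the 10-second step may span two cycles (assumption: this is the
# list/watch step), so it is counted twice.
new_worst = 2 * new_timers[0] + new_timers[1]

print(old_worst, new_worst)  # 300 80
```

300 seconds is the 5 minutes quoted for the old path, and 80 seconds is the 1 minute and 20 seconds quoted for the new one.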
#5.2 System
- SQL API
- ⭐ Optimized the Percentile operator for Delay and BoundedGauge type metrics, reducing the number of layers in the compiled ClickHouse SQL to one.
- Modified ClickHouse table names and field names, see the table at the end (deprecated names can still be used, but will no longer be supported starting from v7.0).
- Data in the `flow_log` and `event` databases support precise search using the `_id` field.
- Simplified the query semantics of map-type fields for easier user understanding.
- Server
- ⭐ Added Kafka Exporter data export method (see documentation here), supporting the export of the following observability signals:
  - Metrics: `flow_metrics.application*` (application performance metrics/access relationships), `flow_metrics.network*` (network performance metrics/access relationships).
  - Logs: `flow_log.l4_flow_log` (network flow logs), `flow_log.l7_flow_log` (application call logs).
  - Events: `event.perf_event` (file read/write events).
- Prometheus Remote Write supports exporting metrics from `flow_metrics.application*` and `flow_metrics.network*`.
- Added a global configuration for whether the Agent requests the Server NAT IP, suitable for scenarios where all Agents request the Server through the public network.
- Added a Token management page and optimized the Token timeout determination mechanism.
- Traffic distribution strategies support export and import.
- Agent
- ⭐ Enabled the system load circuit breaker mechanism by default. When the ratio of system load to CPU cores exceeds `system_load_circuit_breaker_threshold`, the Agent triggers the circuit breaker, automatically entering a disabled state and raising an alert. Configuration details can be found in the Agent configuration sample.
- ⭐ Optimized Redis and MySQL protocol parsing performance: after optimization, an Agent with 1 CPU and 300 MB of memory can collect 50K TPS of MySQL or Redis traffic.
- Added `flow-count-limit` configuration parameter to prevent the Agent from consuming too much memory under sudden traffic, avoiding triggering the OOM Killer.
- ⭐ Improved HTTP2 Huffman decoding performance. With the Agent limited to 1 logical core, peak TPS collection performance increased by 5 to 25 times. Test data is shown in the table below.
- ⭐ Support configuring call log blacklist to reduce storage consumption, eliminate large delay metrics interference from health checks, and eliminate DNS NXDOMAIN anomaly interference.
- ⭐ Support eBPF data out-of-order reordering and segment reassembly, enhancing the success rate of application protocol parsing.
- Support collecting traffic from Open vSwitch Bond sub-interfaces and correctly aggregating them into flow logs and call logs.
- Dedicated collectors support stripping ERSPAN, TEB, and VXLAN tunnel encapsulation from mirrored traffic.
- Improved eBPF collection performance [test data to be supplemented].
- Added 6443 (default port for K8s apiserver) to the default parsed ports for the TLS protocol.
- Allowed collectors to remotely execute low-privilege debug commands.
- Deployment
- Container-type collectors support remote upgrades (see documentation here) and direct configuration of CPU and memory limits from the page (see configuration parameters here).
- Agent supports deployment via Docker Compose. See documentation here.
- Usability Improvements
- ⭐ Added `AskGPT` Copilot to DeepFlow Topo and DeepFlow Tracing Panel in Grafana: Demo1, Demo2. Currently supported models include GPT4, Tongyi Qianwen, Wenxin Yiyan, ChatGLM.
- Links in the page are now real URLs and can be opened in a new page through the right-click menu.
- Shortened the URLs of pages.
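The system load circuit breaker in the Agent notes above compares the ratio of system load to CPU cores against `system_load_circuit_breaker_threshold`. A minimal sketch of that comparison follows. The threshold value `1.0`, the use of the 1-minute load average, and the function name `should_trip` are assumptions for illustration. The notes do not say which load average the Agent reads or what the default threshold is.

```python
import os


def should_trip(load: float, cpu_cores: int, threshold: float) -> bool:
    """Return True when the load per CPU core exceeds the threshold."""
    return load / cpu_cores > threshold


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    load_1min = os.getloadavg()[0]   # assumption: 1-minute load average
    cores = os.cpu_count() or 1
    # 1.0 is a placeholder, not the Agent's documented default.
    print(should_trip(load_1min, cores, threshold=1.0))
```

Normalizing by core count keeps one threshold meaningful across hosts of different sizes: a load of 8 is heavy on 4 cores but light on 32.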
HTTP2 Collection Performance Comparison Test:
Random Header Count | Version | Agent CPU | Agent Memory | TPS |
---|---|---|---|---|
3 | OLD | 96% | 34 MB | 10K |
3 | NEW | 97% | 94 MB | 50K |
12 | OLD | 89% | 9 MB | 1.2K |
12 | NEW | 93% | 112 MB | 30K |
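The "5 to 25 times" figure in the HTTP2 Huffman decoding note can be reproduced from the TPS column of the table above. The values below are copied from the table. The variable names are only for this example.

```python
# TPS values copied from the HTTP2 comparison table,
# keyed by random header count.
old_tps = {3: 10_000, 12: 1_200}
new_tps = {3: 50_000, 12: 30_000}

# Speedup of NEW over OLD for each header count.
speedup = {n: new_tps[n] / old_tps[n] for n in old_tps}

print(speedup)  # {3: 5.0, 12: 25.0}
```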
#5.3 Account
- Multi-Tenant Support
- ⭐ Support creating multiple isolated organizations to meet the isolation needs of large enterprises with multiple subsidiaries and business units, and support joint operation of SaaS services with industry clouds.
- Support setting tenant account permissions, including four roles: owner, maintainer, user, and guest.
- Support dividing tenant accounts into teams according to the organizational structure and setting the visibility of resources within the team.
- Support Google and GitHub account SSO.
- Usability Improvements
- Added a preference settings page, allowing configuration of search box trigger methods, search box display forms, icon display, and other behaviors.
#6. Compatibility
#6.1 Incompatible Changes
- eBPF AutoProfiling
- The units for `self_value` and `total_value` returned by the API `profile/ProfileTracing` have been changed to microseconds (µs). See the documentation here, and the change history here.
- AutoTagging
- Synchronization of security group information in cloud resources is no longer supported.
- Server
- The configuration method for Prometheus Remote Write has been adjusted. See the documentation here.
- The configuration method for OpenTelemetry Exporter has been adjusted. See the documentation here.
- Agent
- The static configuration item `src-interfaces` has been merged into the dynamic configuration item `tap_interface_regex`, reducing configuration complexity in scenarios such as MACVlan, Huawei Cloud CCE Turbo, VMware, etc.
#6.2 Compatible Changes
Note: The old names listed below can still be used for now, but they will no longer be supported starting from v7.0.
Modifications to table names in the ClickHouse `flow_metrics` database:
Old Name | New Name | Data Function |
---|---|---|
vtap_app_port | application | Application performance metrics for all services |
vtap_app_edge_port | application_map | Application access relationships and their performance metrics |
vtap_flow_port | network | Network performance metrics for all services |
vtap_flow_edge_port | network_map | Network access relationships and their performance metrics |
vtap_acl | traffic_policy | Network policy metrics (Enterprise Edition only) |
Modifications to field names in the ClickHouse database:
Old Name | New Name | Data Function |
---|---|---|
vtap | agent | Agent |
vtap_id | agent_id | Agent ID |
tap_side | observation_point | Observation point |
tap | capture_network_type | Network location (Enterprise Edition only) |
tap_port | capture_nic | Capture NIC identifier |
tap_port_name | capture_nic_name | Capture NIC name |
tap_port_type | capture_nic_type | Capture NIC type |
tap_port_host | capture_nic_host | Host machine of the capture NIC (Enterprise Edition only) |
tap_port_chost | capture_nic_chost | Cloud server of the capture NIC |
tap_port_pod_node | capture_nic_pod_node | Container node of the capture NIC |
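Since the old names stop working in v7.0, saved queries may need to be rewritten. The sketch below applies the two rename tables above to a query string. It is a plain text substitution written for illustration, not a DeepFlow tool: `migrate_query` is a hypothetical name, and the approach does not parse SQL, so it would also rewrite matching words inside string literals.

```python
import re

# Old -> new table names, copied from the flow_metrics table above.
TABLE_RENAMES = {
    "vtap_app_port": "application",
    "vtap_app_edge_port": "application_map",
    "vtap_flow_port": "network",
    "vtap_flow_edge_port": "network_map",
    "vtap_acl": "traffic_policy",
}

# Old -> new field names, copied from the field table above.
FIELD_RENAMES = {
    "vtap": "agent",
    "vtap_id": "agent_id",
    "tap_side": "observation_point",
    "tap": "capture_network_type",
    "tap_port": "capture_nic",
    "tap_port_name": "capture_nic_name",
    "tap_port_type": "capture_nic_type",
    "tap_port_host": "capture_nic_host",
    "tap_port_chost": "capture_nic_chost",
    "tap_port_pod_node": "capture_nic_pod_node",
}

RENAMES = {**TABLE_RENAMES, **FIELD_RENAMES}

# \b anchors on whole identifiers ("_" is a word character), so `tap`
# does not match inside `tap_side` or `vtap_id`.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in RENAMES) + r")\b"
)


def migrate_query(sql: str) -> str:
    """Replace every deprecated table or field name with its new name."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], sql)


print(migrate_query(
    "SELECT tap_side, vtap_id FROM vtap_app_port WHERE tap_port_type = 1"
))
# SELECT observation_point, agent_id FROM application WHERE capture_nic_type = 1
```

Review the rewritten queries before use, since any unrelated column or alias that happens to share an old name would be renamed as well.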
#7. Documentation
- Added a new Agent Performance Tuning document.
- Added a deployment plan for scenarios where deepflow-agent is not allowed to request apiserver.
- Added guidance for running deepflow-agent as a non-root user.