Metrics and Operators Calculation Logic

Created：2024-06-01 Last Modified：2024-10-14

This document was translated by ChatGPT

This article describes the supported metric types and the calculation logic of each operator.

#1. Metrics

Metrics are divided into two main categories: Application Performance Metrics and Network Performance Metrics.

#1.1 Application Performance Metrics

Application metrics are used to measure the performance of services during actual operation, focusing mainly on service throughput, response delay, and anomalies. By collecting these metrics, operations personnel and developers can better understand the performance of applications in real-world usage, identify potential performance issues, and take appropriate measures for optimization and improvement.

Each metric described below records one value per statistical cycle. The cycle can be customized by the user; by default the system supports 1m (one minute) and 1s (one second). These data are collectively referred to as raw data sources in the DeepFlow platform. If multiple values are calculated within one statistical cycle, they are aggregated into a single value. The aggregation logic is described in the subsequent Types section.

#1.1.1 Throughput

Field	DisplayName	Unit	Type	Description
request	Request		counter
response	Response		counter

#1.1.2 Delay

Field	DisplayName	Unit	Type	Description
rrt	Avg Delay	us	delay
rrt_max	Max Delay	us	delay

#1.1.3 Error

Field	DisplayName	Unit	Type
error	Error		counter
client_error	Client Error		counter
server_error	Server Error		counter
timeout	Timeout		counter
error_ratio	Error %	%	percentage
client_error_ratio	Client Error %	%	percentage
server_error_ratio	Server Error %	%	percentage
timeout_ratio	Timeout %	%	percentage

#1.2 Network Performance Metrics

Network metrics are quantitative indicators used to evaluate network performance, covering the network layer, transport layer, and application layer. These metrics include throughput, delay, performance, and anomaly types.

#1.2.1 L3 Throughput

Field	DisplayName	Unit	Type
byte	Byte	Byte	counter
byte_tx	Byte TX	Byte	counter
byte_rx	Byte RX	Byte	counter
packet	Packet	Packet	counter
packet_tx	Packet TX	Packet	counter
packet_rx	Packet RX	Packet	counter
l3_byte	L3 Payload	Byte	counter
l3_byte_tx	L3 Payload TX	Byte	counter
l3_byte_rx	L3 Payload RX	Byte	counter
bpp	Bytes per Packet	Byte	quotient
bpp_tx	Bytes per Packet TX	Byte	quotient
bpp_rx	Bytes per Packet RX	Byte	quotient

#1.2.2 L4 Throughput

Field	DisplayName	Unit	Type
new_flow	New Flow	Flow	counter
closed_flow	Closed Flow	Flow	counter
flow_load	Active Flow	Flow	gauge
syn_count	SYN Packet	Packet	counter
synack_count	SYN-ACK Packet	Packet	counter
l4_byte	L4 Payload	Byte	counter
l4_byte_tx	L4 Payload TX	Byte	counter
l4_byte_rx	L4 Payload RX	Byte	counter

Active connection calculation logic:

The collector counts the raw active connections based on the quadruple (client IP, server IP, protocol, server port) and then calculates the active connections corresponding to resources and paths.
If traffic is collected within the time interval corresponding to the data source, active connections are counted, but there are some special cases:
- 1s data source: Describes the active connections counted per second.
  - The first second of each minute: Includes connections that have no traffic within that second but have not ended, generally used to evaluate concurrent connections (multiple non-overlapping connections with a duration of less than one second may introduce some errors).
  - The last 59 seconds of each minute: If multiple flows with the same quadruple have no traffic within that second, the connections corresponding to that quadruple will be ignored for that second, generally used to evaluate the lower bound of concurrent connections.
- 1m data source: Describes the active connections counted per minute.
  - Includes connections that have no traffic but have not ended, generally used to evaluate the upper bound of concurrent connections.
- Custom data source: Calculated based on 1s/1m data sources using Avg/Max/Min, with the same meaning as directly using 1s/1m data sources and selecting Avg/Max/Min operators.
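
The difference between the data sources can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the collector's implementation: the `Flow` record and its fields are hypothetical, each flow is counted directly instead of being grouped by quadruple first, and a flow is treated as "not ended" while the current time lies between its start and end.

```python
from dataclasses import dataclass, field


@dataclass
class Flow:
    """Hypothetical flow record; field names are illustrative only."""
    start: int  # first second of the flow (epoch seconds)
    end: int    # last second of the flow (epoch seconds)
    traffic_seconds: set = field(default_factory=set)  # seconds that carried packets


def active_1s(flows, second):
    """Active connections for one second of the 1s data source."""
    if second % 60 == 0:
        # First second of a minute: also count open flows without traffic.
        return sum(1 for f in flows if f.start <= second <= f.end)
    # Remaining 59 seconds: only flows that carried traffic in this second.
    return sum(1 for f in flows if second in f.traffic_seconds)


def active_1m(flows, minute_start):
    """Active connections for one minute of the 1m data source."""
    minute_end = minute_start + 59
    # Any flow overlapping the minute counts, with or without traffic.
    return sum(1 for f in flows if f.start <= minute_end and f.end >= minute_start)


flows = [
    Flow(start=0, end=130, traffic_seconds={0, 1, 61}),
    Flow(start=10, end=70, traffic_seconds={10, 65}),
]
```

With this sample data, second 60 reports both open flows, while second 62 reports none because neither flow carried traffic in that second.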

#1.2.3 TCP Slow

Field	DisplayName	Unit	Type
retrans_syn	SYN Retransmission	Packet	counter
retrans_synack	SYN-ACK Retransmission	Packet	counter
retrans	TCP Retransmission	Packet	counter
retrans_tx	TCP Client Retransmission	Packet	counter
retrans_rx	TCP Server Retransmission	Packet	counter
zero_win	TCP ZeroWindow	Packet	counter
zero_win_tx	TCP Client ZeroWindow	Packet	counter
zero_win_rx	TCP Server ZeroWindow	Packet	counter
retrans_syn_ratio	SYN Retrans. %	%	percentage
retrans_synack_ratio	SYN-ACK Retrans. %	%	percentage
retrans_ratio	TCP Retrans. %	%	percentage
retrans_tx_ratio	TCP Client Retrans. %	%	percentage
retrans_rx_ratio	TCP Server Retrans. %	%	percentage
zero_win_ratio	TCP ZeroWindow %	%	percentage
zero_win_tx_ratio	TCP Client ZeroWindow %	%	percentage
zero_win_rx_ratio	TCP Server ZeroWindow %	%	percentage

#1.2.4 TCP Error

Field	DisplayName	Unit	Type	Description
tcp_establish_fail	Error	Flow	counter
client_establish_fail	Client Error	Flow	counter
server_establish_fail	Server Error	Flow	counter
tcp_establish_fail_ratio	Error %	%	percentage
client_establish_fail_ratio	Client Error %	%	percentage
server_establish_fail_ratio	Server Error %	%	percentage
tcp_transfer_fail	Transfer Error	Flow	counter	All transfer errors.
tcp_transfer_fail_ratio	Transfer Error %	%	percentage
tcp_rst_fail	RST	Flow	counter	All RST errors.
tcp_rst_fail_ratio	RST %	%	percentage
client_source_port_reuse	Est. - Client Port Reuse	Flow	counter
server_syn_miss	Est. - Server SYN Miss	Flow	counter
client_establish_other_rst	Est. - Client Other RST	Flow	counter
client_ack_miss	Est. - Client ACK Miss	Flow	counter
server_reset	Est. - Server Direct RST	Flow	counter
server_establish_other_rst	Est. - Server Other RST	Flow	counter
client_rst_flow	Transfer - Client RST	Flow	counter
server_rst_flow	Transfer - Server RST	Flow	counter
server_queue_lack	Transfer - Server Queue Overflow	Flow	counter
tcp_timeout	Transfer - TCP Timeout	Flow	counter
client_half_close_flow	Close - Client Half Close	Flow	counter
server_half_close_flow	Close - Server Half Close	Flow	counter

#1.2.4.1 TCP Connection Errors

TCP connection establishment errors

#1.2.4.2 TCP Transmission Errors

TCP transmission errors

#1.2.5 Delay

Field	DisplayName	Unit	Type
rtt	Avg TCP Est. Delay	us	delay
rtt_client	Avg TCP Est. Client Delay	us	delay
rtt_server	Avg TCP Est. Server Delay	us	delay
srt	Avg TCP/ICMP Response Delay	us	delay
art	Avg Data Delay	us	delay
cit	Avg Client Idle Delay	us	delay
rtt_max	Max TCP Est. Delay	us	delay
rtt_client_max	Max TCP Est. Client Delay	us	delay
rtt_server_max	Max TCP Est. Server Delay	us	delay
srt_max	Max TCP/ICMP Response Delay	us	delay
art_max	Max Data Delay	us	delay
cit_max	Max Client Idle Delay	us	delay

Anatomy of TCP network delay

Delay generated during connection establishment
- [1] The complete connection establishment delay includes the entire time from the client sending the SYN packet to receiving the SYN+ACK packet from the server and then replying with an ACK packet. The connection establishment delay can be further divided into client connection establishment delay and server connection establishment delay.
- [2] Client connection establishment delay is the time taken for the client to reply with an ACK packet after receiving the SYN+ACK packet.
- [3] Server connection establishment delay is the time taken for the server to reply with a SYN+ACK packet after receiving the SYN packet.
Delay generated during data communication can be divided into client waiting delay + data transmission delay.
- [4] Client waiting delay is the time taken for the client to send the first request after the connection is successfully established; it is also the time taken for the client to send a data packet after receiving a data packet from the server.
- [5] Data transmission delay is the time taken for the client to send a data packet and receive a reply data packet from the server.
- [6] Within the data transmission delay there is also a delay generated by the system protocol stack, called system delay, which is the time between a data packet being seen and the ACK packet for it being seen.
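
The connection establishment delays [1] to [3] can be derived from three packet timestamps. The sketch below is illustrative only: the function name is hypothetical, timestamps are assumed to be in microseconds at the capture point, and the mapping of the results to the `rtt`, `rtt_client`, and `rtt_server` fields is an assumption based on the table above.

```python
def handshake_delays(t_syn, t_synack, t_ack):
    """Split the TCP handshake into server-side and client-side delay.

    All arguments are capture timestamps in microseconds.
    """
    server = t_synack - t_syn   # [3] server replies to SYN with SYN+ACK
    client = t_ack - t_synack   # [2] client replies to SYN+ACK with ACK
    return {
        "rtt": server + client,  # [1] complete establishment delay
        "rtt_server": server,
        "rtt_client": client,
    }


delays = handshake_delays(t_syn=1_000, t_synack=1_400, t_ack=1_450)
```

Here the server side accounts for 400 us and the client side for 50 us of the 450 us establishment delay.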

#1.2.6 Application

Field	DisplayName	Unit	Type
l7_request	Request		counter
l7_response	Response		counter
rrt	Avg App. Delay	us	delay
rrt_max	Max App. Delay	us	delay
l7_error	App. Error		counter
l7_client_error	App. Client Error		counter
l7_server_error	App. Server Error		counter
l7_timeout	App. Server Timeout		counter
l7_error_ratio	App. Error %	%	percentage
l7_client_error_ratio	App. Client Error %	%	percentage
l7_server_error_ratio	App. Server Error %	%	percentage

#1.2.7 Cardinality

During the statistical cycle, the number of unique tags collected is counted. For example, querying the client IP address (ip_0) metric for all accesses to pod_1 means counting the number of unique client IP addresses in all traffic accessing pod_1.
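
A minimal sketch of the pod_1 example, assuming each traffic record is a dict with a client IP under `ip_0` and a hypothetical `server_pod` key naming the accessed pod:

```python
def unique_client_ips(records, server_pod):
    """Count distinct client IPs (ip_0) among records that access server_pod."""
    return len({r["ip_0"] for r in records if r["server_pod"] == server_pod})


records = [
    {"ip_0": "10.0.0.1", "server_pod": "pod_1"},
    {"ip_0": "10.0.0.2", "server_pod": "pod_1"},
    {"ip_0": "10.0.0.1", "server_pod": "pod_1"},  # repeated client, counted once
    {"ip_0": "10.0.0.3", "server_pod": "pod_2"},  # different pod, not counted
]
```

For these records, `unique_client_ips(records, "pod_1")` counts two distinct clients.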

Field	DisplayName	Unit	Type	Description

#2. Operators

Operators calculate data from raw data sources based on the selected time range and interval. For example, using a line chart to view 1s raw data sources for the last 5 minutes with a 20s interval, a point on the line chart (14:43:00) would read all data within the time range of 14:42:40 - 14:43:00 and then calculate the average value.
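
As a rough illustration of this windowing, the sketch below averages 1s samples for one chart point. The function name is hypothetical, and the half-open window `[point - interval, point)` is an assumption, since the text does not say whether the boundaries are inclusive.

```python
def avg_point(samples, point_ts, interval):
    """Average all samples in the window that ends at point_ts.

    samples maps a timestamp in seconds to a metric value (1s raw data).
    Returns None when the window contains no samples.
    """
    window = [v for ts, v in samples.items()
              if point_ts - interval <= ts < point_ts]
    return sum(window) / len(window) if window else None


# Five minutes of 1s data whose value equals its timestamp.
samples = {ts: float(ts) for ts in range(300)}
point = avg_point(samples, point_ts=60, interval=20)  # reads seconds 40..59
```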

Operators can be nested, but an aggregate operator cannot be nested inside another aggregate operator. For example, PerSecond(Avg(byte)) means calculating Avg(byte) first and then applying PerSecond to the resulting value.

#2.1 Aggregate Operators

Operator	English Name	Applicable Metric Types	Description
Avg	Average	All types	Average value (does not ignore zero values for Counter/Gauge metrics)
AAvg	Arithmetic Average	All types	Arithmetic average (first calculate the average at each time point, then calculate the average of the averages)
Sum	Sum	Counter type	Sum
Max	Maximum	All types	Maximum value
Min	Minimum	All types	Minimum value
Percentile	Estimated Percentile	All types	Estimated percentile
PercentileExact	Exact Percentile	All types	Exact percentile
Spread	Spread	All types	Absolute spread, Max minus Min within the statistical cycle
Rspread	Relative Spread	All types	Relative spread, Max divided by Min within the statistical cycle
Stddev	Standard Deviation	All types	Standard deviation
Apdex	Application Performance Index	Delay type	Delay satisfaction
Last	Last	All types	Latest value
Uniq	Estimated Uniq	Cardinality type	Estimated cardinality
UniqExact	Exact Uniq	Cardinality type	Exact cardinality
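
Spread and Rspread follow directly from their descriptions. In this sketch the function names are hypothetical, and returning None when Min is 0 is an assumption, because the table does not say how a zero minimum is handled.

```python
def spread(values):
    """Absolute spread: Max minus Min within the statistical cycle."""
    return max(values) - min(values)


def rspread(values):
    """Relative spread: Max divided by Min within the statistical cycle."""
    lowest = min(values)
    if lowest == 0:
        return None  # assumption: undefined when Min is 0
    return max(values) / lowest


cycle = [2, 8, 4]
```

For `cycle`, the absolute spread is 6 and the relative spread is 4.0.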

#2.2 Secondary Operators

Operator	Description
PerSecond	Calculate rate, divide the result of the inner operator by the time interval [1]
Math	Arithmetic operations, supports +, -, *, /
Percentage	Unit conversion %

[1] For example: PerSecond(Sum) means calculating the sum first, then dividing by the time interval `interval` passed by the API; PerSecond(Avg) means calculating the average first, then dividing by the data source time interval `data_precision`.
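
The two divisors in note [1] can be compared with a small sketch. The function names and sample values are hypothetical; `interval` and `data_precision` are both in seconds.

```python
def per_second_sum(values, interval):
    """PerSecond(Sum): sum first, then divide by the query interval."""
    return sum(values) / interval


def per_second_avg(values, data_precision):
    """PerSecond(Avg): average first, then divide by the data source precision."""
    return (sum(values) / len(values)) / data_precision


# Five 1m points (data_precision = 60s) covering one 300s interval.
byte_counts = [600, 1200, 600, 1200, 900]
rate_from_sum = per_second_sum(byte_counts, interval=300)
rate_from_avg = per_second_avg(byte_counts, data_precision=60)
```

When every point in the interval is present, as here, both forms yield the same rate, because `interval` equals the number of points times `data_precision`.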

#3. Calculation Logic of Different Metrics' Operators

#3.1 Counter/Gauge Metrics

flow_metrics data table
- Sum operator
  - Calculate the Sum of all data within the query time range
- Avg operator
  - Calculate the Sum of all data within the query time range and divide by interval/data_precision
- Other operators
  - First use Sum to aggregate based on data_precision
  - Then call the ClickHouse function for the selected specific operator
- When forced (due to the need for other metrics in the same statement) to use two layers of SQL calculations
  - Sum/Avg operator
    - First use Sum to aggregate based on data_precision
    - Then call the ClickHouse function for the selected specific operator
flow_log data table
- Call the ClickHouse function for the selected specific operator
prometheus/ext_metrics/deepflow_system data table
- Same as flow_metrics data table
Additional notes
- The Min operator fills in 0 for time points that have no data or whose data is null
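
The flow_metrics rules above can be sketched in plain Python. This only mirrors the arithmetic, not the SQL that is actually generated; the function names are hypothetical, and each row is assumed to be a `(timestamp, value)` pair, with several rows per timestamp when multiple tag combinations match.

```python
from collections import defaultdict


def sum_per_point(rows):
    """First layer: Sum the rows that share one data_precision time point."""
    per_point = defaultdict(int)
    for ts, value in rows:
        per_point[ts] += value
    return per_point


def counter_avg(rows, interval, data_precision):
    """Avg: Sum of all data divided by the number of expected time points."""
    return sum(v for _, v in rows) / (interval / data_precision)


def counter_max(rows):
    """Other operators: aggregate with Sum first, then apply the operator."""
    return max(sum_per_point(rows).values())


def counter_min(rows, start, interval, data_precision):
    """Min: time points with no data are filled with 0."""
    per_point = sum_per_point(rows)
    points = range(start, start + interval, data_precision)
    return min(per_point.get(ts, 0) for ts in points)


# 1s data over a 5s interval; seconds 2 and 4 have no data.
rows = [(0, 10), (0, 15), (1, 20), (3, 15)]
```

With these rows, Avg is 60 / 5 = 12.0 because empty time points still count in the divisor, Max is 25 (the summed value at second 0, not the largest single row), and Min is 0 because of the zero fill.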

#3.2 Quotient/Percentage Metrics

flow_metrics data table
- Avg operator
  - Calculate Sum(x)/Sum(y) for all data within the query time range
- Other operators
  - First use Sum(x)/Sum(y) to aggregate based on data_precision
  - Then call the ClickHouse function for the selected specific operator
- When forced (due to the need for other metrics in the same statement) to use two layers of SQL calculations
  - Avg operator
    - First use Sum(x)/Sum(y) to aggregate based on data_precision
    - Then call the ClickHouse function for the selected specific operator
flow_log data table
- Call the ClickHouse function func(x/y) for the selected specific operator
Additional notes
- The Min operator for Percentage metrics fills 0 for time points with no data
- When calculating Sum(x)/Sum(y), points with a denominator of 0/null or a numerator of null are ignored
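
The Avg rule for Quotient/Percentage metrics is a ratio of sums, not an average of per-point ratios. A minimal sketch, assuming each point is a hypothetical `(x, y)` pair of numerator and denominator, with None standing in for null:

```python
def quotient_avg(points):
    """Avg for Quotient/Percentage metrics: Sum(x) / Sum(y).

    Points with a denominator of 0/None or a numerator of None are ignored.
    Returns None when no valid point remains (an assumption).
    """
    valid = [(x, y) for x, y in points if x is not None and y]
    if not valid:
        return None
    return sum(x for x, _ in valid) / sum(y for _, y in valid)


points = [(1, 10), (30, 100), (5, 0), (None, 50)]
ratio = quotient_avg(points)  # (1 + 30) / (10 + 100)
```

Only the first two points are valid here, so the result is 31/110 (about 0.28), whereas averaging the two per-point ratios 0.1 and 0.3 would give 0.2.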

#3.3 Delay/BoundedGauge Metrics

flow_metrics data table
- Call the ClickHouse function for the selected specific operator
- When forced (due to the need for other metrics in the same statement) to use two layers of SQL calculations
  - Avg/Min/Max operator
    - Both layers call the ClickHouse function for the selected specific operator
  - Spread/Rspread operator
    - First use Max and Min to aggregate based on data_precision
    - Then call the ClickHouse function for the selected specific operator
  - Other operators
    - First use groupArray to aggregate
    - Then call the ClickHouse function for the selected specific operator
flow_log data table
- Call the ClickHouse function for the selected specific operator
Additional notes
- The Min operator for BoundedGauge metrics fills in 0 for time points that have no data or whose data is null
- Delay metrics ignore points with a value of 0, considering 0 as a meaningless delay value
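
The effect of ignoring zero-valued delay points can be sketched as follows. The function name is hypothetical, and returning None when no non-zero point exists is an assumption.

```python
def delay_avg(values):
    """Avg for Delay metrics: points with a value of 0 are ignored."""
    nonzero = [v for v in values if v]
    return sum(nonzero) / len(nonzero) if nonzero else None


rrt_points = [0, 200, 0, 400]  # microseconds
```

For `rrt_points` the average is 300.0 us; including the zeros would give 150.0 us and understate the delay.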

#3.4 data_precision of Different Databases/Tables

Database	data_precision	Remarks
flow_metrics	1s/1m	Supports 1s and 1m by default, can be aggregated to 1h and 1d
flow_log	1s	No actual concept of `data_precision`, the value is for convenience in calculation
application_log	1s	No actual concept of `data_precision`, the value is for convenience in calculation
prometheus	10s	Can be modified through the `data_source_prometheus_interval` field in `server.yaml`
ext_metrics	10s	Can be modified through the `data_source_ext_metrics_interval` field in `server.yaml`
deepflow_admin	10s
deepflow_tenant	10s
event	1s	No actual concept of `data_precision`, the value is for convenience in calculation
profile	1s	No actual concept of `data_precision`, the value is for convenience in calculation