Calculation Logic for Metrics and Operators
This document was translated by ChatGPT
This document introduces the calculation logic for different types of metrics and operators.
#1. Metrics
Metrics are divided into two main categories: application performance metrics and network performance metrics.
#1.1 Application Performance Metrics
Application metrics are used to measure the performance of services during actual operation, focusing mainly on throughput, response latency, and exceptions. By collecting these metrics, operations and development teams can better understand how applications perform in real-world usage, identify potential performance issues, and take appropriate measures for optimization and improvement.
The metrics described below record one metric value for each statistical cycle. The statistical cycle can be customized by the user. The system currently supports 1m (one minute) and 1s (one second) by default (these data are collectively referred to as raw data sources in the DeepFlow platform). If multiple metric values are calculated within a single statistical cycle, they will be aggregated into one metric value. The aggregation logic is described later in the Type section.
#1.1.1 Throughput
| Field | DisplayName | Unit | Type | Description |
|---|---|---|---|---|
| request | Request | | counter | |
| response | Response | | counter | |
| response_ratio | Response % | % | percentage | |
| success_ratio | Success % | % | percentage | |
#1.1.2 Delay
| Field | DisplayName | Unit | Type | Description |
|---|---|---|---|---|
| rrt | Avg Delay | us | delay | |
| rrt_max | Max Delay | us | delay | |
#1.1.3 Error
| Field | DisplayName | Unit | Type | Description |
|---|---|---|---|---|
| error | Error | | counter | |
| client_error | Client Error | | counter | |
| server_error | Server Error | | counter | |
| timeout | Timeout | | counter | |
| error_ratio | Error % | % | percentage | |
| client_error_ratio | Client Error % | % | percentage | |
| server_error_ratio | Server Error % | % | percentage | |
| timeout_ratio | Timeout % | % | percentage | |
#1.2 Network Performance Metrics
Network metrics are quantitative indicators used to evaluate network performance, covering the network layer, transport layer, and application layer. These metrics include throughput, latency, performance, and exception types.
#1.2.1 L3 Throughput
| Field | DisplayName | Unit | Type | Description |
|---|---|---|---|---|
| byte | Byte | Byte | counter | |
| byte_tx | Byte TX | Byte | counter | |
| byte_rx | Byte RX | Byte | counter | |
| packet | Packet | Packet | counter | |
| packet_tx | Packet TX | Packet | counter | |
| packet_rx | Packet RX | Packet | counter | |
| l3_byte | L3 Payload | Byte | counter | |
| l3_byte_tx | L3 Payload TX | Byte | counter | |
| l3_byte_rx | L3 Payload RX | Byte | counter | |
| bpp | Bytes per Packet | Byte | quotient | |
| bpp_tx | Bytes per Packet TX | Byte | quotient | |
| bpp_rx | Bytes per Packet RX | Byte | quotient | |
#1.2.2 L4 Throughput
| Field | DisplayName | Unit | Type | Description |
|---|---|---|---|---|
| new_flow | New Flow | Flow | counter | |
| closed_flow | Closed Flow | Flow | counter | |
| flow_load | Active Flow | Flow | gauge | |
| syn_count | SYN Packet | Packet | counter | |
| synack_count | SYN-ACK Packet | Packet | counter | |
| l4_byte | L4 Payload | Byte | counter | |
| l4_byte_tx | L4 Payload TX | Byte | counter | |
| l4_byte_rx | L4 Payload RX | Byte | counter | |
Active connection calculation logic:
- The collector counts the original number of active connections based on the quadruple (client IP, server IP, protocol, server port), and then calculates the active connections corresponding to resources and paths.
- If traffic is captured within the time interval of the data source, active connections are counted, but there are some special cases:
  - 1s data source: describes the number of active connections counted per second
    - First second of each minute: includes connections without traffic but not yet closed during that second, generally used to estimate concurrent connections (multiple non-overlapping connections lasting less than one second may cause some errors)
    - Remaining 59 seconds of each minute: if multiple flows with the same quadruple have no traffic in that second, the connection count for that quadruple is ignored for that second, generally used to estimate the lower bound of concurrent connections
  - 1m data source: describes the number of active connections counted per minute
    - Includes connections without traffic but not yet closed, generally used to estimate the upper bound of concurrent connections
  - Custom data source: calculated from 1s/1m data sources using Avg/Max/Min, with the same meaning as directly using the 1s/1m data source and selecting the Avg/Max/Min operator
#1.2.3 TCP Slow
| Field | DisplayName | Unit | Type | Description |
|---|---|---|---|---|
| retrans_syn | SYN Retransmission | Packet | counter | |
| retrans_synack | SYN-ACK Retransmission | Packet | counter | |
| retrans | TCP Retransmission | Packet | counter | |
| retrans_tx | TCP Client Retransmission | Packet | counter | |
| retrans_rx | TCP Server Retransmission | Packet | counter | |
| zero_win | TCP ZeroWindow | Packet | counter | |
| zero_win_tx | TCP Client ZeroWindow | Packet | counter | |
| zero_win_rx | TCP Server ZeroWindow | Packet | counter | |
| retrans_syn_ratio | SYN Retrans. % | % | percentage | |
| retrans_synack_ratio | SYN-ACK Retrans. % | % | percentage | |
| retrans_ratio | TCP Retrans. % | % | percentage | |
| retrans_tx_ratio | TCP Client Retrans. % | % | percentage | |
| retrans_rx_ratio | TCP Server Retrans. % | % | percentage | |
| zero_win_ratio | TCP ZeroWindow % | % | percentage | |
| zero_win_tx_ratio | TCP Client ZeroWindow % | % | percentage | |
| zero_win_rx_ratio | TCP Server ZeroWindow % | % | percentage | |
#1.2.4 TCP Error
| Field | DisplayName | Unit | Type | Description |
|---|---|---|---|---|
| tcp_establish_fail | Error | Flow | counter | |
| client_establish_fail | Client Error | Flow | counter | |
| server_establish_fail | Server Error | Flow | counter | |
| tcp_establish_fail_ratio | Error % | % | percentage | |
| client_establish_fail_ratio | Client Error % | % | percentage | |
| server_establish_fail_ratio | Server Error % | % | percentage | |
| tcp_transfer_fail | Transfer Error | Flow | counter | All transfer errors. |
| tcp_transfer_fail_ratio | Transfer Error % | % | percentage | |
| tcp_rst_fail | RST | Flow | counter | All RST errors. |
| tcp_rst_fail_ratio | RST % | % | percentage | |
| client_source_port_reuse | Est. - Client Port Reuse | Flow | counter | |
| server_syn_miss | Est. - Server SYN Miss | Flow | counter | |
| client_establish_other_rst | Est. - Client Other RST | Flow | counter | |
| client_ack_miss | Est. - Client ACK Miss | Flow | counter | |
| server_reset | Est. - Server Direct RST | Flow | counter | |
| server_establish_other_rst | Est. - Server Other RST | Flow | counter | |
| client_rst_flow | Transfer - Client RST | Flow | counter | |
| server_rst_flow | Transfer - Server RST | Flow | counter | |
| server_queue_lack | Transfer - Server Queue Overflow | Flow | counter | |
| tcp_timeout | Transfer - TCP Timeout | Flow | counter | |
| client_half_close_flow | Close - Client Half Close | Flow | counter | |
| server_half_close_flow | Close - Server Half Close | Flow | counter | |
#1.2.4.1 TCP Client Connection Exceptions
- Client port reuse
  - Phenomenon: The server receives SYN but does not reply with SYN-ACK, causing TCP connection failure
  - Cause: Client source port conflicts with an already established TCP connection
  - Recommendation:
    - Check client TCP connection timeout parameters
    - If there is a NAT device, check NAT rules
- Client ACK missing
  - Phenomenon: The server replies with SYN-ACK, but the client does not respond, causing TCP connection failure
  - Cause:
    - Client SYN Flood attack
    - Client port scanning
  - Recommendation: Confirm whether it is a security incident and block the abnormal client in time
- Other client resets
  - Phenomenon: The client sends SYN and then immediately sends RST, causing TCP connection failure
  - Cause:
    - Client application exception
    - Malicious client attack
  - Recommendation:
    - Check client application status
    - Check whether the client has general attack behavior

TCP Client Connection Exceptions
#1.2.4.2 TCP Server Connection Exceptions
- Server direct reset
  - Phenomenon: The server receives SYN and replies with RST, rejecting TCP connection
  - Cause:
    - Server port not open or not listening
    - Server application not ready
    - Client port scanning
  - Recommendation:
    - Check server port connectivity
    - Check whether the client is performing port scanning
- Server SYN missing
  - Phenomenon: The client sends SYN multiple times, but the server does not respond
  - Cause:
    - Firewall not allowing the port
    - Route unreachable
  - Recommendation:
    - Check firewall policy
    - Check network connectivity
- Other server resets
  - Phenomenon: The server sends SYN-ACK and then immediately sends RST, causing TCP connection failure
  - Cause: Server operating system exception
  - Recommendation: Check server operating system logs

TCP Server Connection Exceptions
#1.2.4.3 TCP Transmission Exceptions
- Server queue overflow
  - Phenomenon: During TCP data transmission, the server sends SYN-ACK
  - Cause: Server Accept queue overflow
  - Recommendation:
    - Adjust kernel somaxconn parameter
    - Adjust kernel tcp_max_syn_backlog parameter
- Client reset
  - Phenomenon: During TCP data transmission, the client sends RST to close the TCP connection
  - Cause:
    - Client application exception
    - Client operating system exception
  - Recommendation:
    - Check client application status
    - Check client operating system logs
- Server reset
  - Phenomenon: During TCP data transmission, the server sends RST to close the TCP connection
  - Cause:
    - Server application exception
    - Server operating system exception
  - Recommendation:
    - Check server application status
    - Check server operating system logs
- TCP connection timeout
  - Phenomenon: No data for more than 300 seconds during transmission
  - Cause:
    - Client host offline
    - Client application exception
  - Recommendation:
    - Check client host status
    - Check client application status

TCP Transmission Exceptions
#1.2.4.4 TCP Disconnection Exceptions
- Server half-close
  - Phenomenon: The server receives FIN but does not reply with FIN-ACK, resulting in incomplete TCP four-way handshake
  - Cause: Server application exception
  - Recommendation: Check server application status
- Client half-close
  - Phenomenon: The client receives FIN but does not reply with FIN-ACK, resulting in incomplete TCP four-way handshake
  - Cause: Client application exception
  - Recommendation: Check client application status

TCP Disconnection Exceptions
#1.2.5 Transport Layer Delay
| Field | DisplayName | Unit | Type | Description |
|---|---|---|---|---|
| rtt | Avg TCP Est. Delay | us | delay | |
| rtt_client | Avg TCP Est. Client Delay | us | delay | |
| rtt_server | Avg TCP Est. Server Delay | us | delay | |
| srt | Avg TCP/ICMP Response Delay | us | delay | |
| art | Avg Data Delay | us | delay | |
| cit | Avg Client Idle Delay | us | delay | |
| rtt_max | Max TCP Est. Delay | us | delay | |
| rtt_client_max | Max TCP Est. Client Delay | us | delay | |
| rtt_server_max | Max TCP Est. Server Delay | us | delay | |
| srt_max | Max TCP/ICMP Response Delay | us | delay | |
| art_max | Max Data Delay | us | delay | |
| cit_max | Max Client Idle Delay | us | delay | |

TCP Network Delay Analysis
- Delay during connection establishment
  - [1] Complete `connection establishment delay` includes the entire time from when the client sends a SYN packet to receiving the server's SYN+ACK packet and replying with an ACK packet. This can be further divided into `client connection delay` and `server connection delay`
  - [2] `Client connection delay` is the time from when the client receives the SYN+ACK packet to when it replies with an ACK packet
  - [3] `Server connection delay` is the time from when the server receives the SYN packet to when it replies with a SYN+ACK packet
- Delay during data communication, which can be divided into `client wait delay` + `data transmission delay`
  - [4] `Client wait delay` is the time from successful connection establishment to when the client sends the first request; or the time from receiving a data packet from the server to when the client sends another data packet
  - [5] `Data transmission delay` is the time from when the client sends a data packet to when it receives the server's reply
  - [6] Within data transmission delay, there is also processing delay in the system protocol stack, called `system delay`, which is the time from receiving a data packet to receiving the ACK packet
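
The connection establishment delays [1] to [3] can be illustrated with a minimal Python sketch. It assumes the three handshake packets are timestamped in microseconds at a single capture point, so the observed gaps only approximate the per-host delays. The function name `handshake_delays` is hypothetical, and the result keys simply reuse the field names `rtt`, `rtt_client`, and `rtt_server` from the table above.

```python
def handshake_delays(syn_ts, synack_ts, ack_ts):
    """Split the TCP handshake into server, client, and complete delay.

    All timestamps are in microseconds, taken at one capture point
    (an assumption of this sketch).
    """
    server = synack_ts - syn_ts   # [3] SYN seen -> SYN+ACK seen
    client = ack_ts - synack_ts   # [2] SYN+ACK seen -> ACK seen
    return {
        "rtt_server": server,
        "rtt_client": client,
        "rtt": server + client,   # [1] complete establishment delay
    }


delays = handshake_delays(syn_ts=1_000_000, synack_ts=1_000_400, ack_ts=1_000_550)
print(delays)
```

With these sample timestamps the server side accounts for 400 us and the client side for 150 us of a 550 us establishment delay.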
#1.2.6 Application Layer Metrics
| Field | DisplayName | Unit | Type | Description |
|---|---|---|---|---|
| l7_request | Request | | counter | |
| l7_response | Response | | counter | |
| rrt | Avg App. Delay | us | delay | |
| rrt_max | Max App. Delay | us | delay | |
| l7_error | App. Error | | counter | |
| l7_client_error | App. Client Error | | counter | |
| l7_server_error | App. Server Error | | counter | |
| l7_timeout | App. Server Timeout | | counter | |
| l7_error_ratio | App. Error % | % | percentage | |
| l7_client_error_ratio | App. Client Error % | % | percentage | |
| l7_server_error_ratio | App. Server Error % | % | percentage | |
#1.2.7 Cardinality
Within the statistical cycle, count the number of unique tags in the collected data. For example, querying the metric client IP address (ip_0) for all clients accessing pod_1 means counting how many unique client IP addresses appear in all traffic accessing pod_1.
| Field | DisplayName | Unit | Type | Description |
|---|---|---|---|---|
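
As a rough illustration of cardinality, the sketch below counts distinct client IP addresses among flows that reach one pod. The record layout, the `server_pod` key, and the function name `uniq_exact` are hypothetical and chosen only for this example; `ip_0` is the client IP tag mentioned above.

```python
def uniq_exact(flows, tag, **filters):
    """Count distinct values of `tag` among flows matching all filters."""
    return len({
        flow[tag]
        for flow in flows
        if all(flow.get(key) == value for key, value in filters.items())
    })


flows = [
    {"ip_0": "10.0.0.1", "server_pod": "pod_1"},
    {"ip_0": "10.0.0.2", "server_pod": "pod_1"},
    {"ip_0": "10.0.0.1", "server_pod": "pod_1"},  # repeated client
    {"ip_0": "10.0.0.3", "server_pod": "pod_2"},  # different pod
]
print(uniq_exact(flows, "ip_0", server_pod="pod_1"))
```

Two distinct clients access pod_1 here, even though three flows were recorded for it.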
#2. Operators
Operators calculate data from raw data sources based on the selected time range and interval. For example, when using a line chart to view 1s raw data for the last 5 minutes with a 20s interval and Avg operator, a point at 14:43:00 reads all data from 14:42:40 to 14:43:00 in the raw data source and then calculates the average.
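
The window selection in this example can be sketched in Python. The sketch assumes the window is half-open, `[point - interval, point)`, which the text above does not specify, and it averages only the samples present; `window_avg` is a hypothetical helper, not a platform API.

```python
from datetime import datetime, timedelta


def window_avg(samples, point_time, interval_s):
    """Average raw samples in [point_time - interval_s, point_time)."""
    start = point_time - timedelta(seconds=interval_s)
    values = [value for ts, value in samples if start <= ts < point_time]
    return sum(values) / len(values) if values else None


# Twenty 1s samples from 14:42:40 to 14:42:59 with values 0..19.
base = datetime(2024, 1, 1, 14, 42, 40)
samples = [(base + timedelta(seconds=i), float(i)) for i in range(20)]

point = datetime(2024, 1, 1, 14, 43, 0)
print(window_avg(samples, point, 20))
```

For a 20s interval over a 1s data source, each plotted point is therefore derived from at most 20 raw values.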
Operators can be nested, but aggregation operators cannot be nested inside one another. For example, PerSecond(Avg(byte)) means first calculating Avg(byte), then applying PerSecond to the result.
#2.1 Aggregation Operators
| Operator | English Name | Applicable Metric Type | Description |
|---|---|---|---|
| Avg | Average | All types | Average value (does not ignore zero values for Counter/Gauge metrics) |
| AAvg | Arithmetic Average | All types | Arithmetic average (average of averages at each time point) |
| Sum | Sum | Counter type | Sum |
| Max | Maximum | All types | Maximum value |
| Min | Minimum | All types | Minimum value |
| Percentile | Estimated Percentile | All types | Estimated percentile |
| PercentileExact | Exact Percentile | All types | Exact percentile |
| Spread | Spread | All types | Absolute spread: Max minus Min within the statistical cycle |
| Rspread | Relative Spread | All types | Relative spread: Max divided by Min within the statistical cycle |
| Stddev | Standard Deviation | All types | Standard deviation |
| Apdex | Application Performance Index | Delay type | Delay satisfaction index |
| Last | Last | All types | Latest value |
| Uniq | Estimated Uniq | Cardinality type | Estimated cardinality |
| UniqExact | Exact Uniq | Cardinality type | Exact cardinality |
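
The two spread operators can be illustrated with a short Python sketch that follows the table's definitions literally. The function names are hypothetical, and returning `None` when the minimum is 0 is an assumption of this sketch rather than documented behavior.

```python
def spread(values):
    """Absolute spread: Max minus Min."""
    return max(values) - min(values)


def rspread(values):
    """Relative spread: Max divided by Min (None if Min is 0)."""
    lowest = min(values)
    if lowest == 0:
        return None  # assumption: undefined when the minimum is zero
    return max(values) / lowest


delays_us = [50, 100, 300]
print(spread(delays_us), rspread(delays_us))
```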
#2.2 Secondary Operators
| Operator | Description |
|---|---|
| PerSecond | Calculates rate by dividing the inner operator result by the interval [1] |
| Math | Arithmetic operations: supports +, -, *, / |
| Percentage | Unit conversion to % |
- [1] For example: `PerSecond(Sum)` means summing first, then dividing by the API-provided interval `interval`; `PerSecond(Avg)` means averaging first, then dividing by the data source interval `data_precision`.
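
A minimal Python sketch of note [1] for a Counter metric follows. It assumes `Avg` divides the sum by `interval / data_precision`, as described for Counter/Gauge metrics in section 3.1; the function names are hypothetical.

```python
def per_second_sum(values, interval_s):
    """PerSecond(Sum): sum first, then divide by the query interval."""
    return sum(values) / interval_s


def per_second_avg(values, interval_s, data_precision_s):
    """PerSecond(Avg): average first, then divide by data_precision."""
    avg = sum(values) / (interval_s / data_precision_s)
    return avg / data_precision_s


# Five 1m data points (bytes per minute) queried with a 300s interval.
byte_per_minute = [600, 1200, 600, 1200, 600]
print(per_second_sum(byte_per_minute, 300))
print(per_second_avg(byte_per_minute, 300, 60))
```

Under these assumptions both forms yield the same per-second rate for a Counter, because the two divisors multiply out to the same interval.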
#3. Operator Calculation Logic for Different Metrics
#3.1 Counter/Gauge Metrics
- flow_metrics tables
  - `Sum` operator
    - Sum all data within the query time range
  - `Avg` operator
    - Sum all data within the query time range, then divide by `interval/data_precision`
  - Other operators
    - First aggregate using `Sum` based on `data_precision`
    - Then apply the selected operator using `ClickHouse` functions
  - When forced (due to other metrics in the same query) to use two-layer SQL calculation
    - `Sum/Avg` operator
      - First aggregate using `Sum` based on `data_precision`
      - Then apply the selected operator using `ClickHouse` functions
- flow_log tables
  - Apply the selected operator using `ClickHouse` functions
- prometheus/ext_metrics/deepflow_system tables
  - Same as flow_metrics tables
- Additional notes
  - `Min` operator fills 0 for time points with no data or null values
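
The flow_metrics rules above can be sketched in plain Python instead of ClickHouse SQL. This is an illustration under stated assumptions: rows are `(timestamp_seconds, value)` pairs, several rows may share a timestamp, and all function names are hypothetical.

```python
from collections import defaultdict


def sum_per_point(rows):
    """First layer: Sum the rows that share one data_precision time point."""
    per_point = defaultdict(float)
    for ts, value in rows:
        per_point[ts] += value
    return dict(per_point)


def counter_sum(rows):
    return sum(value for _, value in rows)


def counter_avg(rows, interval_s, data_precision_s):
    # Divides by the expected number of points, so gaps count as zero.
    return counter_sum(rows) / (interval_s / data_precision_s)


def counter_other(rows, op):
    """Other operators: Sum per time point, then apply `op`."""
    return op(sum_per_point(rows).values())


def counter_min(rows, start, interval_s, data_precision_s):
    """Min fills 0 for time points that have no data."""
    per_point = sum_per_point(rows)
    points = range(start, start + interval_s, data_precision_s)
    return min(per_point.get(ts, 0.0) for ts in points)


# 4s query window on a 1s data source; second 2 has no data.
rows = [(0, 10), (0, 5), (1, 20), (3, 5)]
print(counter_sum(rows), counter_avg(rows, 4, 1))
print(counter_other(rows, max), counter_min(rows, 0, 4, 1))
```

Because `Avg` divides by `interval/data_precision` rather than by the number of rows, the missing second lowers the average, which matches the note that zero values are not ignored for Counter/Gauge metrics.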
#3.2 Quotient/Percentage Metrics
- flow_metrics tables
  - `Avg` operator
    - Calculate `Sum(x)/Sum(y)` for all data within the query time range
  - Other operators
    - First aggregate `Sum(x)/Sum(y)` based on `data_precision`
    - Then apply the selected operator using `ClickHouse` functions
  - When forced to use two-layer SQL calculation
    - `Avg` operator
      - First aggregate `Sum(x)/Sum(y)` based on `data_precision`
      - Then apply the selected operator using `ClickHouse` functions
- flow_log tables
  - Apply the selected operator using `ClickHouse` function `func(x/y)`
- Additional notes
  - For `Percentage` metrics, the `Min` operator fills 0 for time points with no data
  - When calculating `Sum(x)/Sum(y)`, ignore points where the denominator is `0/null` or the numerator is `null`
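
The `Avg` rule for Quotient/Percentage metrics, including the points to ignore, might look like the following Python sketch. It interprets "ignore" as dropping both the numerator and the denominator of an affected point, which is an assumption; `ratio_avg` is a hypothetical name and `None` stands in for null.

```python
def ratio_avg(rows):
    """Compute Sum(x)/Sum(y), skipping unusable points.

    A point is skipped when its denominator is 0 or None,
    or when its numerator is None.
    """
    numerator = 0.0
    denominator = 0.0
    for x, y in rows:
        if x is None or y is None or y == 0:
            continue
        numerator += x
        denominator += y
    return numerator / denominator if denominator else None


# (byte, packet) pairs, e.g. for a bytes-per-packet style quotient.
rows = [(1500, 3), (None, 2), (800, 0), (500, 1)]
print(ratio_avg(rows))
```

Summing numerators and denominators separately weights each point by its denominator, which differs from averaging the per-point ratios.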
#3.3 Delay/BoundedGauge Metrics
- flow_metrics tables
  - Apply the selected operator using `ClickHouse` functions
  - When forced to use two-layer SQL calculation
    - `Avg/Min/Max` operators
      - Both layers apply the selected operator using `ClickHouse` functions
    - `Spread/Rspread` operators
      - First aggregate using `Max` and `Min` based on `data_precision`
      - Then apply the selected operator using `ClickHouse` functions
    - Other operators
      - First aggregate using `groupArray`
      - Then apply the selected operator using `ClickHouse` functions
- flow_log tables
  - Apply the selected operator using `ClickHouse` functions
- Additional notes
  - For `BoundedGauge` metrics, the `Min` operator fills 0 for time points with no data or null values
  - For `Delay` metrics, ignore points with value 0, as 0 is considered a meaningless delay value
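
The two-layer case for `Avg/Min/Max` on Delay metrics can be sketched in Python as follows. Grouping rows by timestamp stands in for aggregation by `data_precision`, zero and null delays are dropped before either layer, and the function name is hypothetical. This is one plausible reading of the rules above, not the actual generated SQL.

```python
from collections import defaultdict
from statistics import mean


def delay_two_layer(rows, op):
    """Apply `op` per time point, then again across time points.

    Delay values of 0 (or None) are ignored as meaningless.
    """
    per_point = defaultdict(list)
    for ts, value in rows:
        if value:  # skips 0 and None
            per_point[ts].append(value)
    inner = [op(values) for values in per_point.values()]
    return op(inner) if inner else None


# (timestamp_seconds, delay_us) rows; the 0 at second 0 is ignored.
rows = [(0, 100), (0, 0), (0, 300), (1, 250), (1, 50)]
print(delay_two_layer(rows, max))
print(delay_two_layer(rows, min))
print(delay_two_layer(rows, mean))
```

Without the zero filter, `Min` would report 0 for this data instead of the smallest real delay.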
#3.4 data_precision for Different Databases/Tables
| Database | data_precision | Notes |
|---|---|---|
| flow_metrics | 1s/1m | Supports 1s and 1m by default, can be aggregated to 1h, 1d |
| flow_log | 1s | No actual data_precision concept, value is for calculation purposes |
| application_log | 1s | No actual data_precision concept, value is for calculation purposes |
| prometheus | 10s | Can be modified via data_source_prometheus_interval in server.yaml |
| ext_metrics | 10s | Can be modified via data_source_ext_metrics_interval in server.yaml |
| deepflow_admin | 10s | |
| deepflow_tenant | 10s | |
| event | 1s | No actual data_precision concept, value is for calculation purposes |
| profile | 1s | No actual data_precision concept, value is for calculation purposes |