Metrics and Operators Calculation Logic

Created: 2024-06-01 Last Modified: 2024-06-24

This document was translated by ChatGPT

This article introduces the different types of metrics and the calculation logic of the operators applied to them.

#1. Metrics

Metrics are divided into two main categories: Application Performance Metrics and Network Performance Metrics.

#1.1 Application Performance Metrics

Application metrics are used to measure the performance of services during actual operation, focusing mainly on service throughput, response delay, and anomalies. By collecting these metrics, operations personnel and developers can better understand the performance of applications in real-world usage, identify potential performance issues, and take appropriate measures for optimization and improvement.

Each metric described below records one value per statistical cycle. The cycle length can be customized by the user; by default the system supports 1m (one minute) and 1s (one second), and data at these granularities is collectively referred to as the raw data source in the DeepFlow platform. If multiple values are calculated within one statistical cycle, they are aggregated into a single value. The aggregation logic is described later, by metric type.

#1.1.1 Throughput

| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| request | Request | | counter | |
| response | Response | | counter | |


#1.1.2 Delay

| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| rrt | Avg Delay | us | delay | |
| rrt_max | Max Delay | us | delay | |


#1.1.3 Error

| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| error | Error | | counter | |
| client_error | Client Error | | counter | |
| server_error | Server Error | | counter | |
| timeout | Timeout | | counter | |
| error_ratio | Error % | % | percentage | |
| client_error_ratio | Client Error % | % | percentage | |
| server_error_ratio | Server Error % | % | percentage | |


#1.2 Network Performance Metrics

Network metrics are quantitative indicators used to evaluate network performance, covering the network layer, transport layer, and application layer. These metrics include throughput, delay, performance, and anomaly types.

#1.2.1 L3 Throughput

| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| byte | Byte | Byte | counter | |
| byte_tx | Byte TX | Byte | counter | |
| byte_rx | Byte RX | Byte | counter | |
| packet | Packet | Packet | counter | |
| packet_tx | Packet TX | Packet | counter | |
| packet_rx | Packet RX | Packet | counter | |
| l3_byte | L3 Payload | Byte | counter | |
| l3_byte_tx | L3 Payload TX | Byte | counter | |
| l3_byte_rx | L3 Payload RX | Byte | counter | |
| bpp | Bytes per Packet | Byte | quotient | |
| bpp_tx | Bytes per Packet TX | Byte | quotient | |
| bpp_rx | Bytes per Packet RX | Byte | quotient | |


#1.2.2 L4 Throughput

| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| new_flow | New Flow | Flow | counter | |
| closed_flow | Closed Flow | Flow | counter | |
| flow_load | Active Flow | Flow | gauge | |
| syn_count | SYN Packet | Packet | counter | |
| synack_count | SYN-ACK Packet | Packet | counter | |
| l4_byte | L4 Payload | Byte | counter | |
| l4_byte_tx | L4 Payload TX | Byte | counter | |
| l4_byte_rx | L4 Payload RX | Byte | counter | |


Calculation logic for active connections (flow_load):

  • The collector counts raw active connections per 4-tuple (client IP, server IP, protocol, server port), then derives the active connections for each resource and path from those counts.
  • A connection is counted as active when traffic is collected for it within the time interval of the data source, with the following special cases:
    • 1s data source: the number of active connections counted in each second.
      • First second of each minute: also includes connections that carried no traffic in that second but have not ended. This is generally used to evaluate concurrent connections (several non-overlapping connections that each last less than one second may introduce some error).
      • Remaining 59 seconds of each minute: if none of the flows sharing a 4-tuple carried traffic in a given second, the connections for that 4-tuple are ignored for that second. This is generally used to evaluate the lower bound of concurrent connections.
    • 1m data source: the number of active connections counted in each minute.
      • Also includes connections that carried no traffic but have not ended. This is generally used to evaluate the upper bound of concurrent connections.
    • Custom data source: calculated from the 1s/1m data source using Avg/Max/Min, which has the same meaning as querying the 1s/1m data source directly with the Avg/Max/Min operator.
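
As a rough illustration of the per-second counting described above, the sketch below counts the distinct 4-tuples that carried traffic in each second. The `Packet` record and the `active_connections_per_second` helper are hypothetical names, not DeepFlow APIs, and the special handling of the first second of each minute (idle but unfinished connections) is deliberately left out.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Packet(NamedTuple):
    """Hypothetical observation: one packet seen at a whole-second timestamp."""
    ts: int            # Unix time in seconds
    client_ip: str
    server_ip: str
    protocol: str
    server_port: int


def active_connections_per_second(packets: Iterable[Packet]) -> dict[int, int]:
    """Count the distinct 4-tuples that carried traffic in each second."""
    seen: dict[int, set] = defaultdict(set)
    for p in packets:
        seen[p.ts].add((p.client_ip, p.server_ip, p.protocol, p.server_port))
    return {ts: len(tuples) for ts, tuples in sorted(seen.items())}


packets = [
    Packet(100, "10.0.0.1", "10.0.0.9", "TCP", 80),
    Packet(100, "10.0.0.1", "10.0.0.9", "TCP", 80),  # same 4-tuple, counted once
    Packet(100, "10.0.0.2", "10.0.0.9", "TCP", 80),
    Packet(101, "10.0.0.2", "10.0.0.9", "TCP", 80),
]
print(active_connections_per_second(packets))  # {100: 2, 101: 1}
```

A second with no packets for a 4-tuple simply does not contribute to that second, which matches the lower-bound reading of the remaining 59 seconds.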

#1.2.3 TCP Slow

| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| retrans_syn | SYN Retransmission | Packet | counter | |
| retrans_synack | SYN-ACK Retransmission | Packet | counter | |
| retrans | TCP Retransmission | Packet | counter | |
| retrans_tx | TCP Client Retransmission | Packet | counter | |
| retrans_rx | TCP Server Retransmission | Packet | counter | |
| zero_win | TCP ZeroWindow | Packet | counter | |
| zero_win_tx | TCP Client ZeroWindow | Packet | counter | |
| zero_win_rx | TCP Server ZeroWindow | Packet | counter | |
| retrans_syn_ratio | SYN Retrans. % | % | percentage | |
| retrans_synack_ratio | SYN-ACK Retrans. % | % | percentage | |
| retrans_ratio | TCP Retrans. % | % | percentage | |
| retrans_tx_ratio | TCP Client Retrans. % | % | percentage | |
| retrans_rx_ratio | TCP Server Retrans. % | % | percentage | |
| zero_win_ratio | TCP ZeroWindow % | % | percentage | |
| zero_win_tx_ratio | TCP Client ZeroWindow % | % | percentage | |
| zero_win_rx_ratio | TCP Server ZeroWindow % | % | percentage | |


#1.2.4 TCP Error

| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| tcp_establish_fail | Error | Flow | counter | |
| client_establish_fail | Client Error | Flow | counter | |
| server_establish_fail | Server Error | Flow | counter | |
| tcp_establish_fail_ratio | Error % | % | percentage | |
| client_establish_fail_ratio | Client Error % | % | percentage | |
| server_establish_fail_ratio | Server Error % | % | percentage | |
| tcp_transfer_fail | Transfer Error | Flow | counter | All transfer errors. |
| tcp_transfer_fail_ratio | Transfer Error % | % | percentage | |
| tcp_rst_fail | RST | Flow | counter | All RST errors. |
| tcp_rst_fail_ratio | RST % | % | percentage | |
| client_source_port_reuse | Est. - Client Port Reuse | Flow | counter | |
| server_syn_miss | Est. - Server SYN Miss | Flow | counter | |
| client_establish_other_rst | Est. - Client Other RST | Flow | counter | |
| client_ack_miss | Est. - Client ACK Miss | Flow | counter | |
| server_reset | Est. - Server Direct RST | Flow | counter | |
| server_establish_other_rst | Est. - Server Other RST | Flow | counter | |
| client_rst_flow | Transfer - Client RST | Flow | counter | |
| server_rst_flow | Transfer - Server RST | Flow | counter | |
| server_queue_lack | Transfer - Server Queue Overflow | Flow | counter | |
| tcp_timeout | Transfer - TCP Timeout | Flow | counter | |
| client_half_close_flow | Close - Client Half Close | Flow | counter | |
| server_half_close_flow | Close - Server Half Close | Flow | counter | |


#1.2.4.1 TCP Connection Errors

Figure: TCP connection establishment errors

#1.2.4.2 TCP Transmission Errors

Figure: TCP transmission errors

#1.2.5 Delay

| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| rtt | Avg TCP Est. Delay | us | delay | |
| rtt_client | Avg TCP Est. Client Delay | us | delay | |
| rtt_server | Avg TCP Est. Server Delay | us | delay | |
| srt | Avg TCP/ICMP Response Delay | us | delay | |
| art | Avg Data Delay | us | delay | |
| cit | Avg Client Idle Delay | us | delay | |
| rtt_max | Max TCP Est. Delay | us | delay | |
| rtt_client_max | Max TCP Est. Client Delay | us | delay | |
| rtt_server_max | Max TCP Est. Server Delay | us | delay | |
| srt_max | Max TCP/ICMP Response Delay | us | delay | |
| art_max | Max Data Delay | us | delay | |
| cit_max | Max Client Idle Delay | us | delay | |


Figure: Anatomy of TCP network delay

  • Delay generated during connection establishment
    • [1] The complete connection establishment delay spans from the client sending the SYN packet, through receiving the SYN+ACK packet from the server, to the client replying with an ACK packet. It can be further divided into the client connection establishment delay and the server connection establishment delay.
    • [2] Client connection establishment delay: the time the client takes to reply with an ACK packet after receiving the SYN+ACK packet.
    • [3] Server connection establishment delay: the time the server takes to reply with a SYN+ACK packet after receiving the SYN packet.
  • Delay generated during data communication, which can be divided into client waiting delay plus data transmission delay
    • [4] Client waiting delay: the time the client takes to send its first request after the connection is successfully established; also the time the client takes to send a data packet after receiving a data packet from the server.
    • [5] Data transmission delay: the time from the client sending a data packet to receiving the reply data packet from the server.
    • [6] System delay: the part of the data transmission delay generated by the system protocol stack, measured as the time from a data packet being sent to its ACK packet being received.
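
The relationship between delays [1], [2], and [3] can be sketched from three handshake timestamps. This is only an illustration: the `Handshake` record and `establishment_delays` helper are hypothetical names, and the timestamps are assumed to be in microseconds and captured at a single observation point.

```python
from dataclasses import dataclass


@dataclass
class Handshake:
    """Hypothetical capture timestamps (us) taken at one observation point."""
    syn: int       # client SYN seen
    syn_ack: int   # server SYN+ACK seen
    ack: int       # client ACK seen


def establishment_delays(h: Handshake) -> dict[str, int]:
    server = h.syn_ack - h.syn   # [3] server connection establishment delay
    client = h.ack - h.syn_ack   # [2] client connection establishment delay
    # [1] complete delay is the sum of the two parts, i.e. ack - syn
    return {"total": server + client, "client": client, "server": server}


print(establishment_delays(Handshake(syn=0, syn_ack=300, ack=450)))
# {'total': 450, 'client': 150, 'server': 300}
```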

#1.2.6 Application

| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| l7_request | Request | | counter | |
| l7_response | Response | | counter | |
| rrt | Avg App. Delay | us | delay | |
| rrt_max | Max App. Delay | us | delay | |
| l7_error | App. Error | | counter | |
| l7_client_error | App. Client Error | | counter | |
| l7_server_error | App. Server Error | | counter | |
| l7_timeout | App. Server Timeout | | counter | |
| l7_error_ratio | App. Error % | % | percentage | |
| l7_client_error_ratio | App. Client Error % | % | percentage | |
| l7_server_error_ratio | App. Server Error % | % | percentage | |


#1.2.7 Cardinality

Cardinality metrics count the number of unique tag values collected within a statistical cycle. For example, querying the client IP address (ip_0) cardinality for all accesses to pod_1 counts the unique client IP addresses across all traffic accessing pod_1.

Field DisplayName Unit Type Description

generate from csv file: network.en?Category=Cardinality
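
The pod_1 example can be sketched as an exact distinct count. The flow records and the `unique_clients` helper below are hypothetical and only illustrate the idea of cardinality; they are not how DeepFlow stores or queries the data.

```python
# Hypothetical flow records: (client IP as ip_0, name of the pod being accessed).
flows = [
    ("10.0.0.1", "pod_1"),
    ("10.0.0.2", "pod_1"),
    ("10.0.0.1", "pod_1"),  # repeated client, not counted twice
    ("10.0.0.3", "pod_2"),  # different destination, filtered out
]


def unique_clients(flows, pod):
    """Exact cardinality of client IPs among the flows that access `pod`."""
    return len({ip_0 for ip_0, dst in flows if dst == pod})


print(unique_clients(flows, "pod_1"))  # 2
```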

#2. Operators

Operators compute results from the raw data source according to the selected time range and interval. For example, when a line chart shows the 1s raw data source for the last 5 minutes at a 20s interval with the Avg operator, the point at 14:43:00 reads all data within the time range 14:42:40 - 14:43:00 and calculates its average.
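
The 14:43:00 example can be reproduced with a small windowing sketch. The `bucket_average` helper and the sample data are hypothetical, and treating the window as open at the start and closed at the end is an assumption; the text does not say which boundary is inclusive.

```python
from datetime import datetime, timedelta
from statistics import mean


def bucket_average(samples, point, interval_s):
    """Average the samples in the `interval_s`-second window ending at `point`.

    Assumption: the window is (point - interval, point].
    """
    start = point - timedelta(seconds=interval_s)
    window = [v for ts, v in samples if start < ts <= point]
    return mean(window) if window else None


base = datetime(2024, 6, 1, 14, 42, 0)
# Hypothetical 1s data: each value equals its offset in seconds from 14:42:00.
samples = [(base + timedelta(seconds=i), i) for i in range(0, 61)]

print(bucket_average(samples, datetime(2024, 6, 1, 14, 43, 0), 20))  # 50.5
```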

Operators can be nested, but aggregate operators cannot be nested inside one another. For example, PerSecond(Avg(byte)) first calculates Avg(byte) and then applies PerSecond to the result.

#2.1 Aggregate Operators

| Operator | English Name | Applicable Metric Types | Description |
| --- | --- | --- | --- |
| Avg | Average | All types | Average value (does not ignore zero values for Counter/Gauge metrics) |
| AAvg | Arithmetic Average | All types | Arithmetic average (first calculate the average at each time point, then calculate the average of the averages) |
| Sum | Sum | Counter type | Sum |
| Max | Maximum | All types | Maximum value |
| Min | Minimum | All types | Minimum value |
| Percentile | Estimated Percentile | All types | Estimated percentile |
| PercentileExact | Exact Percentile | All types | Exact percentile |
| Spread | Spread | All types | Absolute spread, Max minus Min within the statistical cycle |
| Rspread | Relative Spread | All types | Relative spread, Max divided by Min within the statistical cycle |
| Stddev | Standard Deviation | All types | Standard deviation |
| Apdex | Application Performance Index | Delay type | Delay satisfaction |
| Last | Last | All types | Latest value |
| Uniq | Estimated Uniq | Cardinality type | Estimated cardinality |
| UniqExact | Exact Uniq | Cardinality type | Exact cardinality |
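
A minimal sketch of three of the operators above, following their one-line descriptions. The sample values are hypothetical, and reading "each time point" as a group of values that share a timestamp is an assumption.

```python
from statistics import mean

# Hypothetical delay values (us), grouped by the time point they belong to.
points = {
    "14:42:40": [10, 30],
    "14:42:41": [50],
}
all_values = [v for values in points.values() for v in values]

aavg = mean(mean(values) for values in points.values())  # AAvg: mean of per-point means
spread = max(all_values) - min(all_values)               # Spread: Max minus Min
rspread = max(all_values) / min(all_values)              # Rspread: Max divided by Min

print(aavg, spread, rspread)
```

Here AAvg is 35 while the plain mean of all three values is 30, which shows why the two averages are listed as separate operators.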

#2.2 Secondary Operators

| Operator | Description |
| --- | --- |
| PerSecond | Calculate rate: divide the result of the inner operator by the time interval [1] |
| Math | Arithmetic operations, supports +, -, *, / |
| Percentage | Unit conversion to % |

  • [1] For example: PerSecond(Sum) calculates the sum first, then divides it by the query time interval (the interval value passed through the API); PerSecond(Avg) calculates the average first, then divides it by the time interval of the data source (data_precision).
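
The two divisors in note [1] can be compared with a short sketch. The helper names and the sample byte counts are hypothetical; only the choice of divisor (interval versus data_precision) comes from the note.

```python
def per_second_sum(values, interval_s):
    """PerSecond(Sum): sum first, then divide by the query interval."""
    return sum(values) / interval_s


def per_second_avg(values, data_precision_s):
    """PerSecond(Avg): average first, then divide by the data source precision."""
    return (sum(values) / len(values)) / data_precision_s


# Hypothetical byte counts from a 1m data source, queried with a 5m interval.
byte_per_minute = [600, 1200, 600, 0, 600]

print(per_second_sum(byte_per_minute, interval_s=300))       # 10.0
print(per_second_avg(byte_per_minute, data_precision_s=60))  # 10.0
```

With a complete set of data points both forms give the same rate, because the average already divides the sum by interval/data_precision.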

#3. Calculation Logic of Different Metrics' Operators

#3.1 Counter/Gauge Metrics

  • flow_metrics data table
    • Sum operator
      • Calculate the Sum of all data within the query time range
    • Avg operator
      • Calculate the Sum of all data within the query time range, then divide by interval/data_precision (the number of data points that fit in one interval)
    • Other operators
      • First use Sum to aggregate based on data_precision
      • Then call the ClickHouse function for the selected specific operator
    • When two layers of SQL calculation are required (because other metrics in the same statement need them)
      • Sum/Avg operator
        • First use Sum to aggregate based on data_precision
        • Then call the ClickHouse function for the selected specific operator
  • flow_log data table
    • Call the ClickHouse function for the selected specific operator
  • prometheus/ext_metrics/deepflow_system data table
    • Same as flow_metrics data table
  • Additional notes
    • The Min operator fills in 0 for time points that have no data or whose data is null
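
The Sum and Avg rules for the flow_metrics table can be sketched as follows. The helper names and request counts are hypothetical. Because the divisor is interval/data_precision rather than the number of points present, seconds without data effectively count as zero in this sketch.

```python
def counter_sum(values):
    """Sum operator: add up every data point in the range."""
    return sum(values)


def counter_avg(values, interval_s, data_precision_s):
    """Avg operator: Sum divided by interval/data_precision."""
    return sum(values) / (interval_s / data_precision_s)


# Hypothetical 1s request counts in a 5s interval; two seconds have no data.
present = [4, 6, 10]

print(counter_sum(present))        # 20
print(counter_avg(present, 5, 1))  # 4.0
```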

#3.2 Quotient/Percentage Metrics

  • flow_metrics data table
    • Avg operator
      • Calculate Sum(x)/Sum(y) for all data within the query time range
    • Other operators
      • First use Sum(x)/Sum(y) to aggregate based on data_precision
      • Then call the ClickHouse function for the selected specific operator
    • When two layers of SQL calculation are required (because other metrics in the same statement need them)
      • Avg operator
        • First use Sum(x)/Sum(y) to aggregate based on data_precision
        • Then call the ClickHouse function for the selected specific operator
  • flow_log data table
    • Call the ClickHouse function func(x/y) for the selected specific operator
  • Additional notes
    • The Min operator for Percentage metrics fills 0 for time points with no data
    • When calculating Sum(x)/Sum(y), points with a denominator of 0/null or a numerator of null are ignored
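
A minimal sketch of the Avg rule for Quotient/Percentage metrics, using bpp (bytes per packet) as the example. The `quotient_avg` helper and the sample pairs are hypothetical; the filtering follows the note on ignored points.

```python
def quotient_avg(points):
    """Avg for Quotient/Percentage metrics: Sum(x) / Sum(y).

    Points with a 0/None denominator or a None numerator are skipped.
    """
    kept = [(x, y) for x, y in points if x is not None and y]
    if not kept:
        return None
    return sum(x for x, _ in kept) / sum(y for _, y in kept)


# Hypothetical (byte, packet) pairs; the last three are ignored.
points = [(1500, 1), (500, 9), (None, 3), (200, 0), (100, None)]

print(quotient_avg(points))  # 200.0
```

Dividing the sums weights each point by its packet count, so the result differs from averaging the per-point ratios.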

#3.3 Delay/BoundedGauge Metrics

  • flow_metrics data table
    • Call the ClickHouse function for the selected specific operator
    • When two layers of SQL calculation are required (because other metrics in the same statement need them)
      • Avg/Min/Max operator
        • Both layers call the ClickHouse function for the selected specific operator
      • Spread/Rspread operator
        • First use Max and Min to aggregate based on data_precision
        • Then call the ClickHouse function for the selected specific operator
      • Other operators
        • First use groupArray to aggregate
        • Then call the ClickHouse function for the selected specific operator
  • flow_log data table
    • Call the ClickHouse function for the selected specific operator
  • Additional notes
    • The Min operator for BoundedGauge metrics fills in 0 for time points that have no data or whose data is null
    • Delay metrics ignore points with a value of 0, considering 0 as a meaningless delay value
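
Two of the rules above can be sketched in a few lines: dropping zero delays before averaging, and computing Spread in two layers. The helper names and values are hypothetical, and plain Python stands in for the ClickHouse functions.

```python
def delay_avg(values):
    """Avg for Delay metrics: zeros carry no meaning and are dropped first."""
    valid = [v for v in values if v]
    return sum(valid) / len(valid) if valid else None


def two_layer_spread(slots):
    """Spread in two layers: Max and Min per data_precision slot, then Max minus Min."""
    inner = [(max(s), min(s)) for s in slots if s]
    return max(hi for hi, _ in inner) - min(lo for _, lo in inner)


# Hypothetical rrt values in us.
print(delay_avg([120, 0, 300, 0, 180]))            # 200.0
print(two_layer_spread([[120, 300], [80, 200]]))   # 220
```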

#3.4 data_precision of Different Databases/Tables

| Database | data_precision | Remarks |
| --- | --- | --- |
| flow_metrics | 1s/1m | Supports 1s and 1m by default, can be aggregated to 1h and 1d |
| flow_log | 1s | No actual concept of data_precision, the value is for convenience in calculation |
| application_log | 1s | No actual concept of data_precision, the value is for convenience in calculation |
| prometheus | 10s | Can be modified through the data_source_prometheus_interval field in server.yaml |
| ext_metrics | 10s | Can be modified through the data_source_ext_metrics_interval field in server.yaml |
| deepflow_admin | 10s | |
| deepflow_tenant | 10s | |
| event | 1s | No actual concept of data_precision, the value is for convenience in calculation |
| profile | 1s | No actual concept of data_precision, the value is for convenience in calculation |