Calculation Logic of Metrics and Operators

This document was translated by GPT-4

This article introduces the different types of metrics and the calculation logic of each operator.

# 1. Metrics

Metrics are divided into two main categories: application performance metrics and network performance metrics.

# 1.1 Application Performance Metrics

Application metrics are used to measure the performance of services in actual operation, focusing mainly on service throughput, response latency, and exceptions. By quantifying these metrics, operations personnel and developers can better understand the performance of the application during actual use and identify potential performance issues, thereby taking appropriate measures for optimization and improvement.

Each metric described below records one metric quantity per statistical period. Users can choose the statistical period; the system currently supports 1m (one minute) and 1s (one second). The DeepFlow platform refers to these data collectively as the original data source. If multiple metric quantities are calculated within one statistical period, they are aggregated and recorded as a single metric quantity. The aggregation logic depends on the metric type and is described in the sections below.

# 1.1.1 Throughput

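| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| request | Requests | | counter | Total number of requests |
| response | Responses | | counter | Total number of responses |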


# 1.1.2 Delay

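| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| rrt | Avg Delay | μs | delay | Average of all application delays within the collection period; a single application delay is the time difference between the response and the request |
| rrt_max | Max Delay | μs | delay | Maximum of all application delays within the collection period; a single application delay is the time difference between the response and the request |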


# 1.1.3 Error

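| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| error | Errors | | counter | Client Errors + Server Errors |
| client_error | Client Errors | | counter | Errors are determined from the response code of the specific application protocol; for the per-protocol definitions, see the description of the response_status field in l7_flow_log |
| server_error | Server Errors | | counter | Errors are determined from the response code of the specific application protocol; for the per-protocol definitions, see the description of the response_status field in l7_flow_log |
| timeout | Timeouts | | counter | Number of application timeouts (with the default configuration: no response collected within 1800s for TCP applications, or within 150s for UDP applications) |
| error_ratio | Error Ratio | % | percentage | Errors / Responses |
| client_error_ratio | Client Error Ratio | % | percentage | Client Errors / Responses |
| server_error_ratio | Server Error Ratio | % | percentage | Server Errors / Responses |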

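The composite error metrics in the table above can be sketched in a few lines of Python. This is only an illustration of the stated formulas, not DeepFlow's implementation; the function names are hypothetical, and returning `None` when there are no responses is an assumption made here to avoid division by zero.

```python
from typing import Optional


def ratio_percent(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator as a percentage, or None if undefined."""
    if denominator == 0:
        # Assumption: the ratio is treated as undefined when nothing was counted.
        return None
    return 100.0 * numerator / denominator


def error_metrics(response: int, client_error: int, server_error: int) -> dict:
    """Derive the composite error metrics from the three base counters."""
    error = client_error + server_error  # error = client_error + server_error
    return {
        "error": error,
        "error_ratio": ratio_percent(error, response),
        "client_error_ratio": ratio_percent(client_error, response),
        "server_error_ratio": ratio_percent(server_error, response),
    }


# Hypothetical counters for one statistical period.
metrics = error_metrics(response=200, client_error=6, server_error=4)
print(metrics)
```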

# 1.2 Network Performance Metrics

Network metrics are quantitative indicators used to assess network performance, covering the network layer, transport layer, and application layer. They include throughput, latency, TCP performance, and exception metrics.

# 1.2.1 L3 Throughput

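| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| byte | Bytes | Byte | counter | Bytes Sent + Bytes Received |
| byte_tx | Bytes Sent | Byte | counter | Total bytes sent by the resource (including the Ethernet header) |
| byte_rx | Bytes Received | Byte | counter | Total bytes received by the resource (including the Ethernet header) |
| packet | Packets | | counter | Packets Sent + Packets Received |
| packet_tx | Packets Sent | | counter | Total packets sent by the resource |
| packet_rx | Packets Received | | counter | Total packets received by the resource |
| l3_byte | L3 Payload | Byte | counter | L3 Payload Sent + L3 Payload Received |
| l3_byte_tx | L3 Payload Sent | Byte | counter | Total network-layer payload bytes sent by the resource (excluding the IP header) |
| l3_byte_rx | L3 Payload Received | Byte | counter | Total network-layer payload bytes received by the resource (excluding the IP header) |
| bpp | Avg Packet Length | Byte | quotient | Bytes / Packets |
| bpp_tx | Avg Sent Packet Length | Byte | quotient | Bytes Sent / Packets Sent |
| bpp_rx | Avg Received Packet Length | Byte | quotient | Bytes Received / Packets Received |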


# 1.2.2 L4 Throughput

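| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| new_flow | New Connections | Connection | counter | Number of TCP connections newly established within the collection period; for the definition of a connection, see the documentation |
| closed_flow | Closed Connections | Connection | counter | Number of TCP connections closed within the collection period; for the definition of a connection, see the documentation |
| flow_load | Active Connections | Connection | gauge | Number of active connections within the collection period, including long-lived connections with data exchange, long-lived connections without data exchange, and short-lived connections closed within the period; for the definition of a connection, see the documentation |
| syn_count | SYN Packets | | counter | Total number of SYN packets |
| synack_count | SYN-ACK Packets | | counter | Total number of SYN-ACK packets |
| l4_byte | L4 Payload | Byte | counter | L4 Payload Sent + L4 Payload Received |
| l4_byte_tx | L4 Payload Sent | Byte | counter | Total transport-layer payload bytes of packets sent by the resource (excluding the TCP/UDP header) |
| l4_byte_rx | L4 Payload Received | Byte | counter | Total transport-layer payload bytes of packets received by the resource (excluding the TCP/UDP header) |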


Active connection calculation logic:

  • The collector counts the original active connections per four-tuple (client IP, server IP, protocol, server port), and then derives the active connections for the corresponding resources and paths.
  • A connection is counted as active if traffic is collected within the time interval of the data source, with the following special cases:
    • 1s data source: the number of active connections counted per second.
      • First second of each minute: also includes connections that have no traffic within this second but have not ended. This is generally used to evaluate concurrent connections (many non-overlapping connections that each last less than a second can introduce some error).
      • Remaining 59 seconds of each minute: if the flows of a four-tuple have no traffic within a given second, the connections of that four-tuple are not counted for that second. This is generally used to evaluate the lower limit of concurrent connections.
    • 1m data source: the number of active connections counted per minute.
      • Includes connections that have no traffic but have not ended. This is generally used to evaluate the upper limit of concurrent connections.
    • Custom data sources: derived from the 1s/1m data sources through Avg/Max/Min calculation. The meaning is the same as querying the 1s/1m data source directly with the Avg/Max/Min operator.

# 1.2.3 TCP Performance (TCP Slow)

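| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| retrans_syn | SYN Retransmissions | | counter | Number of SYN packet retransmissions |
| retrans_synack | SYN-ACK Retransmissions | | counter | Number of SYN-ACK packet retransmissions |
| retrans | TCP Retransmissions | | counter | TCP Client Retransmissions + TCP Server Retransmissions |
| retrans_tx | TCP Client Retransmissions | | counter | Number of TCP retransmission packets sent by the resource |
| retrans_rx | TCP Server Retransmissions | | counter | Number of TCP retransmission packets received by the resource |
| zero_win | TCP Zero Windows | | counter | TCP Client Zero Windows + TCP Server Zero Windows |
| zero_win_tx | TCP Client Zero Windows | | counter | Number of TCP zero-window packets sent by the resource |
| zero_win_rx | TCP Server Zero Windows | | counter | Number of TCP zero-window packets received by the resource |
| retrans_syn_ratio | SYN Retransmission Ratio | % | percentage | SYN Retransmissions / SYN Packets |
| retrans_synack_ratio | SYN-ACK Retransmission Ratio | % | percentage | SYN-ACK Retransmissions / SYN-ACK Packets |
| retrans_ratio | TCP Retransmission Ratio | % | percentage | TCP Retransmissions / Packets |
| retrans_tx_ratio | TCP Client Retransmission Ratio | % | percentage | TCP Client Retransmissions / Packets Sent |
| retrans_rx_ratio | TCP Server Retransmission Ratio | % | percentage | TCP Server Retransmissions / Packets Received |
| zero_win_ratio | TCP Zero Window Ratio | % | percentage | TCP Zero Windows / Packets |
| zero_win_tx_ratio | TCP Client Zero Window Ratio | % | percentage | TCP Client Zero Windows / Packets Sent |
| zero_win_rx_ratio | TCP Server Zero Window Ratio | % | percentage | TCP Server Zero Windows / Packets Received |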


# 1.2.4 TCP Exceptions (TCP Error)

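| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| tcp_establish_fail | Establishment - Failures | | counter | Establishment - Client Failures + Establishment - Server Failures |
| client_establish_fail | Establishment - Client Failures | | counter | Establishment - Client Port Reuse + Establishment - Client Other Reset + Establishment - Client ACK Missing |
| server_establish_fail | Establishment - Server Failures | | counter | Establishment - Server SYN Missing + Establishment - Server Direct Reset + Establishment - Server Other Reset |
| tcp_establish_fail_ratio | Establishment - Failure Ratio | % | percentage | Establishment - Failures / Closed Connections |
| client_establish_fail_ratio | Establishment - Client Failure Ratio | % | percentage | Establishment - Client Failures / Closed Connections |
| server_establish_fail_ratio | Establishment - Server Failure Ratio | % | percentage | Establishment - Server Failures / Closed Connections |
| tcp_transfer_fail | Transfer - Failures | | counter | Transfer - Client Reset + Transfer - Server Reset + Transfer - Server Queue Overflow + Transfer - TCP Connection Timeout |
| tcp_transfer_fail_ratio | Transfer - Failure Ratio | % | percentage | Transfer - Failures / Closed Connections |
| tcp_rst_fail | Resets | Connection | counter | Establishment - Client Other Reset + Establishment - Server Direct Reset + Establishment - Server Other Reset + Transfer - Client Reset + Transfer - Server Reset |
| tcp_rst_fail_ratio | Reset Ratio | % | percentage | Resets / Closed Connections |
| client_source_port_reuse | Establishment - Client Port Reuse | Connection | counter | One of the TCP connection establishment failure scenarios; see the documentation |
| server_syn_miss | Establishment - Server SYN Missing | Connection | counter | One of the TCP connection establishment failure scenarios; see the documentation |
| client_establish_other_rst | Establishment - Client Other Reset | Connection | counter | One of the TCP connection establishment failure scenarios; see the documentation |
| client_ack_miss | Establishment - Client ACK Missing | Connection | counter | One of the TCP connection establishment failure scenarios; see the documentation |
| server_reset | Establishment - Server Direct Reset | Connection | counter | One of the TCP connection establishment failure scenarios; see the documentation |
| server_establish_other_rst | Establishment - Server Other Reset | Connection | counter | One of the TCP connection establishment failure scenarios; see the documentation |
| client_rst_flow | Transfer - Client Reset | Connection | counter | One of the TCP transfer failure scenarios; see the documentation |
| server_rst_flow | Transfer - Server Reset | Connection | counter | One of the TCP transfer failure scenarios; see the documentation |
| server_queue_lack | Transfer - Server Queue Overflow | Connection | counter | One of the TCP transfer failure scenarios; see the documentation |
| tcp_timeout | Transfer - TCP Connection Timeout | Connection | counter | One of the TCP transfer failure scenarios; see the documentation |
| client_half_close_flow | Disconnection - Client Half Close | Connection | counter | One of the TCP disconnection exception scenarios; see the documentation |
| server_half_close_flow | Disconnection - Server Half Close | Connection | counter | One of the TCP disconnection exception scenarios; see the documentation |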

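The composite counters in the table above are sums of the scenario counters, and each ratio divides by the number of closed connections. A minimal Python sketch of those formulas follows. The dictionary keys reuse the field names from the table; the function name, the sample values, and the `None` result for zero closed connections are assumptions made for illustration.

```python
from typing import Optional


def percent(numerator: int, denominator: int) -> Optional[float]:
    # Assumption: the ratio is undefined when no connection was closed.
    return None if denominator == 0 else 100.0 * numerator / denominator


def tcp_error_metrics(c: dict) -> dict:
    """Derive composite TCP exception metrics from the scenario counters."""
    client_establish_fail = (
        c["client_source_port_reuse"] + c["client_establish_other_rst"] + c["client_ack_miss"]
    )
    server_establish_fail = (
        c["server_syn_miss"] + c["server_reset"] + c["server_establish_other_rst"]
    )
    tcp_establish_fail = client_establish_fail + server_establish_fail
    tcp_transfer_fail = (
        c["client_rst_flow"] + c["server_rst_flow"] + c["server_queue_lack"] + c["tcp_timeout"]
    )
    tcp_rst_fail = (
        c["client_establish_other_rst"] + c["server_reset"] + c["server_establish_other_rst"]
        + c["client_rst_flow"] + c["server_rst_flow"]
    )
    closed = c["closed_flow"]
    return {
        "client_establish_fail": client_establish_fail,
        "server_establish_fail": server_establish_fail,
        "tcp_establish_fail": tcp_establish_fail,
        "tcp_transfer_fail": tcp_transfer_fail,
        "tcp_rst_fail": tcp_rst_fail,
        "tcp_establish_fail_ratio": percent(tcp_establish_fail, closed),
        "tcp_transfer_fail_ratio": percent(tcp_transfer_fail, closed),
        "tcp_rst_fail_ratio": percent(tcp_rst_fail, closed),
    }


# Hypothetical counters for one statistical period.
counters = {
    "client_source_port_reuse": 1, "client_establish_other_rst": 2, "client_ack_miss": 3,
    "server_syn_miss": 4, "server_reset": 5, "server_establish_other_rst": 6,
    "client_rst_flow": 7, "server_rst_flow": 8, "server_queue_lack": 9, "tcp_timeout": 10,
    "closed_flow": 200,
}
tcp_metrics = tcp_error_metrics(counters)
print(tcp_metrics)
```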

[Figure: TCP client connection exceptions]

[Figure: TCP server connection exceptions]

[Figure: TCP transfer exceptions]

[Figure: TCP disconnection exceptions]

[Figure: TCP connection timeouts]

# 1.2.5 Transport Layer Delay (Delay)

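| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| rtt | Avg TCP Establishment Delay | μs | delay | Average of all TCP connection establishment delays within the collection period; for how a single delay is calculated, see the documentation |
| rtt_client | Avg TCP Establishment Client Delay | μs | delay | Average of all client-side TCP connection establishment delays within the collection period; for how a single delay is calculated, see the documentation |
| rtt_server | Avg TCP Establishment Server Delay | μs | delay | Average of all server-side TCP connection establishment delays within the collection period; for how a single delay is calculated, see the documentation |
| srt | Avg TCP/ICMP System Delay | μs | delay | Average of all TCP/ICMP system delays within the collection period; for how a single delay is calculated, see the documentation |
| art | Avg Data Delay | μs | delay | Average of all data delays within the collection period; data delay covers TCP and UDP; for how a single delay is calculated, see the documentation |
| cit | Avg Client Wait Delay | μs | delay | Average of all client wait delays within the collection period; this delay covers TCP only; for how a single delay is calculated, see the documentation |
| rtt_max | Max TCP Establishment Delay | μs | delay | Maximum of all TCP connection establishment delays within the collection period; for how a single delay is calculated, see the documentation |
| rtt_client_max | Max TCP Establishment Client Delay | μs | delay | Maximum of all client-side TCP connection establishment delays within the collection period; for how a single delay is calculated, see the documentation |
| rtt_server_max | Max TCP Establishment Server Delay | μs | delay | Maximum of all server-side TCP connection establishment delays within the collection period; for how a single delay is calculated, see the documentation |
| srt_max | Max TCP/ICMP System Delay | μs | delay | Maximum of all TCP/ICMP system delays within the collection period; for how a single delay is calculated, see the documentation |
| art_max | Max Data Delay | μs | delay | Maximum of all data delays within the collection period; data delay covers TCP and UDP; for how a single delay is calculated, see the documentation |
| cit_max | Max Client Wait Delay | μs | delay | Maximum of all client wait delays within the collection period; this delay covers TCP only; for how a single delay is calculated, see the documentation |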


[Figure: TCP network delay dissection]

  • Delay incurred during connection establishment
    • [1] The complete connection establishment delay spans from the client sending the SYN packet, through receiving the SYN+ACK packet from the server, to the client replying with the ACK packet. It can be further divided into the client connection establishment delay and the server connection establishment delay.
    • [2] The client connection establishment delay is the time the client takes to reply with an ACK packet after receiving the SYN+ACK packet.
    • [3] The server connection establishment delay is the time the server takes to reply with a SYN+ACK packet after receiving the SYN packet.
  • Delay incurred during data communication, which can be broken down into client wait delay + data transfer delay.
    • [4] The client wait delay is the time the client takes to send its first request after the connection is successfully established, or the time the client takes to send another data packet after receiving a data packet from the server.
    • [5] The data transfer delay is the time from the client sending a data packet to receiving a reply data packet from the server.
    • [6] Part of the data transfer delay is spent in the system protocol stack. This part is called the system delay: the time between a data packet and the ACK packet that acknowledges it.
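The items [1] to [6] above can be expressed as simple timestamp differences. The sketch below assumes every packet is timestamped in microseconds at a single capture point; the class, function, and parameter names are hypothetical and only mirror the prose, not the collector's actual code.

```python
from dataclasses import dataclass


@dataclass
class HandshakeTimestamps:
    """Capture times (μs) of the three handshake packets; names are illustrative."""
    syn: int      # client SYN
    syn_ack: int  # server SYN+ACK
    ack: int      # client ACK


def handshake_delays(ts: HandshakeTimestamps) -> dict:
    rtt_server = ts.syn_ack - ts.syn   # [3] server replies SYN+ACK after SYN
    rtt_client = ts.ack - ts.syn_ack   # [2] client replies ACK after SYN+ACK
    return {
        "rtt": rtt_client + rtt_server,  # [1] complete establishment delay
        "rtt_client": rtt_client,
        "rtt_server": rtt_server,
    }


def data_delays(previous_event: int, request: int, request_ack: int, response: int) -> dict:
    """previous_event is the handshake completion or the last server data packet."""
    return {
        "cit": request - previous_event,  # [4] client wait delay
        "art": response - request,        # [5] data transfer delay
        "srt": request_ack - request,     # [6] system delay (data packet to its ACK)
    }


handshake = handshake_delays(HandshakeTimestamps(syn=0, syn_ack=300, ack=350))
data = data_delays(previous_event=350, request=1350, request_ack=1400, response=3350)
print(handshake, data)
```

The per-period metrics such as rtt and rtt_max would then be the average and the maximum of these single-measurement values over the collection period.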

# 1.2.6 Application Layer Metrics (Application)

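| Field | DisplayName | Unit | Type | Description |
| --- | --- | --- | --- | --- |
| l7_request | Application Requests | | counter | Number of application-layer protocol requests |
| l7_response | Application Responses | | counter | Number of application-layer protocol responses |
| rrt | Avg Application Delay | μs | delay | Average of all application delays within the collection period; a single application delay is the time difference between the response and the request |
| rrt_max | Max Application Delay | μs | delay | Maximum of all application delays within the collection period; a single application delay is the time difference between the response and the request |
| l7_error | Application Errors | | counter | Application Client Errors + Application Server Errors |
| l7_client_error | Application Client Errors | | counter | Errors are determined from the response code of the specific application protocol; for the per-protocol definitions, see the description of the response_status field in l7_flow_log |
| l7_server_error | Application Server Errors | | counter | Errors are determined from the response code of the specific application protocol; for the per-protocol definitions, see the description of the response_status field in l7_flow_log |
| l7_timeout | Application Timeouts | | counter | Number of application timeouts (with the default configuration: no response collected within 1800s for TCP applications, or within 150s for UDP applications) |
| l7_error_ratio | Application Error Ratio | % | percentage | Application Errors / Application Responses |
| l7_client_error_ratio | Application Client Error Ratio | % | percentage | Application Client Errors / Application Responses |
| l7_server_error_ratio | Application Server Error Ratio | % | percentage | Application Server Errors / Application Responses |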


# 1.2.7 Cardinality Statistics (Cardinality)

The number of distinct tag values counted within the statistical period. For example, querying the cardinality of "client IP address (ip_0)" over all traffic accessing pod_1 returns how many distinct client IP addresses appear in the traffic visiting pod_1.

Field DisplayName Unit Type Description

generate from csv file: network.ch?Category=Cardinality
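As an illustration of the pod_1 example in this section, exact cardinality is simply the size of the set of distinct tag values in the matching traffic. In the sketch below the row layout and the `server_pod` key are hypothetical; only `ip_0` comes from the text.

```python
def uniq_exact(rows: list, tag: str, **filters) -> int:
    """Count distinct values of `tag` among rows matching all filters."""
    return len({
        row[tag]
        for row in rows
        if all(row.get(key) == value for key, value in filters.items())
    })


# Hypothetical flow records within one statistical period.
rows = [
    {"ip_0": "10.0.0.1", "server_pod": "pod_1"},
    {"ip_0": "10.0.0.1", "server_pod": "pod_1"},  # repeated client, counted once
    {"ip_0": "10.0.0.2", "server_pod": "pod_1"},
    {"ip_0": "10.0.0.3", "server_pod": "pod_2"},  # different pod, filtered out
]
distinct_clients = uniq_exact(rows, "ip_0", server_pod="pod_1")
print(distinct_clients)
```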

# 2. Operators

Operators compute over the data in the original data source according to the selected time range and interval. For example, suppose a line chart shows the 1s original data source for the latest 5 minutes, using Avg at a 20s interval. The point at 14:43:00 on the chart is obtained by reading all the data in the original data source within the time range 14:42:40 - 14:43:00 and calculating its average.
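The line-chart example above can be sketched as a windowed average over per-second samples. The sample data and the calendar date are made up, and treating the window as half-open, [14:42:40, 14:43:00), is an assumption; the text does not say whether the end point is included.

```python
from datetime import datetime, timedelta
from statistics import mean


def avg_point(samples: dict, point: datetime, interval: timedelta) -> float:
    """Average all samples in [point - interval, point) (assumed half-open)."""
    start = point - interval
    window = [value for ts, value in samples.items() if start <= ts < point]
    return mean(window)


# Hypothetical 1s data: one sample per second, valued by its offset from 14:42:00.
base = datetime(2024, 1, 1, 14, 42, 0)
samples = {base + timedelta(seconds=i): float(i) for i in range(120)}

point_value = avg_point(samples, datetime(2024, 1, 1, 14, 43, 0), timedelta(seconds=20))
print(point_value)
```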

Operators can be nested, with one exception: aggregation operators cannot be nested inside one another. For example, the expression PerSecond(Avg(byte)) first calculates Avg(byte), and then applies PerSecond to the result.
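A minimal sketch of the nesting in PerSecond(Avg(byte)): the aggregation operator runs first, and the secondary operator is applied to its result. The function names, the sample values, and the 20s interval are hypothetical.

```python
from statistics import mean


def avg(values: list) -> float:
    """Aggregation operator: average of the samples in the interval."""
    return mean(values)


def per_second(value: float, interval_seconds: int) -> float:
    """Secondary operator: divide the metric quantity by the interval length."""
    return value / interval_seconds


byte_samples = [1200, 800, 1000]  # hypothetical byte values within one interval
rate = per_second(avg(byte_samples), interval_seconds=20)
print(rate)
```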

# 2.1 Aggregation Operators

| Operator | Applicable Metric Types | Description |
| --- | --- | --- |
| Avg | All types | Average |
| Sum | All types except Percentage | Sum |
| Max | All types | Maximum |
| Min | All types | Minimum |
| Percentile | All types | Estimated percentile |
| PercentileExact | All types | Exact percentile |
| Spread | All types | Absolute span: Max minus Min within the statistical period |
| Rspread | All types | Relative span: Max divided by Min within the statistical period |
| Stddev | All types | Standard deviation |
| Apdex | Delay type | Latency satisfaction index |
| Last | All types | Most recent value |
| Uniq | Cardinality type | Estimated cardinality statistics |
| UniqExact | Cardinality type | Exact cardinality statistics |
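A few of the operators above are easy to state precisely in code. The sketch below covers Spread, Rspread, Stddev, and Last over the values of one statistical period. Whether Stddev is the population or the sample standard deviation is not stated here, so the population form is an assumption, as is returning `None` from Rspread when Min is zero.

```python
from statistics import pstdev
from typing import Optional


def spread(values: list) -> float:
    """Absolute span: Max minus Min."""
    return max(values) - min(values)


def rspread(values: list) -> Optional[float]:
    """Relative span: Max divided by Min (undefined here when Min is 0)."""
    lowest = min(values)
    return None if lowest == 0 else max(values) / lowest


def stddev(values: list) -> float:
    """Standard deviation (assumed to be the population form)."""
    return pstdev(values)


def last(values: list) -> float:
    """Most recent value; assumes values are ordered by time."""
    return values[-1]


values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical samples, oldest first
print(spread(values), rspread(values), stddev(values), last(values))
```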

# 2.2 Secondary Operators

| Operator | Description |
| --- | --- |
| PerSecond | Calculates the rate by dividing the metric quantity by the time interval (in seconds) |
| Math | Arithmetic operations; supports +, -, *, / |
| Percentage | Unit conversion to % |

# 3. Operator Calculation Logic for Different Metrics

# 3.1 Counter/Gauge Type Metrics

  • Flow_metric data table:
    • First use Sum to aggregate according to data_precision.
    • Then use the specific operator selected to call the ClickHouse function to calculate.
  • Flow_log data table:
    • Use the specific operator selected to call the ClickHouse function to calculate.
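One way to read the two Flow_metric steps above: rows that share the same timestamp at the data precision are first summed, and the selected operator then runs over those per-timestamp totals. The sketch below follows that reading with plain Python functions standing in for the ClickHouse functions; the row layout and values are hypothetical.

```python
from collections import defaultdict
from typing import Callable


def sum_by_precision(rows: list) -> dict:
    """Step 1: Sum all rows that fall on the same timestamp."""
    totals = defaultdict(int)
    for timestamp, value in rows:
        totals[timestamp] += value
    return dict(totals)


def flow_metric_counter(rows: list, operator: Callable) -> float:
    """Step 2: apply the selected operator to the per-timestamp totals."""
    return operator(list(sum_by_precision(rows).values()))


# Hypothetical (timestamp in seconds, byte) rows; several rows per timestamp.
rows = [(0, 100), (0, 150), (1, 30), (1, 40), (2, 200)]
print(flow_metric_counter(rows, max))  # operator over totals, not raw rows
print(flow_metric_counter(rows, min))
```

With these values the totals per timestamp are 250, 70, and 200, so Max and Min differ from what the same operators would return on the raw rows (200 and 30).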

# 3.2 Quotient/Percentage Type Metrics

  • Flow_metric data table:
    • First use Sum(x)/Sum(y) to aggregate according to data_precision.
    • Then call the ClickHouse function to calculate according to the specific operator selected.
  • Flow_log data table:
    • Use the specific operator selected to call the ClickHouse function func(x/y) to perform calculation.
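The difference between the two tables matters because a ratio of sums is generally not the same as an operator applied to per-row ratios. The following sketch contrasts the two paths with plain Python in place of the ClickHouse functions. The row layout and values are hypothetical, and every denominator is assumed to be non-zero.

```python
from collections import defaultdict
from typing import Callable


def flow_metric_quotient(rows: list, operator: Callable) -> float:
    """Sum(x)/Sum(y) per timestamp first, then the selected operator."""
    sums = defaultdict(lambda: [0, 0])
    for timestamp, x, y in rows:
        sums[timestamp][0] += x
        sums[timestamp][1] += y
    ratios = [x / y for x, y in sums.values()]  # assumes every Sum(y) is non-zero
    return operator(ratios)


def flow_log_quotient(rows: list, operator: Callable) -> float:
    """func(x/y): the operator runs directly over per-row ratios."""
    return operator([x / y for _, x, y in rows])  # assumes every y is non-zero


# Hypothetical (timestamp, byte, packet) rows, so x / y is an average packet length.
rows = [(0, 1000, 10), (0, 3000, 10), (1, 600, 6)]
print(flow_metric_quotient(rows, max))
print(flow_log_quotient(rows, max))
```

Here the Flow_metric path sees ratios of 200 and 100 per timestamp, while the Flow_log path sees 100, 300, and 100 per row, so Max yields different results on the two paths.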

# 3.3 Delay Type Metrics

  • Flow_metric data table:
    • Use the specific operator selected to call the ClickHouse function to calculate.
  • Flow_log data table:
    • Use the specific operator selected to call the ClickHouse function to calculate.