Cracking the code-monitoring puzzle with Autometrics
#python #monitoring #observability #prometheus

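In system monitoring and observability, logs pose two prominent challenges: comprehensibility and cost-effectiveness. While a system is running, sifting through the large volume of collected logs to find the relevant information can be a daunting task. Large volumes of logs tend to produce so much noise that meaningful signals are hard to discern. Many vendors offer services that extract valuable insights from log data, albeit at a considerable cost. Processing and storing logs carries a significant price tag.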

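As an alternative, metrics have become a key component of the observability toolkit that many teams turn to. Because they focus on quantitative measurements rather than large amounts of text, metrics are cheaper to handle than logs. They also give a concise, comprehensive view of system performance.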

Nonetheless, working with metrics poses its own challenges for developers, who run into the following obstacles:

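  1. Deciding which metrics to track, and picking the appropriate metric type for each (a counter or a histogram, for example), is often a dilemma.
  2. Writing queries in PromQL or a similar query language to retrieve the desired data can be complex and daunting.
  3. Making sure the retrieved data actually answers the intended question requires meticulous validation.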

These obstacles underline the difficulties developers face when using metrics for system monitoring and analysis.
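
To make the first obstacle concrete, here is a minimal, standard-library-only sketch of what the two metric types record. The class names `ToyCounter` and `ToyHistogram` and the bucket bounds are illustrative assumptions, not part of any library. The cumulative buckets mirror the way Prometheus exposes histograms, where each bucket counts observations less than or equal to its upper bound.

```python
class ToyCounter:
    """A value that only ever goes up, e.g. the number of requests served."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class ToyHistogram:
    """Counts observations in cumulative buckets (value <= upper bound)."""

    def __init__(self, bounds):
        # An implicit +Inf bucket catches every observation.
        self.bounds = sorted(bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        self.total += value
        self.count += 1
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1


# Hypothetical request latencies in seconds.
requests = ToyCounter()
latency = ToyHistogram(bounds=[0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.7, 2.0):
    requests.inc()
    latency.observe(seconds)

print(requests.value)         # how many requests were seen
print(latency.bucket_counts)  # how many were at or below each bound
```

A counter answers "how many?", while a histogram answers "how many were at least this fast?", which is why latency is normally tracked with the latter.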


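Autometrics offers a streamlined way to improve code-level observability by applying decorators and wrappers at the function level. The framework uses standardized metric names and a consistent tagging, or labeling, scheme for its metrics.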

Currently, Autometrics uses three basic metrics:

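  1. function.calls.count: a counter that tracks the rate of requests and errors.
  2. function.calls.duration: a histogram that tracks latency, giving insight into performance.
  3. (optional) function.calls.concurrent: a gauge that, when enabled, tracks the number of concurrent requests.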

All three Autometrics metrics carry the following labels:

  • function: identifies the function being monitored.
  • module: identifies the module that contains the function.

In addition, the function.calls.count metric carries the following labels:

  • caller: identifies the caller of the function.
  • result: captures the outcome of the function call (for example, ok or error).
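
The idea of a function-level decorator that records calls and durations under consistent labels can be sketched with the standard library alone. This is a toy illustration, not the Autometrics implementation: `toy_autometrics`, `CALLS`, and `DURATIONS` are hypothetical names, plain dictionaries stand in for real metric objects, and the caller label is left out for brevity.

```python
import functools
import time
from collections import Counter, defaultdict

CALLS = Counter()              # stands in for function.calls.count
DURATIONS = defaultdict(list)  # stands in for function.calls.duration


def toy_autometrics(func):
    """Record a call count (labelled by result) and a duration per call."""
    labels = (func.__name__, func.__module__)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = "ok"
        try:
            return func(*args, **kwargs)
        except Exception:
            result = "error"
            raise
        finally:
            # Runs on success and on failure alike.
            CALLS[labels + (result,)] += 1
            DURATIONS[labels].append(time.perf_counter() - start)

    return wrapper


@toy_autometrics
def roll_dice():
    return 4


@toy_autometrics
def broken():
    raise RuntimeError("boom")


roll_dice()
roll_dice()
try:
    broken()
except RuntimeError:
    pass

for key, count in CALLS.items():
    print(key, count)
```

Because every decorated function reports under the same names and labels, one set of queries can serve all of them, which is the point of standardizing.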

Many other useful features are supported to help build a metrics-based observability platform.


Demo time

Let's instrument a simple Python Flask dice-rolling application. The application runs on localhost and is exposed on port 23456.

Prerequisites
  • Install and configure Prometheus
❯ cat prometheus.yml
scrape_configs:
  - job_name: my-app
    metrics_path: /metrics
    static_configs:
    # host.docker.internal resolves to the host machine, where the Flask app runs
      - targets: ['host.docker.internal:23456']
    scrape_interval: 20s

❯ docker run \
    --name=prometheus \
    -d \
    -p 9090:9090 \
    -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus
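
Once the container and the application are both running, the scrape configuration can be checked through Prometheus's HTTP API, which lists active targets and their health at `/api/v1/targets`. The sketch below is a hedged example: the helper names `fetch_targets` and `target_health` are made up for illustration, and `sample` is a hand-written payload trimmed to the fields used here so that the snippet runs offline.

```python
import json
import urllib.request


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the target list from a running Prometheus (needs network access)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets") as resp:
        return json.load(resp)


def target_health(payload):
    """Map each active target's scrape URL to its reported health."""
    return {
        target["scrapeUrl"]: target["health"]
        for target in payload["data"]["activeTargets"]
    }


# Hand-written sample, shaped like the API response but heavily trimmed.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "my-app"},
                "scrapeUrl": "http://host.docker.internal:23456/metrics",
                "health": "up",
            }
        ]
    },
}

print(target_health(sample))
```

Against a live server, `target_health(fetch_targets())` would report the same mapping for the real targets; a health other than "up" usually means Prometheus cannot reach the application.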

❯ curl localhost:9090
<a href="/graph">Found</a>.
  • Create a sample Flask application
❯ cat app.py
from flask import Flask, jsonify, Response
import random

app = Flask(__name__)

@app.route('/')
def home():
    return 'Welcome to the Dice Rolling API!'

@app.route('/roll_dice')
def roll_dice():
    roll = random.randint(1, 6)
    response = {
        'roll': roll
    }
    return jsonify(response)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=23456, debug=True)
Instrumenting the Flask application
  • Install the library
    pip install autometrics

  • Create a .env file that points to the Prometheus endpoint

❯ cat .env
PROMETHEUS_URL=http://localhost:9090/
  • Update the script to add the @autometrics decorator and expose the metrics at /metrics
❯ cat app.py
from flask import Flask, jsonify, Response
import random

from autometrics import autometrics
from prometheus_client import generate_latest

app = Flask(__name__)

@app.route('/')
def home():
    """
    default route
    """
    return 'Welcome to the Dice Rolling API!'

# Decorator order matters: @app.route must be outermost so that Flask
# registers the function already wrapped by @autometrics
@app.route('/roll_dice')
@autometrics
def roll_dice():
    """
    business logic
    """
    roll = random.randint(1, 6)
    response = {
        'roll': roll
    }
    return jsonify(response)

@app.get("/metrics")
def metrics():
    """
    prometheus metrics for scraping
    """
    return Response(generate_latest())


@app.route('/ping')
@autometrics
def pinger():
    """
    health check
    """
    return {'message': 'pong'}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=23456, debug=True)
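
Why the decorator order matters can be shown without Flask. Flask's `route` decorator registers whatever function it receives and returns it unchanged, so a metrics decorator placed above it wraps the function only after the unwrapped version has been registered. The sketch below is a toy model under that assumption: `route`, `track`, `ROUTES`, and `CALLS` are hypothetical stand-ins, not Flask or Autometrics APIs.

```python
import functools

ROUTES = {}  # toy stand-in for Flask's URL map
CALLS = []   # toy stand-in for recorded metrics


def route(path):
    """Register the function it receives and return it unchanged."""
    def register(func):
        ROUTES[path] = func
        return func
    return register


def track(func):
    """Toy metrics decorator: note each call, then delegate."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        CALLS.append(func.__name__)
        return func(*args, **kwargs)
    return wrapper


@route("/good")  # outermost: registers the tracked wrapper
@track
def good():
    return "ok"


@track           # outermost: wraps after the raw function was registered
@route("/bad")
def bad():
    return "ok"


ROUTES["/good"]()  # goes through the wrapper, so the call is recorded
ROUTES["/bad"]()   # calls the raw function, so nothing is recorded
print(CALLS)
```

Only the handler registered through the wrapper shows up in `CALLS`, which is the toy equivalent of a route that serves traffic but never reports metrics.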
  • Check the auto-generated help and the exposed metrics (VS Code shows the help on hover)
❯ python
Python 3.11.3 (main, Apr  7 2023, 20:13:31) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import app
>>> print(app.roll_dice.__doc__)
Prometheus Query URLs for Function - roll_dice and Module - app:

Request rate URL : http://localhost:9090/graph?g0.expr=sum%20by%20%28function%2C%20module%2C%20commit%2C%20version%29%20%28rate%20%28function_calls_count_total%7Bfunction%3D%22roll_dice%22%2Cmodule%3D%22app%22%7D%5B5m%5D%29%20%2A%20on%20%28instance%2C%20job%29%20group_left%28version%2C%20commit%29%20%28last_over_time%28build_info%5B1s%5D%29%20or%20on%20%28instance%2C%20job%29%20up%29%29&g0.tab=0

Latency URL : http://localhost:9090/graph?g0.expr=sum%20by%20%28le%2C%20function%2C%20module%2C%20commit%2C%20version%29%20%28rate%28function_calls_duration_bucket%7Bfunction%3D%22roll_dice%22%2Cmodule%3D%22app%22%7D%5B5m%5D%29%20%2A%20on%20%28instance%2C%20job%29%20group_left%28version%2C%20commit%29%20%28last_over_time%28build_info%5B1s%5D%29%20or%20on%20%28instance%2C%20job%29%20up%29%29&g0.tab=0

Error Ratio URL : http://localhost:9090/graph?g0.expr=sum%20by%20%28function%2C%20module%2C%20commit%2C%20version%29%20%28rate%20%28function_calls_count_total%7Bfunction%3D%22roll_dice%22%2Cmodule%3D%22app%22%2C%20result%3D%22error%22%7D%5B5m%5D%29%20%2A%20on%20%28instance%2C%20job%29%20group_left%28version%2C%20commit%29%20%28last_over_time%28build_info%5B1s%5D%29%20or%20on%20%28instance%2C%20job%29%20up%29%29%20/%20sum%20by%20%28function%2C%20module%2C%20commit%2C%20version%29%20%28rate%20%28function_calls_count_total%7Bfunction%3D%22roll_dice%22%2Cmodule%3D%22app%22%7D%5B5m%5D%29%20%2A%20on%20%28instance%2C%20job%29%20group_left%28version%2C%20commit%29%20%28last_over_time%28build_info%5B1s%5D%29%20or%20on%20%28instance%2C%20job%29%20up%29%29&g0.tab=0

-------------------------------------------
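
The generated links are hard to read because the PromQL expression is percent-encoded inside the `g0.expr` query parameter. As a small aid, the following sketch decodes the request-rate URL printed above with the standard library; the helper name `promql_from_graph_url` is made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def promql_from_graph_url(url):
    """Return the decoded PromQL expression from a Prometheus /graph URL."""
    query = parse_qs(urlsplit(url).query)
    return query["g0.expr"][0]


# The request-rate URL from the docstring above, split across lines.
request_rate_url = (
    "http://localhost:9090/graph?g0.expr=sum%20by%20%28function%2C%20module%2C%20"
    "commit%2C%20version%29%20%28rate%20%28function_calls_count_total%7B"
    "function%3D%22roll_dice%22%2Cmodule%3D%22app%22%7D%5B5m%5D%29%20%2A%20"
    "on%20%28instance%2C%20job%29%20group_left%28version%2C%20commit%29%20"
    "%28last_over_time%28build_info%5B1s%5D%29%20or%20on%20%28instance%2C%20"
    "job%29%20up%29%29&g0.tab=0"
)

print(promql_from_graph_url(request_rate_url))
```

The decoded expression is a per-function `rate(...)` over a 5-minute window on `function_calls_count_total`, joined with `build_info` so that results can be grouped by version and commit.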
  • Run the application and generate some load
for i in $(seq 1 10); do
http http://127.0.0.1:23456
http http://127.0.0.1:23456/ping
http http://127.0.0.1:23456/roll_dice
sleep 1
done
  • Check that the metrics are being exposed
❯ curl -q localhost:23456/metrics | grep function
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1471  100  1471    0     0  95687      0 --:--:-- --:--:-- --:--:--  239k
# HELP function_calls_count_total Autometrics counter for tracking function calls
# TYPE function_calls_count_total counter
function_calls_count_total{caller="dispatch_request",function="roll_dice",module="app",objective_name="",objective_percentile="",result="ok"} 30.0
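
To check a sample like this programmatically rather than by eye, the line can be split into its metric name, labels, and value. The parser below is a deliberately simple sketch: `parse_sample` is a hypothetical helper, and it assumes a labelled sample with no escaped quotes in label values and no trailing timestamp. For anything beyond a quick check, a real exposition-format parser is the safer choice.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Split one labelled exposition line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a labelled sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match["labels"]))
    return match["name"], labels, float(match["value"])


# The sample line from the curl output above.
line = (
    'function_calls_count_total{caller="dispatch_request",function="roll_dice",'
    'module="app",objective_name="",objective_percentile="",result="ok"} 30.0'
)
name, labels, value = parse_sample(line)
print(name, labels["function"], labels["result"], value)
```

Here the caller label is `dispatch_request`, the Flask function that invoked the route handler, and `result="ok"` confirms that the recorded calls succeeded.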



Note: this framework can also be used with OpenTelemetry.