Ops Monitoring: Installing and Setting Up Prometheus

Linke Fun

2022-03-07

monitor

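Features

Monitoring built on a time-series data model
Key/value data model (simple format, fast, easy to work with)
Sampled data is queried with mathematical expressions
Data collection over HTTP in two modes: pull (the default) and push (through Pushgateway)
The push mode is very flexible
Built-in graphical interface for debugging queries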

Prometheus download page

https://prometheus.io/download/

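Common monitoring categories

Monitoring categories

Business monitoring
User request QPS (queries per second), DAU (daily active users), access status, business APIs, product conversion rate, top-up amounts, user complaints, and so on
System monitoring
Basic operating-system metrics: CPU, memory, disk, I/O, TCP connections, traffic, and so on
Network monitoring
Network health, such as packet loss and latency
Log monitoring
Designed and built as a separate system
Application monitoring
Usually needs developer cooperation: the program exposes interfaces that return data directly, or writes logs in an agreed format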

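Main metric types

gauges
The simplest metric: a single return value, or an instantaneous state
counters
A counter that accumulates from 0; ideally it only ever grows or stays the same
histograms
Describe how the observed values are distributed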

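Common functions

{} filters series by label
>, >= compare values, which also acts as a filter
increase()
For steadily growing counter values: returns the total increase over a given time window
Example: the total increase in CPU time over 1 minute: increase(node_cpu_seconds_total[1m])
rate()
Designed for counters: over the given time window, returns the counter's average per-second increase
sum()
Adds up every series in the result set.
For example, when computing CPU usage, a bare sum() returns one total for all servers without distinguishing individual machines, which is clearly not a useful result
topk()
topk(3,count_netstat_wait_connections)
Returns the N series with the highest values (here the top 3); typically used for instantaneous alerts
count()
Counts the number of series that match the condition
by (instance)
Used together with sum() to aggregate per value of a label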
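To make the difference between increase() and rate() concrete, the shell sketch below reproduces the basic arithmetic for two samples of a single counter. The sample values (1200 and 1260) and the helper names are made up for illustration, and the calculation is deliberately simplified: Prometheus also extrapolates to the window boundaries and compensates for counter resets, which this sketch does not.

```shell
#!/bin/sh
# Simplified arithmetic behind increase() and rate() for one counter.
# Assumption: exactly two samples, no counter reset, no extrapolation.

# counter_increase FIRST LAST -> growth of the counter across the window
counter_increase() {
    awk -v first="$1" -v last="$2" 'BEGIN { printf "%g\n", last - first }'
}

# counter_rate FIRST LAST WINDOW_SECONDS -> average growth per second
counter_rate() {
    awk -v first="$1" -v last="$2" -v secs="$3" \
        'BEGIN { printf "%g\n", (last - first) / secs }'
}

counter_increase 1200 1260      # prints 60
counter_rate 1200 1260 60       # prints 1
```

With these inputs the counter grew by 60 over the window, which is an average of 1 per second over 60 seconds.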

Installing the Prometheus server

Prometheus is the main monitoring service

Download

Prometheus is simple to install: download it, unpack it, and it is ready to run

wget https://github.com/prometheus/prometheus/releases/download/v2.33.4/prometheus-2.33.4.linux-amd64.tar.gz

Install and configure startup with systemctl

# Create the prometheus user
useradd -M -r -s /bin/false prometheus

# Create the configuration and data directories
mkdir /etc/prometheus /var/lib/prometheus

# Unpack and copy the files into place (the version matches the download above)
tar zxvf prometheus-2.33.4.linux-amd64.tar.gz

cp prometheus-2.33.4.linux-amd64/{prometheus,promtool} /usr/local/bin/
cp -r prometheus-2.33.4.linux-amd64/{consoles,console_libraries} /etc/prometheus/
cp prometheus-2.33.4.linux-amd64/prometheus.yml /etc/prometheus/prometheus.yml

chown prometheus:prometheus /usr/local/bin/{prometheus,promtool}
chown -R prometheus:prometheus /etc/prometheus
chown prometheus:prometheus /var/lib/prometheus

# Create the systemd service file
cat > /etc/systemd/system/prometheus.service <<EOF

[Unit]
Description=Prometheus Time Series Collection and Processing Server
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/ \
    --storage.tsdb.retention.time=15d \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries \
    --query.timeout=2m \
    --query.max-concurrency=20 \
    --web.read-timeout=5m \
    --web.max-connections=512
Restart=on-failure

[Install]
WantedBy=multi-user.target

EOF

# Start the service and enable it at boot
systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus

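Startup flags used above

--web.read-timeout=5m
Maximum time to wait on a request connection, so that idle connections do not tie up resources
--web.max-connections=512
Maximum number of simultaneous connections
--storage.tsdb.retention.time=15d (deprecated spelling: --storage.tsdb.retention)
How long collected data is kept. Prometheus holds data in memory and on disk: if retention is too long, disk and memory cannot cope; if it is too short, historical data is lost, so pick a sensible value
--storage.tsdb.path="/data"
Path where the data is stored (the unit file above uses /var/lib/prometheus/)
--query.timeout=2m
--query.max-concurrency=20
These two options tune how user queries are executed:
they limit how many queries run at once and stop a single oversized query from running indefinitely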
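When choosing a retention period, the Prometheus storage documentation offers a rough capacity rule: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies that rule; the helper name and the ingestion rate of 1000 samples per second are assumptions for illustration, not measurements from this setup.

```shell
#!/bin/sh
# Rough disk estimate for the Prometheus TSDB.
# estimate_tsdb_bytes RETENTION_DAYS SAMPLES_PER_SECOND BYTES_PER_SAMPLE
estimate_tsdb_bytes() {
    awk -v days="$1" -v sps="$2" -v bps="$3" \
        'BEGIN { printf "%.0f\n", days * 86400 * sps * bps }'
}

# 15 days of retention, an assumed 1000 samples/s, 2 bytes per sample
estimate_tsdb_bytes 15 1000 2   # prints 2592000000 (about 2.6 GB)
```

The real ingestion rate depends on the number of targets, the series each target exposes, and the scrape interval, so treat the result as an order-of-magnitude estimate.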

Installing node_exporter

node_exporter collects basic per-node metrics

Download

wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz

Install and configure startup with systemctl

useradd -M -r -s /bin/false node_exporter

tar zxvf node_exporter-1.3.1.linux-amd64.tar.gz

cp node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/bin/
chown node_exporter:node_exporter /usr/local/bin/node_exporter

cat > /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target

EOF

# Start the service and enable it at boot
systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter
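Once node_exporter is running, it serves its metrics as plain text over HTTP on port 9100 (the port used in the scrape configuration in this article) under /metrics. The sketch below pulls a single value out of that text as a quick sanity check. The function name is hypothetical, and it only handles samples without labels, such as node_load5.

```shell
#!/bin/sh
# metric_value NAME: read Prometheus text-format metrics on stdin and
# print the value of the unlabeled sample called NAME.
metric_value() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Typical use against a running node_exporter (not executed here):
#   curl -s http://localhost:9100/metrics | metric_value node_load5
```

Comment lines such as # HELP and # TYPE are skipped automatically, because their first field is # rather than the metric name.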

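Adding scrape targets to Prometheus

There are two ways to edit the Prometheus configuration

1. Static configuration: the Prometheus service must be restarted after each change

vim /etc/prometheus/prometheus.yml
scrape_configs:
    # Statically add node_exporter
  - job_name: 'Linux'
    static_configs:
    # Note: on the Prometheus host itself, the target is also node_exporter's port 9100, not Prometheus's port 9090
    - targets: ['localhost:9100']
      labels:
        instance: test

2. Dynamic configuration (file-based service discovery)

vim /etc/prometheus/prometheus.yml
scrape_configs:
    # Dynamically add node_exporter
  - job_name: 'DT_configs'
    file_sd_configs:
      - files: ['/etc/prometheus.d/*.yml']
        refresh_interval: 5s

# Create the directory that holds the target files
mkdir -p /etc/prometheus.d

vi /etc/prometheus.d/test.yml
- targets: ['localhost:9100']
  labels:
    instance: test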
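With file-based discovery, target files can be generated by a script instead of edited by hand. The following is a minimal sketch of such a generator. The function name, the env label, and the addresses in the usage comment are hypothetical; only the file layout follows the test.yml example above.

```shell
#!/bin/sh
# write_file_sd OUTPUT_FILE ENV_LABEL TARGET...
# Writes one target group in the file_sd YAML format.
# The data goes to a temporary name first and is then renamed, so
# Prometheus never reads a half-written file. The temporary name does
# not end in .yml, so the '*.yml' glob does not pick it up.
write_file_sd() {
    out="$1"
    env_label="$2"
    shift 2
    tmp="${out}.tmp.$$"
    {
        printf '%s\n' '- targets:'
        for target in "$@"; do
            printf "    - '%s'\n" "$target"
        done
        printf '%s\n' '  labels:'
        printf "    env: '%s'\n" "$env_label"
    } > "$tmp" && mv "$tmp" "$out"
}

# Example call (not executed here):
#   write_file_sd /etc/prometheus.d/web.yml prod 10.0.0.11:9100 10.0.0.12:9100
```

Prometheus picks up the new file on its own, so no restart is needed when targets change.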

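Configuring alerting rules for Prometheus targets

Step 1: add the alerting rules

# Create a directory for alerting rules to keep them organized
mkdir /etc/prometheus/rules

# Example: basic Linux system monitoring
vi /etc/prometheus/rules/linux_system.yml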
# Note: the job label in the expressions below must match a job_name in prometheus.yml.
# The scrape examples in this article use 'Linux' and 'DT_configs', so adjust job="node-exporter" accordingly.
groups:
- name: example
  rules:

  - alert: "Instance down"
    expr: up{job="node-exporter"} == 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "Job {{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute"

  - alert: "Disk free space below 5%"
    expr: 100 - ((node_filesystem_avail_bytes{job="node-exporter",mountpoint=~".*",fstype=~"ext4|xfs|ext2|ext3"} * 100) / node_filesystem_size_bytes {job="node-exporter",mountpoint=~".*",fstype=~"ext4|xfs|ext2|ext3"}) > 95
    for: 30s
    annotations:
      summary: "Instance {{ $labels.instance }}: low disk space alert"
      description: "Disk {{ $labels.device }} on {{ $labels.instance }} has less than 5% free; current usage: {{ $value }}"

  - alert: "Available memory below 20%"
    expr: ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes )) * 100 > 80
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }}: low memory alert"
      description: "{{ $labels.instance }} has less than 20% memory available; current usage: {{ $value }}"

  - alert: "CPU load average above 4"
    expr: node_load5 > 4
    for: 30s
    annotations:
      summary: "Instance {{ $labels.instance }}: CPU load alert"
      description: "The 5-minute CPU load average on {{ $labels.instance }} is above 4; current value: {{ $value }}"

  - alert: "Disk read I/O above 30MB/s"
    expr: irate(node_disk_read_bytes_total{device="sda"}[1m]) > 30000000
    for: 30s
    annotations:
      summary: "Instance {{ $labels.instance }}: disk read I/O alert"
      description: "The disk read rate on {{ $labels.instance }} is above 30MB/s; current value: {{ $value }}"

  - alert: "Disk write I/O above 30MB/s"
    expr: irate(node_disk_written_bytes_total{device="sda"}[1m]) > 30000000
    for: 30s
    annotations:
      summary: "Instance {{ $labels.instance }}: disk write I/O alert"
      description: "The disk write rate on {{ $labels.instance }} is above 30MB/s; current value: {{ $value }}"

  - alert: "NIC outbound rate above 10MB/s"
    # 10 MB/s expressed in bytes per second, consistent with the 30000000 thresholds above
    expr: irate(node_network_transmit_bytes_total{device!~"lo"}[1m]) > 10000000
    for: 30s
    annotations:
      summary: "Instance {{ $labels.instance }}: network traffic alert"
      description: "Outbound traffic on NIC {{ $labels.device }} of {{ $labels.instance }} is above 10MB/s; current value: {{ $value }}"

  - alert: "CPU usage above 90%"
    expr: 100 - ((avg by (instance,job,env)(irate(node_cpu_seconds_total{mode="idle"}[30s]))) *100) > 90
    for: 30s
    annotations:
      summary: "Instance {{ $labels.instance }}: CPU usage alert"
      description: "CPU usage on {{ $labels.instance }} is above 90%; current value: {{ $value }}"

Step 2: reference the alerting rules in prometheus.yml

vim /etc/prometheus/prometheus.yml
rule_files:
    # *.yml matches every .yml file in the directory; a single rule file can also be named explicitly
  - /etc/prometheus/rules/*.yml
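Before restarting Prometheus it is worth validating the edited files with promtool, which was copied to /usr/local/bin during installation. The wrapper below is a small sketch around promtool's check config and check rules subcommands; the function name and the 127 return code for a missing binary are choices made for this example.

```shell
#!/bin/sh
# validate_prometheus CONFIG_FILE [RULE_FILE...]
# Runs promtool's syntax checks; returns 127 if promtool is not installed.
validate_prometheus() {
    if ! command -v promtool >/dev/null 2>&1; then
        echo "promtool not found in PATH" >&2
        return 127
    fi
    config="$1"
    shift
    promtool check config "$config" || return 1
    if [ "$#" -gt 0 ]; then
        promtool check rules "$@" || return 1
    fi
}

# Example call (not executed here):
#   validate_prometheus /etc/prometheus/prometheus.yml /etc/prometheus/rules/*.yml \
#       && systemctl restart prometheus
```

Chaining the restart with && means a configuration that fails validation never replaces the running one.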

Installing and configuring Alertmanager

Alertmanager is the alerting component

Download

wget https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz

useradd --no-create-home --shell /bin/false alertmanager

tar zxvf alertmanager-0.23.0.linux-amd64.tar.gz
mkdir /etc/alertmanager
cp alertmanager-0.23.0.linux-amd64/{alertmanager,amtool} /usr/local/bin/
cp alertmanager-0.23.0.linux-amd64/alertmanager.yml /etc/alertmanager/

chown -R alertmanager:alertmanager /etc/alertmanager
chown -R alertmanager:alertmanager /usr/local/bin/{alertmanager,amtool}


cat > /etc/systemd/system/alertmanager.service <<EOF
[Unit]
Description=Prometheus Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
        --config.file=/etc/alertmanager/alertmanager.yml \
        --storage.path="/etc/alertmanager/data/" \
        --data.retention=60h \
        --web.external-url http://0.0.0.0:9093
Restart=on-failure

[Install]
WantedBy=multi-user.target

EOF


systemctl daemon-reload
systemctl enable alertmanager
systemctl start alertmanager

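Configuring alert handling between Alertmanager and Prometheus

Step 1: configure the notification channels (email, WeChat, DingTalk, and so on)

The configuration file has four sections: global, templates, route, receivers

global: Alertmanager-wide settings

# Alertmanager global settings
global:
  resolve_timeout: 5m # how long without receiving an alert before it is marked resolved
  smtp_smarthost: 'smtp.qiye.aliyun.com:465'
  smtp_from: 'mjin@erongdu.com'
  smtp_auth_username: 'mjin@erongdu.com'
  smtp_auth_password: '<your-smtp-password>'
  smtp_require_tls: false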

route: how incoming alerts are handled; alerts are matched against rules and routed accordingly

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email.test'

Sending alerts by email

# Alertmanager global settings
global:
  resolve_timeout: 5m # how long without receiving an alert before it is marked resolved
  smtp_smarthost: 'smtp.qiye.aliyun.com:465'
  smtp_from: 'mjin@erongdu.com'
  smtp_auth_username: 'mjin@erongdu.com'
  smtp_auth_password: '<your-smtp-password>'
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email.test'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
- name: 'email.test'
  email_configs:
  - to: 'ops@erongdu.com'
    send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

DingTalk alerts

DingTalk alert plugin: https://github.com/timonwong/prometheus-webhook-dingtalk

Install

tar zxvf prometheus-webhook-dingtalk-2.0.0.linux-amd64.tar.gz
cp prometheus-webhook-dingtalk-2.0.0.linux-amd64/prometheus-webhook-dingtalk /usr/local/bin/

# Configuration file
vi /etc/prometheus-webhook-config.yml
## Request timeout
# timeout: 5s

## Uncomment following line in order to write template from scratch (be careful!)
#no_builtin_template: true

## Customizable templates path
#templates:
#  - contrib/templates/legacy/template.tmpl

## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
#default_message:
#  title: '{{ template "legacy.title" . }}'
#  text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxx

# Add the systemd service file
# The robot's webhook URL is read from targets.webhook1.url in the configuration file above
cat > /etc/systemd/system/prometheus-webhook-dingtalk.service <<EOF
[Unit]
Description=prometheus-webhook-dingtalk
After=network-online.target

[Service]
Restart=on-failure
ExecStart=/usr/local/bin/prometheus-webhook-dingtalk --config.file=/etc/prometheus-webhook-config.yml

[Install]
WantedBy=multi-user.target

EOF

# Start the service and enable it at boot
systemctl daemon-reload
systemctl enable prometheus-webhook-dingtalk
systemctl start prometheus-webhook-dingtalk

# Send a test message; webhook1 in the URL is the target name from the configuration file
curl -H "Content-Type: application/json" -d '{"version": "4", "status": "firing", "description": "description_content"}' http://localhost:8060/dingtalk/webhook1/send
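The alertmanager.yml shown earlier only routes to the email.test receiver, so Alertmanager also needs a webhook receiver that points at the DingTalk bridge before alerts reach DingTalk. The sketch below prints such a receiver for a given target name. The receiver name dingtalk and the function name are hypothetical; the URL follows the test command above.

```shell
#!/bin/sh
# print_dingtalk_receiver TARGET_NAME
# Prints an Alertmanager receiver that forwards alerts to the local
# prometheus-webhook-dingtalk service (port 8060, as in the test above).
print_dingtalk_receiver() {
    cat <<EOF
receivers:
- name: 'dingtalk'
  webhook_configs:
  - url: 'http://localhost:8060/dingtalk/$1/send'
    send_resolved: true
EOF
}

print_dingtalk_receiver webhook1
```

The printed entry would be merged into the receivers list in /etc/alertmanager/alertmanager.yml, with route.receiver (or a sub-route) set to the same name.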

Step 2: connect Prometheus to Alertmanager

vim /etc/prometheus/prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Alertmanager address
          - localhost:9093
          # - alertmanager:9093
pushgateway

Pushgateway is typically used when Prometheus cannot scrape a node's metrics directly: the node pushes its data to Pushgateway, and Prometheus then scrapes Pushgateway

Download

wget https://github.com/prometheus/pushgateway/releases/download/v1.4.2/pushgateway-1.4.2.linux-amd64.tar.gz

Install and configure startup with systemctl

tar zxvf pushgateway-1.4.2.linux-amd64.tar.gz
cd pushgateway-1.4.2.linux-amd64
cp pushgateway /usr/local/bin/

# Configure systemd
cat > /etc/systemd/system/pushgateway.service <<EOF
[Unit]
Description=Prometheus Pushgateway
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/pushgateway
Restart=on-failure

[Install]
WantedBy=multi-user.target

EOF

# Start the service and enable it at boot
systemctl daemon-reload
systemctl enable pushgateway
systemctl start pushgateway
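To show the push side of this flow, here is a minimal sketch of a node sending one sample to Pushgateway with curl. It assumes Pushgateway listens on its default port 9091 on localhost; the function names and the job, instance, and metric names in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Address of the Pushgateway; localhost:9091 is an assumption for this sketch.
PUSHGATEWAY_ADDR="${PUSHGATEWAY_ADDR:-localhost:9091}"

# push_url JOB INSTANCE -> URL of the metric group to push to
push_url() {
    printf 'http://%s/metrics/job/%s/instance/%s\n' "$PUSHGATEWAY_ADDR" "$1" "$2"
}

# push_metric JOB INSTANCE METRIC_NAME VALUE
# Sends a single sample in the text exposition format.
push_metric() {
    printf '%s %s\n' "$3" "$4" |
        curl --silent --show-error --fail --data-binary @- "$(push_url "$1" "$2")"
}

# Example call (not executed here):
#   push_metric backup db01 backup_duration_seconds 42
```

Prometheus still has to scrape Pushgateway like any other target, so a scrape job pointing at port 9091 is needed; setting honor_labels: true on that job keeps the pushed job and instance labels intact.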

Installing and configuring Grafana

Grafana provides the graphical dashboards

Download

wget https://dl.grafana.com/oss/release/grafana-8.4.1-1.x86_64.rpm
yum localinstall grafana-8.4.1-1.x86_64.rpm -y
systemctl start grafana-server

Default login (user/password): admin/admin
