Ops Monitoring: Installing and Setting Up Prometheus

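Features

  • Monitoring built on a time-series data model
  • A key/value (label-based) data model: simple format, fast, easy to process
  • Sampled data is queried with mathematical expressions (PromQL)
  • Data is collected over HTTP in two ways: pull and push
  • The push method is very flexible
  • Built-in graphing for debugging queries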

Prometheus download page

https://prometheus.io/download/

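Common monitoring categories

  • Business monitoring
    User QPS (queries per second), DAU (daily active users), access status, business APIs, product conversion rate, top-up amounts, user complaints, and so on
  • System monitoring
    Basic operating-system metrics: CPU, memory, disk, I/O, TCP connections, traffic, and so on
  • Network monitoring
    Network health, such as packet loss and latency
  • Log monitoring
    Designed and deployed as a separate system
  • Application monitoring
    Usually needs cooperation from developers: the application embeds interfaces that expose data directly, or writes logs in an agreed format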

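Main metric types

  • gauges
    The simplest metric: a single value that represents an instantaneous state and can go up or down

  • counters
    A counter that accumulates from 0; under normal conditions it only increases or stays the same

  • histograms
    Describe the distribution of observed values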
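
To see which metric types an exporter actually exposes, one option is to count the `# TYPE` comment lines in its text output. The helper below is only a sketch: the function name `count_metric_types` and the sample values are made up for illustration.

```shell
# Count how many metric families of each type appear in Prometheus
# text-format output. Each family is announced by a line of the form:
#   # TYPE <name> <type>
count_metric_types() {
  awk '$1 == "#" && $2 == "TYPE" { n[$4]++ }
       END { for (t in n) print t, n[t] }' | sort
}

# Illustrative sample (made-up values) in the exposition format.
sample='# TYPE node_load1 gauge
node_load1 0.42
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
# TYPE node_memory_MemFree_bytes gauge
node_memory_MemFree_bytes 1.0e+09'

printf '%s\n' "$sample" | count_metric_types
```

Against a running exporter the same function can be fed from curl, for example `curl -s http://localhost:9100/metrics | count_metric_types`.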

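Common functions and operators

  • {} filters series by label

  • >, >= and the other comparison operators compare values, which also acts as a filter

  • increase()
    For counter metrics, which grow continuously: returns the total increase over a time range
    Example: the total increase in CPU time over 1 minute: increase(node_cpu_seconds_total[1m])

  • rate()
    Designed for counter metrics: returns the average per-second increase of the counter over the given time range

  • sum()
    Adds up every series in the result.
    For example, when computing CPU usage, a bare sum() returns one total across all servers without distinguishing individual hosts, which is clearly not a meaningful result

  • topk()
    topk(3, count_netstat_wait_connections)
    Returns the k series with the highest values (here the top 3); typically used for instantaneous alerts

  • count()
    Counts the number of series that match the condition

  • by (instance)
    Used together with sum() to aggregate per label value, for example per instance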
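
The difference between increase() and rate() is easiest to see with numbers. The sketch below models both from just the first and last sample in a window; the function names and the sample values are hypothetical, and real PromQL additionally handles counter resets and extrapolates to the window boundaries.

```shell
# Simplified model of increase() and rate() on a counter, using only the
# first and last sample in the window (no reset handling, no extrapolation).
counter_increase() {  # args: first_value last_value
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", b - a }'
}
counter_rate() {      # args: first_value last_value window_seconds
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.2f\n", (b - a) / s }'
}

# Hypothetical counter that went from 1200 to 1230 over a 60-second window.
counter_increase 1200 1230     # prints 30.00
counter_rate 1200 1230 60      # prints 0.50
```

In other words, rate() is roughly increase() divided by the window length in seconds.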

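Installing the Prometheus server

Prometheus is the main monitoring service.

Download

Installing Prometheus is simple: download the archive, extract it, and it is ready to run.

wget https://github.com/prometheus/prometheus/releases/download/v2.33.4/prometheus-2.33.4.linux-amd64.tar.gz

Install and configure startup with systemd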

# Create the prometheus user
useradd -M -r -s /bin/false prometheus

# Create the configuration and data directories
mkdir -p /etc/prometheus /var/lib/prometheus

# Extract the archive and copy the files into place
# (the version must match the archive downloaded above)
tar zxvf prometheus-2.33.4.linux-amd64.tar.gz

cp prometheus-2.33.4.linux-amd64/{prometheus,promtool} /usr/local/bin/
cp -r prometheus-2.33.4.linux-amd64/{consoles,console_libraries} /etc/prometheus/
cp prometheus-2.33.4.linux-amd64/prometheus.yml /etc/prometheus/prometheus.yml

chown prometheus:prometheus /usr/local/bin/{prometheus,promtool}
chown -R prometheus:prometheus /etc/prometheus
chown prometheus:prometheus /var/lib/prometheus

# Create the systemd unit file
cat > /etc/systemd/system/prometheus.service <<EOF
[Unit]
Description=Prometheus Time Series Collection and Processing Server
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--storage.tsdb.retention=15d \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--query.timeout=2m \
--query.max-concurrency=20 \
--web.read-timeout=5m \
--web.max-connections=512
Restart=on-failure

[Install]
WantedBy=multi-user.target

EOF

# Start the service and enable it at boot
systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus
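
After starting the service it helps to confirm that the server really answers. Prometheus exposes a `/-/ready` endpoint for this; the polling helper below is a sketch, and its name, the retry count, and the one-second delay are assumptions rather than part of the original setup.

```shell
# Poll an HTTP endpoint until it answers with a success status.
# Returns 0 as soon as the URL responds, 1 if it never does.
wait_for_ready() {  # args: url [max_attempts]
  url=$1
  tries=${2:-10}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl fail on HTTP errors, -sS keeps it quiet except for errors
    if curl -fsS "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

A typical call would be `wait_for_ready http://localhost:9090/-/ready 30 && echo "Prometheus is up"`.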


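Startup flags used above

  • --web.read-timeout=5m
    Maximum time to wait on a request connection, so that idle connections do not tie up resources

  • --web.max-connections=512
    Maximum number of simultaneous connections

  • --storage.tsdb.retention=15d
    How long collected data is kept. Prometheus holds samples in memory and on disk, so keeping data for too long exhausts both, while too short a period loses history; choose a value that fits your capacity. The default is 15d, and newer releases prefer the equivalent --storage.tsdb.retention.time flag

  • --storage.tsdb.path=/var/lib/prometheus/
    Directory where the data is stored

  • --query.timeout=2m

  • --query.max-concurrency=20
    These two flags tune query execution: they keep too many queries from running at the same time
    and stop a single oversized query from running indefinitely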
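
When choosing a retention period, a quick disk estimate is useful. The Prometheus storage documentation gives the rule of thumb needed space = retention seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies it; the function name and the example load are hypothetical.

```shell
# Rough disk estimate for the TSDB:
#   bytes = retention_seconds * ingested_samples_per_second * bytes_per_sample
# 2 bytes per sample is used as a conservative default assumption.
estimate_tsdb_gib() {  # args: retention_days samples_per_second [bytes_per_sample]
  awk -v d="$1" -v sps="$2" -v bps="${3:-2}" \
    'BEGIN { printf "%.1f\n", d * 86400 * sps * bps / (1024 * 1024 * 1024) }'
}

# Hypothetical load: 15 days of retention at 10,000 samples per second.
estimate_tsdb_gib 15 10000     # prints 24.1 (GiB)
```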

Installing node_exporter

node_exporter is the agent that exposes basic host metrics.

Download

wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz

Install and configure startup with systemd

useradd -M -r -s /bin/false node_exporter

tar zxvf node_exporter-1.3.1.linux-amd64.tar.gz

cp node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/bin/
chown node_exporter:node_exporter /usr/local/bin/node_exporter

cat > /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target

EOF

# Start the service and enable it at boot
systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter


Adding scrape targets to Prometheus

There are two ways to add targets in the Prometheus configuration file.

  • 1. Static configuration: the Prometheus service must be restarted after each change
    vim /etc/prometheus/prometheus.yml
    scrape_configs:
      # Statically add node_exporter
      - job_name: 'Linux'
        static_configs:
          # Note: on the Prometheus host itself, the target is still node_exporter's port 9100, not Prometheus's own port 9090
          - targets: ['localhost:9100']
            labels:
              instance: test
  • 2. Dynamic configuration: target files are re-read automatically
    vim /etc/prometheus/prometheus.yml
    scrape_configs:
      # Dynamically add node_exporter targets
      - job_name: 'DT_configs'
        file_sd_configs:
          - files: ['/etc/prometheus.d/*.yml']
            refresh_interval: 5s

    vi /etc/prometheus.d/test.yml
    - targets: ['localhost:9100']
      labels:
        instance: test
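
With file-based discovery, target files are plain YAML and can be generated by a script. The sketch below prints such a file for a list of hosts; the function name, the `env` label, and the IP addresses are hypothetical.

```shell
# Print a file_sd target list (YAML) for the given host:port arguments.
make_file_sd() {  # args: env_label host:port [host:port ...]
  env_label=$1
  shift
  printf '%s\n' '- targets:'
  for t in "$@"; do
    printf "    - '%s'\n" "$t"
  done
  printf '  labels:\n    env: %s\n' "$env_label"
}

make_file_sd test 10.0.0.11:9100 10.0.0.12:9100
```

Redirecting the output into the watched directory, for example `make_file_sd test 10.0.0.11:9100 > /etc/prometheus.d/nodes.yml`, should make the new targets appear without restarting Prometheus.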

Configuring alerting rules for Prometheus targets

Step 1: Add alerting rules

# Create a directory for alerting rules to keep them organized
mkdir /etc/prometheus/rules

# Example: basic Linux system monitoring rules
# Note: the job label used in each expr must match a job_name in scrape_configs
vi /etc/prometheus/rules/linux_system.yml
groups:
  - name: example
    rules:

      - alert: "Instance down"
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "Job {{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute"

      - alert: "Disk free space below 5%"
        expr: 100 - ((node_filesystem_avail_bytes{job="node-exporter",mountpoint=~".*",fstype=~"ext4|xfs|ext2|ext3"} * 100) / node_filesystem_size_bytes {job="node-exporter",mountpoint=~".*",fstype=~"ext4|xfs|ext2|ext3"}) > 95
        for: 30s
        annotations:
          summary: "Instance {{ $labels.instance }}: low disk space alert"
          description: "Disk {{ $labels.device }} on {{ $labels.instance }} has less than 5% free space, current value: {{ $value }}"

      - alert: "Available memory below 20%"
        expr: ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes )) * 100 > 80
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: low memory alert"
          description: "{{ $labels.instance }} has less than 20% memory available, current value: {{ $value }}"

      - alert: "CPU load average above 4"
        expr: node_load5 > 4
        for: 30s
        annotations:
          summary: "Instance {{ $labels.instance }}: CPU load alert"
          description: "The 5-minute CPU load average on {{ $labels.instance }} is above 4, current value: {{ $value }}"

      - alert: "Disk read I/O above 30MB/s"
        expr: irate(node_disk_read_bytes_total{device="sda"}[1m]) > 30000000
        for: 30s
        annotations:
          summary: "Instance {{ $labels.instance }}: disk read I/O alert"
          description: "Disk read throughput on {{ $labels.instance }} is above 30MB/s, current value: {{ $value }}"

      - alert: "Disk write I/O above 30MB/s"
        expr: irate(node_disk_written_bytes_total{device="sda"}[1m]) > 30000000
        for: 30s
        annotations:
          summary: "Instance {{ $labels.instance }}: disk write I/O alert"
          description: "Disk write throughput on {{ $labels.instance }} is above 30MB/s, current value: {{ $value }}"

      - alert: "NIC outbound rate above 10MB/s"
        # bytes per second; 10000000 corresponds to the 10MB/s in the alert name
        expr: irate(node_network_transmit_bytes_total{device!~"lo"}[1m]) > 10000000
        for: 30s
        annotations:
          summary: "Instance {{ $labels.instance }}: network traffic alert"
          description: "Outbound traffic on NIC {{ $labels.device }} of {{ $labels.instance }} is above 10MB/s, current value: {{ $value }}"

      - alert: "CPU usage above 90%"
        expr: 100 - ((avg by (instance,job,env)(irate(node_cpu_seconds_total{mode="idle"}[30s]))) *100) > 90
        for: 30s
        annotations:
          summary: "Instance {{ $labels.instance }}: CPU usage alert"
          description: "CPU usage on {{ $labels.instance }} is above 90%, current value: {{ $value }}"

Step 2: Reference the alerting rules in prometheus.yml

vim /etc/prometheus/prometheus.yml
rule_files:
  # *.yml matches every .yml file in the directory; a single rule file can also be listed explicitly
  - /etc/prometheus/rules/*.yml
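
Before restarting Prometheus it is worth validating the files with promtool, which was copied to /usr/local/bin during installation. The wrapper below is a small sketch: the function name and the `PROMTOOL` override variable are assumptions added for illustration, while `promtool check config` and `promtool check rules` are the real subcommands.

```shell
# Run "promtool check <kind> <files...>" where kind is "config" or "rules".
# PROMTOOL may be overridden to point at a different binary.
promtool_check() {  # args: kind file [file ...]
  kind=$1
  shift
  "${PROMTOOL:-promtool}" check "$kind" "$@"
}
```

For example, `promtool_check rules /etc/prometheus/rules/linux_system.yml` checks one rule file, and `promtool_check config /etc/prometheus/prometheus.yml && systemctl restart prometheus` restarts only when the configuration and the rule files it references are valid.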

Installing and configuring Alertmanager

Alertmanager is the component that processes and sends alerts.

Download

wget https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz

useradd --no-create-home --shell /bin/false alertmanager

tar zxvf alertmanager-0.23.0.linux-amd64.tar.gz
mkdir /etc/alertmanager
cp alertmanager-0.23.0.linux-amd64/{alertmanager,amtool} /usr/local/bin/
cp alertmanager-0.23.0.linux-amd64/alertmanager.yml /etc/alertmanager/

chown -R alertmanager:alertmanager /etc/alertmanager
chown -R alertmanager:alertmanager /usr/local/bin/{alertmanager,amtool}


cat > /etc/systemd/system/alertmanager.service <<EOF
[Unit]
Description=Prometheus Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path="/etc/alertmanager/data/" \
--data.retention=60h \
--web.external-url http://0.0.0.0:9093
Restart=on-failure

[Install]
WantedBy=multi-user.target

EOF


systemctl daemon-reload
systemctl enable alertmanager
systemctl start alertmanager

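Configuring alert handling between Alertmanager and Prometheus

Step 1: Configure the notification channel (email, WeChat, DingTalk, etc.)

The configuration file has four sections: global, templates, route, and receivers.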

global defines Alertmanager's global settings

# Alertmanager global settings
global:
  resolve_timeout: 5m # how long an alert may go without updates before it is marked resolved
  smtp_smarthost: 'smtp.qiye.aliyun.com:465'
  smtp_from: 'mjin@erongdu.com'
  smtp_auth_username: 'mjin@erongdu.com'
  smtp_auth_password: 'your-smtp-password' # replace with your own SMTP password
  smtp_require_tls: false

route defines how incoming alerts are handled: alerts are matched against rules and the matching action is taken

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email.test'

Sending alerts by email

# Alertmanager global settings
global:
  resolve_timeout: 5m # how long an alert may go without updates before it is marked resolved
  smtp_smarthost: 'smtp.qiye.aliyun.com:465'
  smtp_from: 'mjin@erongdu.com'
  smtp_auth_username: 'mjin@erongdu.com'
  smtp_auth_password: 'your-smtp-password' # replace with your own SMTP password
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email.test'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
  - name: 'email.test'
    email_configs:
      - to: 'ops@erongdu.com'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

DingTalk alerts

DingTalk webhook plugin:
https://github.com/timonwong/prometheus-webhook-dingtalk

Install

tar zxvf prometheus-webhook-dingtalk-2.0.0.linux-amd64.tar.gz
cp prometheus-webhook-dingtalk-2.0.0.linux-amd64/prometheus-webhook-dingtalk /usr/local/bin/


# Configuration file
vi /etc/prometheus-webhook-config.yml
## Request timeout
# timeout: 5s

## Uncomment following line in order to write template from scratch (be careful!)
#no_builtin_template: true

## Customizable templates path
#templates:
#  - contrib/templates/legacy/template.tmpl

## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
#default_message:
#  title: '{{ template "legacy.title" . }}'
#  text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"
targets:
  webhook1:
    # Replace with the webhook URL of your own DingTalk robot
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxx

# Add the systemd unit file
# (2.x releases read targets from the configuration file instead of the old --ding.profile flag)
cat > /etc/systemd/system/prometheus-webhook-dingtalk.service <<EOF
[Unit]
Description=prometheus-webhook-dingtalk
After=network-online.target

[Service]
Restart=on-failure
ExecStart=/usr/local/bin/prometheus-webhook-dingtalk --config.file=/etc/prometheus-webhook-config.yml

[Install]
WantedBy=multi-user.target

EOF


# Start the service and enable it at boot
systemctl daemon-reload
systemctl enable prometheus-webhook-dingtalk
systemctl start prometheus-webhook-dingtalk


# Send a test message; webhook1 in the URL is the target name from the configuration file
curl -H "Content-Type: application/json" -d '{ "version": "4", "status": "firing", "description":"description_content"}' http://localhost:8060/dingtalk/webhook1/send

Step 2: Connect Prometheus to Alertmanager

vim /etc/prometheus/prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Alertmanager address
            - localhost:9093
            # - alertmanager:9093
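
To check the notification path without waiting for a real rule to fire, a hand-made alert can be posted to Alertmanager's v2 API (`/api/v2/alerts`), which accepts a JSON array of alerts. The sketch below only builds the payload; the function name, alert name, and summary text are hypothetical.

```shell
# Build a minimal Alertmanager v2 payload: a JSON array containing one alert.
# Arguments are inserted verbatim, so they must not contain double quotes.
build_test_alert() {  # args: alertname severity
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"manual test alert"}}]\n' "$1" "$2"
}

build_test_alert ManualTest warning
```

To actually send it, pipe the output to curl: `build_test_alert ManualTest warning | curl -s -X POST -H 'Content-Type: application/json' --data-binary @- http://localhost:9093/api/v2/alerts`. The alert should then show up in the Alertmanager UI and be routed to the configured receiver.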

pushgateway

Pushgateway is typically used when Prometheus cannot scrape a node's metrics directly: the node pushes its data to Pushgateway, and Prometheus then scrapes Pushgateway.

Download

wget https://github.com/prometheus/pushgateway/releases/download/v1.4.2/pushgateway-1.4.2.linux-amd64.tar.gz

Install and configure startup with systemd

tar zxvf pushgateway-1.4.2.linux-amd64.tar.gz
cd pushgateway-1.4.2.linux-amd64
cp pushgateway /usr/local/bin/

# Configure systemd
cat > /etc/systemd/system/pushgateway.service <<EOF
[Unit]
Description=Prometheus Pushgateway
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/pushgateway
Restart=on-failure

[Install]
WantedBy=multi-user.target

EOF

# Start the service and enable it at boot
systemctl daemon-reload
systemctl enable pushgateway
systemctl start pushgateway
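
Pushing a value to Pushgateway comes down to sending text in the exposition format to `/metrics/job/<job>` on port 9091 (the default). The sketch below formats one gauge sample; the function name, metric name, and value are hypothetical.

```shell
# Format one gauge sample in the Prometheus text exposition format.
format_gauge() {  # args: metric_name value
  printf '# TYPE %s gauge\n%s %s\n' "$1" "$1" "$2"
}

# Hypothetical metric: duration of the last backup in seconds.
format_gauge backup_last_duration_seconds 42.5
```

To push it, pipe the output to curl, for example `format_gauge backup_last_duration_seconds 42.5 | curl --data-binary @- http://localhost:9091/metrics/job/backup`. Prometheus still needs a scrape job that targets localhost:9091, usually with `honor_labels: true` so the pushed job label is kept.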


Installing and configuring Grafana

Grafana provides the dashboards and graphs.

Download

wget https://dl.grafana.com/oss/release/grafana-8.4.1-1.x86_64.rpm
yum localinstall grafana-8.4.1-1.x86_64.rpm -y
systemctl start grafana-server

Default login: admin/admin