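Features
Monitoring built on a time-series data model
A key/value data model (simple format, fast, easy to process)
Sampled data is queried with mathematical expressions (PromQL)
Metrics are collected over HTTP in either pull or push mode
The push mode is very flexible
Built-in graphical interface for debugging queries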
Prometheus download page: https://prometheus.io/download/
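Common monitoring categories
Business monitoring: user request QPS (queries per second), DAU (daily active users), access status, business API endpoints, product conversion rate, top-up amounts, user complaints, and so on
System monitoring: basic operating-system metrics such as CPU, memory, disk, I/O, TCP connections, and traffic
Network monitoring: network health, for example packet loss and latency
Log monitoring: designed and deployed as a separate system
Application monitoring: usually requires help from developers, who embed interfaces in the program that expose data directly or write logs in an agreed format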
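Main metric types
Common functions and operators
{} filters series by label
>, >= compare values, which also acts as a filter
increase() applies to counters, whose values only ever grow: it returns the total increase over a time window. For example, the total increase in CPU time over 1 minute: increase(node_cpu_seconds_total[1m])
rate() is designed for counters: over the chosen time window, it returns the counter's average per-second increase
sum() adds up the whole result set. When computing CPU usage, for example, a bare sum() yields one figure for all servers combined rather than one per server, which is rarely the result you want
topk() returns the k series with the highest values, for example topk(3, count_netstat_wait_connections); it is typically used for instant alerts
count() returns the number of series that satisfy the condition
by (instance) is used together with sum() to aggregate per value of the given label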
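To make the difference between increase() and rate() concrete, the sketch below applies the same arithmetic to two hand-picked counter samples. It is a simplified model only: the function names and sample values are hypothetical, and real PromQL additionally handles counter resets and extrapolates to the window boundaries.

```shell
#!/bin/sh
# Simplified model of rate() and increase() for a counter.
# Arguments: first value, first timestamp (s), last value, last timestamp (s).
counter_rate() {
    awk -v v1="$1" -v t1="$2" -v v2="$3" -v t2="$4" \
        'BEGIN { print (v2 - v1) / (t2 - t1) }'
}

counter_increase() {
    awk -v v1="$1" -v v2="$3" 'BEGIN { print v2 - v1 }'
}

# Hypothetical samples: the counter read 1200 at t=0s and 1500 at t=60s.
counter_rate 1200 0 1500 60       # average per-second growth: prints 5
counter_increase 1200 0 1500 60   # total growth in the window: prints 300
```

In this model rate() is simply increase() divided by the window length, which is why rate() suits graphs of throughput while increase() suits "how much in the last N minutes" questions.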
Installing the Prometheus server
Prometheus is the main monitoring service.
Download
Installing Prometheus is simple: download the archive, extract it, and the binary is ready to run.

    wget https://github.com/prometheus/prometheus/releases/download/v2.33.4/prometheus-2.33.4.linux-amd64.tar.gz

Install and configure the systemd service

    # Create a system user without a home directory or login shell
    useradd -M -r -s /bin/false prometheus
    mkdir /etc/prometheus /var/lib/prometheus

    # Unpack the release and copy binaries, console files and the default config
    tar zxvf prometheus-2.33.4.linux-amd64.tar.gz
    cp prometheus-2.33.4.linux-amd64/{prometheus,promtool} /usr/local/bin/
    cp -r prometheus-2.33.4.linux-amd64/{consoles,console_libraries} /etc/prometheus/
    cp prometheus-2.33.4.linux-amd64/prometheus.yml /etc/prometheus/prometheus.yml

    chown prometheus:prometheus /usr/local/bin/{prometheus,promtool}
    chown -R prometheus:prometheus /etc/prometheus
    chown prometheus:prometheus /var/lib/prometheus

    cat > /etc/systemd/system/prometheus.service <<EOF
    [Unit]
    Description=Prometheus Time Series Collection and Processing Server
    Wants=network-online.target
    After=network-online.target

    [Service]
    User=prometheus
    Group=prometheus
    Type=simple
    ExecStart=/usr/local/bin/prometheus \
        --config.file /etc/prometheus/prometheus.yml \
        --storage.tsdb.path /var/lib/prometheus/ \
        --storage.tsdb.retention=15d \
        --web.console.templates=/etc/prometheus/consoles \
        --web.console.libraries=/etc/prometheus/console_libraries \
        --query.timeout=2m \
        --query.max-concurrency=20 \
        --web.read-timeout=5m \
        --web.max-connections=512
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target
    EOF

    systemctl daemon-reload
    systemctl enable prometheus
    systemctl start prometheus
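Startup flags used above
--web.read-timeout=5m  Maximum time to wait on a request connection, so that idle connections do not tie up resources
--web.max-connections=512  Maximum number of simultaneous connections
--storage.tsdb.retention=15d  Collected data is kept in memory and on disk. Without a suitable limit, disk and memory fill up; with too short a limit, history is lost. Choose a value that fits your needs
--storage.tsdb.path=/var/lib/prometheus/  Directory where the data is stored
--query.timeout=2m
--query.max-concurrency=20  These two flags tune query execution: they keep too many users from querying at once and stop a single oversized query from running indefinitely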
Installing node_exporter
node_exporter collects the basic metrics of a node.
Download

    wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz

Install and configure the systemd service

    useradd -M -r -s /bin/false node_exporter
    tar zxvf node_exporter-1.3.1.linux-amd64.tar.gz
    cp node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/bin/
    chown node_exporter:node_exporter /usr/local/bin/node_exporter

    cat > /etc/systemd/system/node_exporter.service <<EOF
    [Unit]
    Description=Prometheus Node Exporter
    Wants=network-online.target
    After=network-online.target

    [Service]
    User=node_exporter
    Group=node_exporter
    Type=simple
    ExecStart=/usr/local/bin/node_exporter
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target
    EOF

    systemctl daemon-reload
    systemctl enable node_exporter
    systemctl start node_exporter
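Adding scrape targets to Prometheus
Targets are defined through the Prometheus configuration file, in one of two ways.
1. Static configuration. Changes take effect only after Prometheus is restarted or reloaded (for example by sending it SIGHUP).

    vim /etc/prometheus/prometheus.yml

    scrape_configs:
      - job_name: 'Linux'
        static_configs:
          - targets: ['localhost:9100']
            labels:
              instance: test

2. Dynamic configuration with file-based service discovery. Prometheus re-reads the target files every refresh_interval, so no restart is needed.

    vim /etc/prometheus/prometheus.yml

    scrape_configs:
      - job_name: 'DT_configs'
        file_sd_configs:
          - files: ['/etc/prometheus.d/*.yml']
            refresh_interval: 5s

    vi /etc/prometheus.d/test.yml

    - targets: ['localhost:9100']
      labels:
        instance: test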
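Because the dynamic method only needs a YAML file to appear in the watched directory, target files can be generated by a script. The following is a minimal sketch under stated assumptions: the function name make_sd_group, the host names and the env label are all hypothetical and not part of the original setup.

```shell
#!/bin/sh
# Print one file_sd target group containing every host:port argument.
make_sd_group() {
    printf '%s\n' '- targets:'
    for target in "$@"; do
        printf "    - '%s'\n" "$target"
    done
    # Hypothetical label attached to every target in the group
    printf '%s\n' '  labels:' '    env: test'
}

# Example with two made-up nodes
make_sd_group node1:9100 node2:9100
```

Redirecting the output to a file such as /etc/prometheus.d/nodes.yml (a hypothetical name that matches the glob above) should be enough for the new targets to be picked up at the next refresh.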
Configuring alert rules for the targets
Step 1: add the alert rules

    mkdir /etc/prometheus/rules
    vi /etc/prometheus/rules/linux_system.yml

    # The job label used in the expressions must match a job_name in prometheus.yml.
    groups:
      - name: example
        rules:
          - alert: "Instance down"
            expr: up{job="node-exporter"} == 0
            for: 1m
            labels:
              severity: page
            annotations:
              summary: "Instance {{ $labels.instance }} is down"
              description: "Job {{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute"
          - alert: "Disk free space below 5%"
            expr: 100 - ((node_filesystem_avail_bytes{job="node-exporter",mountpoint=~".*",fstype=~"ext4|xfs|ext2|ext3"} * 100) / node_filesystem_size_bytes{job="node-exporter",mountpoint=~".*",fstype=~"ext4|xfs|ext2|ext3"}) > 95
            for: 30s
            annotations:
              summary: "Instance {{ $labels.instance }}: low disk space"
              description: "Disk {{ $labels.device }} on {{ $labels.instance }} has less than 5% free, current usage: {{ $value }}"
          - alert: "Free memory below 20%"
            expr: ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes)) * 100 > 80
            for: 30s
            labels:
              severity: warning
            annotations:
              summary: "Instance {{ $labels.instance }}: low memory"
              description: "{{ $labels.instance }} has less than 20% memory free, current usage: {{ $value }}"
          - alert: "CPU load average above 4"
            expr: node_load5 > 4
            for: 30s
            annotations:
              summary: "Instance {{ $labels.instance }}: high CPU load"
              description: "The 5-minute load average on {{ $labels.instance }} is above 4, current value: {{ $value }}"
          - alert: "Disk read I/O above 30MB/s"
            expr: irate(node_disk_read_bytes_total{device="sda"}[1m]) > 30000000
            for: 30s
            annotations:
              summary: "Instance {{ $labels.instance }}: high disk read load"
              description: "Disk read throughput on {{ $labels.instance }} is above 30MB/s, current value: {{ $value }}"
          - alert: "Disk write I/O above 30MB/s"
            expr: irate(node_disk_written_bytes_total{device="sda"}[1m]) > 30000000
            for: 30s
            annotations:
              summary: "Instance {{ $labels.instance }}: high disk write load"
              description: "Disk write throughput on {{ $labels.instance }} is above 30MB/s, current value: {{ $value }}"
          - alert: "Network transmit rate above 10MB/s"
            # 10MB/s = 10000000 bytes per second
            expr: irate(node_network_transmit_bytes_total{device!~"lo"}[1m]) > 10000000
            for: 30s
            annotations:
              summary: "Instance {{ $labels.instance }}: high network traffic"
              description: "Traffic on interface {{ $labels.device }} of {{ $labels.instance }} is above 10MB/s, current value: {{ $value }}"
          - alert: "CPU usage above 90%"
            expr: 100 - ((avg by (instance,job,env)(irate(node_cpu_seconds_total{mode="idle"}[30s]))) * 100) > 90
            for: 30s
            annotations:
              summary: "Instance {{ $labels.instance }}: high CPU usage"
              description: "CPU usage on {{ $labels.instance }} is above 90%, current value: {{ $value }}"
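The disk rule computes the used percentage as 100 minus the available share and fires when it exceeds 95. The arithmetic can be checked by hand with a small sketch; the function name and the filesystem sizes below are hypothetical.

```shell
#!/bin/sh
# used% = 100 - (avail * 100) / size, the same arithmetic as the disk rule.
disk_used_percent() {
    awk -v avail="$1" -v size="$2" \
        'BEGIN { printf "%.1f\n", 100 - (avail * 100) / size }'
}

# Hypothetical filesystem: 50 GiB in total, 2 GiB still available.
disk_used_percent 2147483648 53687091200   # prints 96.0, above the 95 threshold
```

With 4% of the space available, the expression evaluates to 96, so the alert would fire once the condition has held for the 30s "for" duration.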
Step 2: reference the rule files in prometheus.yml

    vim /etc/prometheus/prometheus.yml

    rule_files:
      - /etc/prometheus/rules/*.yml
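Before restarting Prometheus it can help to validate the configuration, since a syntax error in a rule file prevents the server from starting. The sketch below wraps promtool's check config subcommand, which also checks the rule files the configuration references; the wrapper name check_prometheus_config is hypothetical, and it assumes promtool was copied to /usr/local/bin as in the installation step.

```shell
#!/bin/sh
# Validate prometheus.yml together with the rule files it references.
# Prints a notice and returns 0 when promtool is not installed.
check_prometheus_config() {
    config="${1:-/etc/prometheus/prometheus.yml}"
    if ! command -v promtool >/dev/null 2>&1; then
        echo "promtool not found; skipping validation"
        return 0
    fi
    promtool check config "$config"
}
```

A typical use would be to run check_prometheus_config and restart the prometheus service only if it succeeds.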
Installing and configuring Alertmanager
Alertmanager is the alerting service.
Download

    wget https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz

Install and configure the systemd service

    useradd --no-create-home --shell /bin/false alertmanager
    tar zxvf alertmanager-0.23.0.linux-amd64.tar.gz
    mkdir /etc/alertmanager
    cp alertmanager-0.23.0.linux-amd64/{alertmanager,amtool} /usr/local/bin/
    cp alertmanager-0.23.0.linux-amd64/alertmanager.yml /etc/alertmanager/
    chown -R alertmanager:alertmanager /etc/alertmanager
    chown -R alertmanager:alertmanager /usr/local/bin/{alertmanager,amtool}

    cat > /etc/systemd/system/alertmanager.service <<EOF
    [Unit]
    Description=Prometheus Alertmanager
    Wants=network-online.target
    After=network-online.target

    [Service]
    User=alertmanager
    Group=alertmanager
    Type=simple
    ExecStart=/usr/local/bin/alertmanager \
        --config.file=/etc/alertmanager/alertmanager.yml \
        --storage.path=/etc/alertmanager/data/ \
        --data.retention=60h \
        --web.external-url http://0.0.0.0:9093
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target
    EOF

    systemctl daemon-reload
    systemctl enable alertmanager
    systemctl start alertmanager
Connecting Alertmanager and Prometheus
Step 1: configure the notification channels (email, WeChat, DingTalk, and so on)
The configuration file has four main sections: global, templates, route, and receivers.
global: settings that apply to Alertmanager as a whole

    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.qiye.aliyun.com:465'
      smtp_from: 'mjin@erongdu.com'
      smtp_auth_username: 'mjin@erongdu.com'
      smtp_auth_password: 'your-smtp-password'
      smtp_require_tls: false

route: how incoming alerts are handled; alerts are matched against the rules and dispatched accordingly

    route:
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
      receiver: 'email.test'

Sending alerts by email
The complete configuration:

    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.qiye.aliyun.com:465'
      smtp_from: 'mjin@erongdu.com'
      smtp_auth_username: 'mjin@erongdu.com'
      smtp_auth_password: 'your-smtp-password'
      smtp_require_tls: false
    route:
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
      receiver: 'email.test'
    receivers:
      - name: 'web.hook'
        webhook_configs:
          - url: 'http://127.0.0.1:5001/'
      - name: 'email.test'
        email_configs:
          - to: 'ops@erongdu.com'
            send_resolved: true
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']
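One way to check that the email receiver works without waiting for a real incident is to post a hand-made alert to Alertmanager's v2 API. The sketch below only builds the JSON body; the function name, the alert name ManualTest and the instance value are hypothetical.

```shell
#!/bin/sh
# Build a one-alert JSON payload for POST /api/v2/alerts.
# Arguments: alert name, severity.
test_alert_payload() {
    printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"test"}}]\n' "$1" "$2"
}

# Print a sample payload
test_alert_payload ManualTest warning
```

Assuming Alertmanager listens on its default port, the payload could then be sent with:

    test_alert_payload ManualTest warning | curl -s -XPOST -H 'Content-Type: application/json' --data-binary @- http://localhost:9093/api/v2/alerts

With the route above, a notification should follow once group_wait (30s) has elapsed.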
DingTalk alerts
DingTalk webhook plugin: https://github.com/timonwong/prometheus-webhook-dingtalk
Install

    tar zxvf prometheus-webhook-dingtalk-2.0.0.linux-amd64.tar.gz
    cp prometheus-webhook-dingtalk-2.0.0.linux-amd64/prometheus-webhook-dingtalk /usr/local/bin/

    vi /etc/prometheus-webhook-config.yml

    targets:
      webhook1:
        # Replace with the webhook URL of your own DingTalk robot
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxx

    cat > /etc/systemd/system/prometheus-webhook-dingtalk.service <<EOF
    [Unit]
    Description=prometheus-webhook-dingtalk
    After=network-online.target

    [Service]
    Restart=on-failure
    # The targets are read from the config file written above
    ExecStart=/usr/local/bin/prometheus-webhook-dingtalk --config.file=/etc/prometheus-webhook-config.yml

    [Install]
    WantedBy=multi-user.target
    EOF

    systemctl daemon-reload
    systemctl start prometheus-webhook-dingtalk

    # Send a test message through the webhook1 target
    curl -H "Content-Type: application/json" -d '{"version": "4", "status": "firing", "description": "description_content"}' http://localhost:8060/dingtalk/webhook1/send
Step 2: point Prometheus at Alertmanager

    vim /etc/prometheus/prometheus.yml

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - localhost:9093
Pushgateway
Pushgateway is typically used when Prometheus cannot scrape a node's metrics directly: the node pushes its data to the Pushgateway, and Prometheus then scrapes the Pushgateway.
Download

    wget https://github.com/prometheus/pushgateway/releases/download/v1.4.2/pushgateway-1.4.2.linux-amd64.tar.gz

Install and configure the systemd service

    tar zxvf pushgateway-1.4.2.linux-amd64.tar.gz
    cd pushgateway-1.4.2.linux-amd64
    cp pushgateway /usr/local/bin/

    cat > /etc/systemd/system/pushgateway.service <<EOF
    [Unit]
    Description=Prometheus Pushgateway
    Wants=network-online.target
    After=network-online.target

    [Service]
    Type=simple
    ExecStart=/usr/local/bin/pushgateway
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target
    EOF

    systemctl daemon-reload
    systemctl enable pushgateway
    systemctl start pushgateway
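Once the Pushgateway is running, a node pushes metrics by sending them in the Prometheus text format over HTTP. The sketch below builds such a payload for a single gauge; the function name, the metric name backup_last_duration_seconds and the job name used afterwards are hypothetical.

```shell
#!/bin/sh
# Print a gauge in the Prometheus text exposition format.
# Arguments: metric name, value.
pushgateway_payload() {
    printf '# TYPE %s gauge\n%s %s\n' "$1" "$1" "$2"
}

# Print a sample payload
pushgateway_payload backup_last_duration_seconds 42
```

Assuming the Pushgateway listens on its default port 9091, the payload could be pushed under a job named backup with:

    pushgateway_payload backup_last_duration_seconds 42 | curl --data-binary @- http://localhost:9091/metrics/job/backup

Prometheus still needs a scrape target for localhost:9091 before the pushed values show up in queries.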
Installing and configuring Grafana
Grafana provides the graphical dashboards.
Download and install

    wget https://dl.grafana.com/oss/release/grafana-8.4.1-1.x86_64.rpm
    yum localinstall grafana-8.4.1-1.x86_64.rpm -y
    systemctl start grafana-server

Default login: admin/admin