Prometheus Technical Overview

Foreword

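Before learning any new technology, it is worth answering the following six questions (the 左耳朵耗子 learning template):

  • What background did this technology emerge from, what was its original intent, and what goal is it meant to reach or what problem is it meant to solve?
  • What are its strengths and weaknesses, in other words its trade-offs? Every technology has pros and cons, and solving one problem tends to introduce new ones.
  • Which scenarios does it fit (business or technical)?
  • What are its components and key points?
  • What are its underlying principles and key implementation details?
  • How does it compare with existing implementations?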

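I have recently been working on proxies and need monitoring and alerting to keep them stable. My requirement is simple: monitor whether 500+ hosts are alive (checked via ping or SSH) and alert immediately when something goes wrong.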

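The official site already explains everything clearly, so there is no need to repeat it. Here I only record my own usage and the information I care about most.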

Resources

Environment

  • prometheus 2.3.1
  • blackbox_exporter 0.12.0
  • alertmanager 0.15.1
  • centos 7

Notices

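  • Prometheus is both CPU-intensive (queries) and I/O-intensive (persisting data to disk). The more CPU cores the better, and the more memory the better (memory caches the scraped data, so avoid exporting unnecessary business metrics). Use SSDs whenever possible (this is critical!), because once Prometheus's memory usage reaches its threshold, it stops scraping. That pause lasts at least several minutes and may even be unrecoverable. So if you have the option, use SSDs.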

Prometheus

Install

Test-scenario deployment

Config prometheus.yml

For installing blackbox_exporter and alertmanager, see two other posts: 《alertmanager之初体验》 (a first look at alertmanager) and 《blackbox_exporter之安装》 (installing blackbox_exporter).

# There are many machines, so use Prometheus file-based service discovery and keep all targets in a separate file.
$ vim /opt/machine.json
[
  {"targets":["122.227.184.61:20065"],"labels":{"machineName":"guangzhou77d9"}},
  {"targets":["122.227.184.61:20063"],"labels":{"machineName":"guangzhou77d8"}},
  ...
]
# Configure prometheus.yml
$ vim /opt/prometheus-2.3.1/prometheus.yml
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["10.1.1.26:9093"] # Alertmanager address.
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/opt/proxy_rules.yml" # Rules file.
# Scrape configuration: probe every machine through blackbox_exporter.
scrape_configs:
  - job_name: 'machine_heart'
    metrics_path: /probe
    params:
      module: [ssh_banner]
    file_sd_configs:
      - files: ['/opt/machine.json'] # Machine address list file.
        refresh_interval: 5m
    relabel_configs:
      - source_labels: [__address__]
        regex: (.*)
        replacement: ${1}
        target_label: __param_target # Becomes the "target" query parameter of the request to http://10.1.1.26:9115
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.1.1.26:9115 # blackbox_exporter address.
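
The relabeling rules are the least obvious part of this config, so here is a minimal Python sketch of what they do to one discovered target. It is only a model for illustration, not Prometheus code: the function names `apply_relabel` and `probe_url` are made up here, and the query-string encoding is an assumption about how the final request looks.

```python
import re
from urllib.parse import urlencode

BLACKBOX_ADDR = "10.1.1.26:9115"  # blackbox_exporter address from prometheus.yml


def apply_relabel(labels):
    """Mimic the three relabel_configs rules on one target's label set."""
    labels = dict(labels)
    # Rule 1: copy __address__ into __param_target (regex (.*), replacement ${1}).
    match = re.fullmatch(r"(.*)", labels["__address__"])
    labels["__param_target"] = match.group(1)
    # Rule 2: expose the probed address as the instance label.
    labels["instance"] = labels["__param_target"]
    # Rule 3: point the actual scrape at blackbox_exporter.
    labels["__address__"] = BLACKBOX_ADDR
    return labels


def probe_url(labels, module="ssh_banner", metrics_path="/probe"):
    """Build the URL requested for a relabeled target (hypothetical helper)."""
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}{metrics_path}?{query}"


# One entry from /opt/machine.json, written as the labels Prometheus starts with.
discovered = {"__address__": "122.227.184.61:20065", "machineName": "guangzhou77d9"}
relabeled = apply_relabel(discovered)
print(relabeled["instance"])
print(probe_url(relabeled))
```

The point of the three rules is that the machine's own address ends up in the `instance` label and in the `target` parameter, while the HTTP request itself always goes to blackbox_exporter.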

Config rules.yml

$ vim /opt/proxy_rules.yml
groups:
  - name: machine_heart
    rules:
      - alert: 'ssh'
        expr: probe_success{job="machine_heart"} < 1
        for: 1m
        labels:
          severity: critical
          # team: 'proxy-team'
        annotations:
          summary: "Instance {{ $labels.instance }} down."
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
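
To make the effect of `for: 1m` combined with `evaluation_interval: 15s` concrete, here is a simplified Python model of the alert's state over successive rule evaluations. It is a sketch under assumptions: `alert_states` is a hypothetical helper, it treats one sample per evaluation, and it ignores details of the real rule evaluator such as staleness handling.

```python
def alert_states(samples, eval_interval=15, hold=60):
    """Return the alert state after each rule evaluation.

    samples: probe_success values (1 = up, 0 = down), one per evaluation.
    eval_interval and hold are in seconds (15s and 1m in the config above).
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, value in enumerate(samples):
        now = i * eval_interval
        if value < 1:  # expr: probe_success{job="machine_heart"} < 1
            if active_since is None:
                active_since = now
            # The alert fires only once the condition has held for `hold` seconds.
            states.append("firing" if now - active_since >= hold else "pending")
        else:
            active_since = None  # condition cleared, so the timer resets
            states.append("inactive")
    return states


# A host that goes down at the second evaluation and recovers at the last one.
print(alert_states([1, 0, 0, 0, 0, 0, 1]))
```

In this model a host must fail five consecutive evaluations before the alert fires, so a single failed probe only leaves the alert in the pending state and no notification reaches alertmanager.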

Config Systemd

$ vim /usr/lib/systemd/system/prometheus.service
[Unit]
Description=prometheus.service
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
ExecStart=/opt/prometheus-2.3.1/prometheus --config.file=/opt/prometheus-2.3.1/prometheus.yml
Restart=on-failure
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s QUIT $MAINPID
[Install]
WantedBy=multi-user.target
# Enable at boot and start
$ systemctl enable prometheus && systemctl start prometheus

Standard deployment

# Download links are available on the official Prometheus site
$ cd /opt
$ wget https://github.com/prometheus/prometheus/releases/download/v2.3.2/prometheus-2.3.2.linux-amd64.tar.gz
$ tar -xvf prometheus-2.3.2.linux-amd64.tar.gz
# Rename
$ mv prometheus-2.3.2.linux-amd64 prometheus-2.3.2

Docker deployment

Ansible-playbook deployment

Test Prometheus
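
One possible way to check the setup end to end is to ask Prometheus, through its HTTP query API (`/api/v1/query`), which probes are currently failing. The sketch below assumes Prometheus listens on `localhost:9090` (the default port, not stated in this post), and the helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # assumption: default listen address


def query_url(expr, base=PROMETHEUS_URL):
    """Build an instant-query URL for the given PromQL expression."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def down_instances(response):
    """Extract instance labels from a parsed instant-query response."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response}")
    results = response["data"]["result"]
    return sorted(item["metric"].get("instance", "") for item in results)


def fetch_down_instances(job="machine_heart"):
    """Query Prometheus for targets whose last probe failed."""
    expr = f'probe_success{{job="{job}"}} < 1'
    with urlopen(query_url(expr), timeout=10) as resp:
        return down_instances(json.load(resp))
```

Calling `fetch_down_instances()` on the Prometheus host returns the list of `instance` values for which the same expression used in the alert rule currently matches. An empty list means every probe succeeded.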

Related components
