Prometheus Technical Overview

Foreword

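Before learning any new technology, it is worth answering the following six questions (左耳朵耗子's learning template):

- What background and motivation gave rise to this technology, and what goal does it aim for or what problem does it solve?
- What are its strengths and weaknesses, that is, what are its trade-offs? Every technology has upsides and downsides, and solving one problem tends to introduce new ones.
- Which scenarios (business or technical) does it fit?
- What are its components and key points?
- What are its underlying principles and key implementation details?
- How does it compare with existing implementations?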

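I have been working on proxies recently and need monitoring and alerting to keep them stable. My requirement is simple: check whether 500+ hosts are alive (by probing them with ping or SSH) and alert immediately when something goes wrong.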

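The official site already explains everything clearly, so there is no need to repeat it here. This post only records my own usage and the information I care most about.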

Resources

Official website
Blogs
- https://en.fabernovel.com/insights/tech-en/alerting-in-prometheus-or-how-i-can-sleep-well-at-night
- Yunlong blog
- https://www.slideshare.net/leecalcote/understanding-and-extending-prometheus-alertmanager
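- Understanding the Prometheus Federation feature
- [Building a slick monitoring platform with Prometheus + Grafana](http://blog.51cto.com/youerning/2050543)
- Unofficial Chinese Prometheus handbook
- Monitoring distributed online services with Prometheus in practice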

Environment

prometheus 2.3.1
blackbox_exporter 0.12.0
alertmanager 0.15.1
centos 7

Notices

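Prometheus is both CPU-intensive (queries) and I/O-intensive (persisting data). The more CPU cores the better, and the more memory the better (memory caches scraped data, so avoid exporting unnecessary business metrics). Use SSDs whenever possible (this is critical!): once Prometheus's memory usage reaches its threshold, it stops scraping. That pause lasts at least minutes, and sometimes Prometheus never recovers. So if you can use SSDs, do.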

Prometheus

Install

Test deployment scenario

Config prometheus.yml

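For installing blackbox_exporter and alertmanager, see the two companion posts 《alertmanager之初体验》 (First Experience with alertmanager) and 《blackbox_exporter之安装》 (Installing blackbox_exporter).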

# With this many machines, use Prometheus file-based service discovery and keep all targets in a separate file.
$ vim /opt/machine.json
[
  {"targets":["122.227.184.61:20065"],"labels":{"machineName":"guangzhou77d9"}},
  {"targets":["122.227.184.61:20063"],"labels":{"machineName":"guangzhou77d8"}},
...
]
# Configure prometheus.yml
$ vim /opt/prometheus-2.3.1/prometheus.yml
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["10.1.1.26:9093"]  # Alertmanager address
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/opt/proxy_rules.yml"  # Rule file
# Scrape configuration: every machine is probed through blackbox_exporter
# rather than scraped directly.
scrape_configs:
  - job_name: 'machine_heart'
    metrics_path: /probe
    params:
      module: [ssh_banner]
    file_sd_configs:
    - files: ['/opt/machine.json']   # File listing the machine addresses
      refresh_interval: 5m
    relabel_configs:
    - source_labels: [__address__]
      regex: (.*)
      replacement: ${1}
      target_label: __param_target  # Sent to blackbox_exporter (http://10.1.1.26:9115) as the `target` query parameter
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 10.1.1.26:9115  # blackbox_exporter address
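
To make the relabeling concrete, here is a minimal Python sketch of what the three `relabel_configs` rules do to one entry from /opt/machine.json. It is only an illustration, not Prometheus's implementation: the helper names `relabel` and `scrape_url` are hypothetical, and the order of query parameters in the real request may differ.

```python
from urllib.parse import urlencode


def relabel(target, labels, exporter="10.1.1.26:9115"):
    """Mimic the three relabel rules for a single file_sd target (illustrative only)."""
    result = dict(labels)
    # Service discovery puts the discovered target into __address__.
    result["__address__"] = target
    # Rule 1: copy __address__ into __param_target (becomes ?target=... on the request).
    result["__param_target"] = result["__address__"]
    # Rule 2: keep the real host as the `instance` label.
    result["instance"] = result["__param_target"]
    # Rule 3: point the scrape at blackbox_exporter instead of the host itself.
    result["__address__"] = exporter
    return result


def scrape_url(labels, metrics_path="/probe", module="ssh_banner"):
    """Build the URL that would be requested for the relabeled target."""
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}{metrics_path}?{query}"


labels = relabel("122.227.184.61:20065", {"machineName": "guangzhou77d9"})
print(scrape_url(labels))
```

The point of the rewrite is that Prometheus never connects to the monitored machine itself: it asks blackbox_exporter to probe the machine and records the result under the machine's own `instance` label.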

Config rules.yml

$ vim /opt/proxy_rules.yml
groups:
  - name: machine_heart
    rules:
    - alert: 'ssh'
      expr: probe_success{job="machine_heart"} < 1
      for: 1m
      labels:
        severity: critical
        # team: 'proxy-team'
      annotations:
        summary: "Instance {{ $labels.instance }} down."
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
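
The `for: 1m` clause means a single failed probe does not trigger a notification: the alert first becomes pending and only fires once the expression has stayed true for a full minute. The sketch below is a simplified model of that behaviour, assuming the 15s `evaluation_interval` from prometheus.yml and ignoring missing samples; `alert_states` is a hypothetical helper, not a Prometheus API.

```python
from datetime import timedelta


def alert_states(samples, interval=timedelta(seconds=15), hold=timedelta(minutes=1)):
    """Return the alert state after each rule evaluation.

    samples: probe_success values (1 = up, 0 = down), one per evaluation.
    """
    states = []
    active_since = None        # elapsed time at which the expression first matched
    elapsed = timedelta(0)
    for value in samples:
        if value < 1:          # expr: probe_success < 1
            if active_since is None:
                active_since = elapsed
            # Fire only once the condition has held for the whole `for` duration.
            states.append("firing" if elapsed - active_since >= hold else "pending")
        else:
            active_since = None   # any successful probe resets the timer
            states.append("inactive")
        elapsed += interval
    return states


print(alert_states([1, 0, 0, 0, 0, 0, 0]))
```

In this model the alert fires one minute after the first failing evaluation, and a single successful probe in between sends it back to inactive, which filters out brief network blips.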

Config Systemd

$ vim /usr/lib/systemd/system/prometheus.service
[Unit]
Description=prometheus.service
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
ExecStart=/opt/prometheus-2.3.1/prometheus --config.file=/opt/prometheus-2.3.1/prometheus.yml
Restart=on-failure
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID
[Install]
WantedBy=multi-user.target
# Reload systemd units, then enable at boot and start
$ systemctl daemon-reload
$ systemctl enable prometheus && systemctl start prometheus
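
Once the service is running, one way to check the setup end to end is to ask the Prometheus HTTP API for the current `probe_success` values. The following is a minimal sketch under some assumptions: Prometheus is taken to listen on `http://localhost:9090` (its default port; the actual host is not stated in this post), and `query_url`, `down_instances`, and `fetch_down_instances` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url, expr):
    """Build an instant-query URL for the /api/v1/query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def down_instances(payload):
    """Return the sorted `instance` labels whose sample value is below 1."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return sorted(
        series["metric"].get("instance", "")
        for series in payload["data"]["result"]
        if float(series["value"][1]) < 1   # value is [timestamp, "string value"]
    )


def fetch_down_instances(base_url="http://localhost:9090",   # assumed address
                         expr='probe_success{job="machine_heart"}'):
    """Query a running Prometheus and list the targets that are currently down."""
    with urlopen(query_url(base_url, expr), timeout=10) as resp:
        return down_instances(json.load(resp))
```

Calling `fetch_down_instances()` against a live server should return an empty list when every machine answers the SSH probe.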

Regular deployment

# Download links are available on the official Prometheus website
$ cd /opt
$ wget https://github.com/prometheus/prometheus/releases/download/v2.3.2/prometheus-2.3.2.linux-amd64.tar.gz
$ tar -xvf prometheus-2.3.2.linux-amd64.tar.gz
# Rename
$ mv prometheus-2.3.2.linux-amd64 prometheus-2.3.2
