Getting Started with Gluster

Gluster Technical Overview

Notice

XFS is the recommended filesystem.

○ Typically, XFS is recommended but it can be used with other filesystems as well. Most commonly EXT4 is used when XFS isn’t, but you can (and many, many people do) use another filesystem that suits you.

☆ XFS is the recommended filesystem; other filesystems such as EXT4 also work.
Correct DNS entries (forward and reverse) and NTP are essential.

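☆ DNS usually needs no special configuration; the defaults are fine. NTP means the clocks of all machines must be synchronized: the nodes should be in the same time zone and keep their clocks in sync. There are many ways to synchronize; just use the same one everywhere.
Firewalls are great, except when they aren’t. In case you absolutely need to set up a firewall, have a look at Setting up clients for information on the ports used.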

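☆ I do not recommend running a firewall between Gluster nodes. If one is really necessary, I configure it with IP-level rules, which keeps the complexity down.
2 CPUs, 2 GB of RAM, 1 GbE (gigabit Ethernet)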

☆ This is the minimum configuration for a server.

Client

Gluster Native Client

○ The Gluster Native Client is a FUSE-based client running in user space. Gluster Native Client is the recommended method for accessing volumes when high concurrency and high write performance is required.

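☆ This is the recommended method. It is built on the FUSE support provided by the kernel and performs better under high concurrency and heavy write loads.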

NFS Client

Foreword

What

○ GlusterFS is a scalable network filesystem suitable for data-intensive tasks such as cloud storage and media streaming. GlusterFS is free and open source software and can utilize common off-the-shelf hardware.

GlusterFS isn’t really a filesystem in and of itself. It concatenates existing filesystems into one (or more) big chunks so that data being written into or read out of Gluster gets distributed across multiple hosts simultaneously

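☆ Gluster is open-source, scalable software for managing distributed data storage. It is not a filesystem in itself; it only provides the glue that assembles filesystems spread across hosts into one larger file storage system.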

Concept

TSP

○ A trusted storage pool(TSP) is a trusted network of storage servers. Before you can configure a GlusterFS volume, you must create a trusted storage pool of the storage servers that will provide bricks to the volume by peer probing the servers. The servers in a TSP are peers of each other.

☆ Gluster uses the TSP (trusted storage pool) to determine which machines can provide storage services.
Brick

○ A brick is used to refer to any device (really this means filesystem) that is being used for Gluster storage.

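☆ The unit of storage that Gluster uses. On Linux, fdisk -l is commonly used to list the disks attached to a machine; to Gluster, a brick plays the role of such a disk. The brick is simply bound to a disk in a one-to-one mapping.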
Gluster volume

○ A Gluster volume is a collection of one or more bricks (of course, typically this is two or more). This is analogous to /etc/exports entries for NFS.

☆ A collection of bricks.
Global Namespace

○ The term Global Namespace is a fancy way of saying a Gluster volume.

☆ Another name for a Gluster volume.
Export

○ An export refers to the mount path of the brick(s) on a given server, for example, /export/brick1.

☆ I do not fully understand this yet; to be expanded.
GNFS and kNFS

○ GNFS is how we refer to our inline NFS server. kNFS stands for kernel NFS, or, as most people would say, just plain NFS. Most often, you will want kNFS services disabled on the Gluster nodes. Gluster NFS doesn’t take any additional configuration and works just like you would expect with NFSv3. It is possible to configure Gluster and NFS to live in harmony if you want to.

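☆ GNFS is the NFS server built into Gluster. Once the glusterd daemon is running, data in a volume can be accessed through GNFS over NFSv3. kNFS is the NFS service of the Linux kernel. kNFS is usually disabled on Gluster nodes, although the two can be configured to coexist.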

How

Install

For CentOS, see the article "Gluster Installation" (<<Gluster安装>>). See also: installation on other systems.

System Packaged version

Package versions and dependencies for each system: Packages

Manage Trusted Storage Pool

○ The firewall on the servers must be configured to allow access to port 24007.

☆ Storage nodes communicate with each other on port 24007.

Assume four machines, server1, server2, server3 and server4, belong to one TSP.

Add to trusted storage pool

Run on any one of the machines.

gluster peer probe

$ gluster peer probe <server>
Probe successful
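With the four example servers, the probe has to be issued once per remote peer. The following is a minimal sketch, assuming it runs on server1 and that the hostnames resolve. The helper name `probe_commands` is made up for illustration; it only prints the commands, so nothing changes until the output is piped to a shell.

```shell
#!/bin/sh
# Hypothetical helper: print one `gluster peer probe` command per remote peer.
# The first argument is the local host, which is skipped because a node
# does not need to probe itself.
probe_commands() {
    local_host="$1"
    shift
    for host in "$@"; do
        if [ "$host" != "$local_host" ]; then
            echo "gluster peer probe $host"
        fi
    done
}

# Dry run for the four example servers, as seen from server1.
probe_commands server1 server1 server2 server3 server4
```

To apply the commands rather than just list them, the output could be piped to `sh` on server1.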

List Servers

Run on any one of the machines.

gluster pool list

# Assume this is run on server1.
$ gluster pool list
UUID                                    Hostname        State
d18d36c5-533a-4541-ac92-c471241d5418    localhost       Connected
5e987bda-16dd-43c2-835b-08b7d55e94e5    server2         Connected
1e0ca3aa-9ef7-4f66-8f15-cbc348f29ff7    server3         Connected
3e0cabaa-9df7-4f66-8e5d-cbc348f29ff7    server4         Connected

View peer status

gluster peer status

# Assume this is run on server1.
$ gluster peer status
Hostname: server2
Uuid: 5e987bda-16dd-43c2-835b-08b7d55e94e5
State: Peer in Cluster (Connected)
Hostname: server3
Uuid: 1e0ca3aa-9ef7-4f66-8f15-cbc348f29ff7
State: Peer in Cluster (Connected)
Hostname: server4
Uuid: 3e0cabaa-9df7-4f66-8e5d-cbc348f29ff7
State: Peer in Cluster (Connected)
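Output in this layout is easy to scan for peers that have dropped out. The sketch below assumes the `Hostname:` / `State:` line format shown above and reads the text from standard input; `disconnected_peers` is a hypothetical helper name, and the sample input marks server3 as disconnected purely for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `gluster peer status` text on stdin and print
# the hostname of every peer whose State line does not end in "(Connected)".
disconnected_peers() {
    awk '
        /^Hostname:/ { host = $2 }
        /^State:/    { if ($0 !~ /\(Connected\)$/) print host }
    '
}

# Sample input in the same layout as the output above.
disconnected_peers <<'EOF'
Hostname: server2
Uuid: 5e987bda-16dd-43c2-835b-08b7d55e94e5
State: Peer in Cluster (Connected)
Hostname: server3
Uuid: 1e0ca3aa-9ef7-4f66-8f15-cbc348f29ff7
State: Peer in Cluster (Disconnected)
EOF
```

On a real node the input would come from `gluster peer status | disconnected_peers`.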

Removing Servers

# gluster peer detach <server>

Brick Naming Conventions

/data/glusterfs///brick

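The first path component after /data/glusterfs is an alias for the Linux disk that is bound. For example, after binding the disk /dev/sdb we can name it test (the type of environment), which makes it clear which environment the disk belongs to.

The second component can be named freely; I distinguish by business. For example, es means the disk is used by the search engine service and logs means it is used for logs. Different services may need different disk performance, which can be handled by splitting them across multiple disks.

To understand the brick naming convention, you first need to understand what a brick is. On Linux, fdisk -l is commonly used to list the disks attached to a machine; to Gluster, a brick plays the role of such a disk. The brick is simply bound to a disk in a one-to-one mapping.

For example:

Take a physical disk /dev/sdb that I use in the test environment for ordinary business workloads, such as storing logs.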

$ mkdir -p /data/glusterfs/test/biz
$ mount /dev/sdb /data/glusterfs/test/biz
$ gluster volume create test replica 2 server{1..4}:/data/glusterfs/test/biz/brick
# The volume must also be started before it can be used
$ gluster volume start test

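This raises a question: why are bricks needed at all?

Suppose server1 has two disks, /dev/sda and /dev/sdb, while server2 has /dev/sdb1 and /dev/sdb2. The device names differ, so bricks are used to hide the differences in device names and performance between the underlying disks.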

$ mkdir -p /data/glusterfs/test/biz
# server1
$ mount /dev/sda /data/glusterfs/test/biz
# server2
$ mount /dev/sdb1 /data/glusterfs/test/biz
$ gluster volume create test replica 2 server{1..4}:/data/glusterfs/test/biz/brick
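Because every server mounts its own device at the same path, the brick arguments can be generated from two names. The sketch below assumes the layout used in the examples above (environment name, then business name, then a `brick` subdirectory); `brick_path` is a hypothetical helper, not part of the gluster CLI.

```shell
#!/bin/sh
# Hypothetical helper: build the brick directory from an environment name
# and a business name, following the layout used in the examples above.
brick_path() {
    env_name="$1"
    biz_name="$2"
    echo "/data/glusterfs/${env_name}/${biz_name}/brick"
}

# Print the brick argument for each of the four example servers.
for n in 1 2 3 4; do
    echo "server${n}:$(brick_path test biz)"
done
```

Keeping the brick in a subdirectory below the mount point, rather than using the mount point itself, means the directory is absent when the disk is not mounted, so data is less likely to land on the root filesystem by accident.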

Formatting and Mounting Bricks

To be completed. This mainly involves Linux volume concepts: LV (logical volume), VG (volume group), and PV (physical volume).

https://wiki.archlinux.org/index.php/LVM

https://linux.cn/article-5117-1.html

https://askubuntu.com/questions/417642/logical-volume-physical-volume-and-volume-groups

Set ACL

To be completed. This mainly covers Linux ACLs and how they are used with Gluster.

Volume Types

○ A volume is a logical collection of bricks.

The volume types that Gluster provides are listed below.

Distributed - Distributed volumes distribute files across the bricks in the volume. You can use distributed volumes where the requirement is to scale storage and the redundancy is either not important or is provided by other hardware/software layers.
Replicated - Replicated volumes replicate files across bricks in the volume. You can use replicated volumes in environments where high-availability and high-reliability are critical.
Distributed Replicated - Distributed replicated volumes distribute files across replicated bricks in the volume. You can use distributed replicated volumes in environments where the requirement is to scale storage and high-reliability is critical. Distributed replicated volumes also offer improved read performance in most environments.
Dispersed - Dispersed volumes are based on erasure codes, providing space-efficient protection against disk or server failures. It stores an encoded fragment of the original file to each brick in a way that only a subset of the fragments is needed to recover the original file. The number of bricks that can be missing without losing access to data is configured by the administrator on volume creation time.

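☆ The key here is the erasure code algorithm. Erasure code (EC) here refers to Reed-Solomon erasure coding built on a Vandermonde matrix.

Lost data can be recovered from a relatively small amount of redundant data. Compared with replica, which stores a full extra copy of everything, this approach saves space.

I recommend reading this article: drdr.xp Blog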
Distributed Dispersed - Distributed dispersed volumes distribute files across dispersed subvolumes. This has the same advantages of distribute replicate volumes, but using disperse to store the data into the bricks.
Striped [Deprecated], Distributed Striped [Deprecated], Distributed Striped Replicated [Deprecated], Striped Replicated [Deprecated]

☆ The list above contains three basic volume types (Distributed, Replicated, Dispersed) and two combined types (Distributed Replicated, Distributed Dispersed).
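The space trade-off between the three basic types can be made concrete with a little arithmetic. This is a rough sketch that assumes all bricks have the same size and ignores filesystem overhead; the function names and the 6 x 100 GB figures are made up for illustration.

```shell
#!/bin/sh
# Usable capacity in GB, assuming equally sized bricks.

# Distributed: every brick holds unique data.
# Arguments: bricks, brick size in GB.
distributed_capacity() { echo $(( $1 * $2 )); }

# Replicated: each file is stored `replica` times.
# Arguments: bricks, brick size in GB, replica count.
replicated_capacity() { echo $(( $1 * $2 / $3 )); }

# Dispersed: `redundancy` bricks' worth of space goes to erasure-code data.
# Arguments: bricks, brick size in GB, redundancy.
dispersed_capacity() { echo $(( ($1 - $3) * $2 )); }

echo "distributed:            $(distributed_capacity 6 100) GB"
echo "replica 3:              $(replicated_capacity 6 100 3) GB"
echo "disperse 6 redundancy 2: $(dispersed_capacity 6 100 2) GB"
```

With six 100 GB bricks, replica 3 leaves 200 GB usable, while disperse 6 with redundancy 2 leaves 400 GB and still survives the loss of two bricks.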

Create Command

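stripe is deprecated, so the create command currently has only two type keywords, replica and disperse (a plain distributed volume needs neither).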

# gluster volume create [stripe | replica | disperse] [transport tcp | rdma | tcp,rdma]
  volume  create  <NEW-VOLNAME> [stripe <COUNT>] [replica <COUNT>] [disperse
  [<COUNT>]] [redundancy <COUNT>] [transport <tcp|rdma|tcp,rdma>] <NEW-BRICK>
  ...
  Create a new volume of the specified type using the specified bricks
  and transport type (the default transport type is tcp).  To create a
  volume   with  both  transports  (tcp  and  rdma),  give  'transport
  tcp,rdma' as an option.

Distributed

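Advantages

Space-efficient and easy to scale.
Disadvantages

Risk of data loss: the data has no redundancy, so it is lost as soon as a machine fails.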

Note: Make sure you start your volumes before you try to mount them or else client operations after the mount will hang.

# gluster volume create  [transport tcp | rdma | tcp,rdma]
# If the transport type is not specified, tcp is used as the default.
$ gluster volume create test-volume server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4
$ gluster volume info
Volume Name: test-volume
Type: Distribute
Status: Created
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: server1:/exp1
Brick2: server2:/exp2
Brick3: server3:/exp3
Brick4: server4:/exp4

Replicated

Advantages

Data is redundant, so it remains available even if one copy is lost.
Disadvantages

Consumes more storage space.

Note:

Make sure you start your volumes before you try to mount them or else client operations after the mount will hang.

GlusterFS will fail to create a replicate volume if more than one brick of a replica set is present on the same peer. For eg. a four node replicated volume where more than one brick of a replica set is present on the same peer.

☆ With this volume type, the bricks of one replica set must not be on the same machine. For example:

$ gluster volume create <volname> replica 4 server1:/brick1 server1:/brick2 server2:/brick3 server4:/brick4
volume create: <volname>: failed: Multiple bricks of a replicate volume are present on the same server. This setup is not optimal. Use 'force' at the end of the command if you want to override this behavior.

Here both server1:/brick1 and server1:/brick2 are specified, so the command fails: two copies of the same data must not be placed on the same machine.
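A brick list can be checked for this problem before it is handed to gluster. The sketch below is a hypothetical pre-check, not a gluster feature; it assumes that each group of `replica` consecutive bricks forms one replica set, in command-line order.

```shell
#!/bin/sh
# Hypothetical pre-check: given a replica count followed by bricks in
# host:/path form, report every replica set that uses a host more than once.
# Returns non-zero when at least one such set is found.
check_replica_hosts() {
    replica="$1"
    shift
    printf '%s\n' "$@" | awk -F: -v r="$replica" '
        {
            grp = int((NR - 1) / r) + 1
            if (seen[grp, $1]++) {
                print "replica set " grp ": " $1 " used more than once"
                bad = 1
            }
        }
        END { exit bad }
    '
}

# The brick list from the rejected command above.
check_replica_hosts 4 server1:/brick1 server1:/brick2 server2:/brick3 server4:/brick4 \
    || echo "layout rejected"
```

Listing the bricks so that each replica set spans different hosts makes the check pass.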

# gluster volume create  [replica ] [transport tcp | rdma | tcp,rdma]
# If the transport type is not specified, tcp is used as the default.
$ gluster volume create test-volume replica 2 transport tcp server1:/exp1 server2:/exp2

Dispersed

Advantages

Like replica, data is highly available, and less storage is consumed.
Disadvantages

None found so far.

○ Dispersed volumes are based on erasure codes.

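☆ Based on erasure coding. Unlike replica, it keeps data highly available while storing far less redundant data.

In distributed systems, three replicas are typically chosen to keep data highly available; the expected durability is roughly eleven nines or better (a 99.999999999% probability of not losing data). Industry reports support this figure: see the hard drive failure rate statistics published by Backblaze.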

# redundancy can normally be omitted; the command prompts for a value at creation time.
# gluster volume create [disperse [<count>]] [redundancy <count>] [transport tcp | rdma | tcp,rdma]
# As below: the command prompts for the redundancy to use.
$ gluster volume create test-volume disperse 4 server{1..4}:/bricks/test-volume
There isn't an optimal redundancy value for this configuration. Do you want to create the volume with redundancy 1 ? (y/n)
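The prompt above can be explained by two rules from the Gluster documentation, as I understand them: redundancy must be at least 1 and the brick count must be greater than twice the redundancy, and a layout counts as optimal when the number of data bricks (bricks minus redundancy) is a power of two. The sketch below lists the options under those assumed rules; the function names are hypothetical.

```shell
#!/bin/sh
# True when the argument is a power of two (1, 2, 4, 8, ...).
is_power_of_two() {
    n="$1"
    [ "$n" -ge 1 ] || return 1
    while [ $(( n % 2 )) -eq 0 ]; do
        n=$(( n / 2 ))
    done
    [ "$n" -eq 1 ]
}

# List every redundancy value allowed for a brick count (bricks > 2 * r)
# and mark the ones that leave a power-of-two number of data bricks.
redundancy_options() {
    bricks="$1"
    r=1
    while [ $(( 2 * r )) -lt "$bricks" ]; do
        data=$(( bricks - r ))
        if is_power_of_two "$data"; then
            note="optimal"
        else
            note="not optimal"
        fi
        echo "redundancy $r: $data data bricks ($note)"
        r=$(( r + 1 ))
    done
}

redundancy_options 4
redundancy_options 6
```

For 4 bricks the only allowed value is redundancy 1, which leaves 3 data bricks, so no optimal choice exists. For 6 bricks, redundancy 2 leaves 4 data bricks and is marked optimal.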

Distributed Replicated

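Advantages

Data is redundant, so it remains available even if one copy is lost.
Disadvantages

Unlike a plain replicated volume, the replica set that holds a given file is chosen by hashing, so placement is harder to predict and manage. Storage consumption is also high.

Note: Make sure you start your volumes before you try to mount them or else client operations after the mount will hang.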

GlusterFS will fail to create a distribute replicate volume if more than one brick of a replica set is present on the same peer. For eg. for a four node distribute (replicated) volume where more than one brick of a replica set is present on the same peer.

☆ With this volume type, too, the bricks of one replica set must not be on the same machine, or the command fails.

$ gluster volume create <volname> replica 4 server1:/brick1 server1:/brick2 server2:/brick3 server4:/brick4
volume create: <volname>: failed: Multiple bricks of a replicate volume are present on the same server. This setup is not optimal. Use 'force' at the end of the command if you want to override this behavior.

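Here both server1:/brick1 and server1:/brick2 are specified, so the command fails: two copies of the same data must not be placed on the same machine.

Judging by the command, Replicated and Distributed Replicated are created the same way; the only difference is that more bricks (nodes) are listed than the replica count.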

# gluster volume create [replica ] [transport tcp | rdma | tcp,rdma]
# The result of this command is as shown in the figure above.
$ gluster volume create test-volume replica 2 transport tcp server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4
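# With 6 bricks and replica 2, consecutive bricks form 3 replica pairs (server1/server2, server3/server4, server5/server6); each file is hashed to one pair and stored on both of its bricks.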
$ gluster volume create test-volume replica 2 transport tcp server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4 server5:/exp5 server6:/exp6
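The order of the bricks on the command line decides which bricks mirror each other: each group of `replica` consecutive bricks becomes one replica set. The sketch below only prints that grouping so a layout can be reviewed before the volume is created; `replica_sets` is a hypothetical helper, and leftover bricks that do not fill a complete set are not printed.

```shell
#!/bin/sh
# Hypothetical helper: given a replica count followed by the brick list,
# print the bricks that would end up in the same replica set, assuming
# consecutive bricks are grouped together.
replica_sets() {
    replica="$1"
    shift
    printf '%s\n' "$@" | awk -v r="$replica" '
        (NR - 1) % r == 0 { line = "replica set " (int((NR - 1) / r) + 1) ":" }
        { line = line " " $0 }
        NR % r == 0 { print line }
    '
}

# The six-brick example above with replica 2.
replica_sets 2 server1:/exp1 server2:/exp2 server3:/exp3 \
               server4:/exp4 server5:/exp5 server6:/exp6
```

For the six-brick command this prints three sets: server1 with server2, server3 with server4, and server5 with server6.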