
High Availability of CLup Itself

Overview

CLup itself also supports a one-primary, two-standby high-availability mode: deploy the CLup service on three machines (usually in different server rooms). The high-availability scenarios are as follows:

No. | Failed node(s) | Surviving node(s) | Service available | Notes
1 | 1 standby | 1 primary + 1 standby | Yes | The standby repairs itself automatically once its CLup service is restarted
2 | 2 standbys | 1 primary | Yes | The standbys repair themselves automatically once their CLup services are restarted
3 | 1 primary | 2 standbys | Yes | One standby is automatically promoted to primary
4 | 1 primary + 1 standby | 1 standby | No | If the primary machine itself is undamaged, boot it and start the CLup service to recover

The demonstration environment is as follows:

No. | IP | Role
1 | 10.198.170.11 | primary
2 | 10.198.170.12 | standby
3 | 10.198.170.13 | standby

CLup High Availability Deployment
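The steps below copy clup.conf between machines with scp. As an optional convenience, and assuming root SSH logins between the machines are permitted (an assumption, not a requirement of this manual), setting up key-based SSH from the primary first avoids password prompts during those copy steps:

  # On 10.198.170.11: generate a key (skip if one already exists)
  # and push it to both standbys
  ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
  ssh-copy-id root@10.198.170.12
  ssh-copy-id root@10.198.170.13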

1. Deploy CLup on the primary node first. For the procedure, see the CLup 5.x product manual: Quick Installation.

2. Stop the CLup service on the primary node:

     systemctl stop clup

3. Edit the configuration file:

     cd /opt/clup/conf
     # Edit clup_host_list; add the parameter if it is not present
     clup_host_list = 10.198.170.11,10.198.170.12,10.198.170.13

4. Deploy CLup on the two standby machines, again following the CLup 5.x product manual: Quick Installation.

5. Stop the CLup service on the standby nodes:

     systemctl stop clup

6. Copy the configuration file from the primary to the standby nodes:

     scp /opt/clup/conf/clup.conf root@10.198.170.12:/opt/clup/conf
     scp /opt/clup/conf/clup.conf root@10.198.170.13:/opt/clup/conf

7. Start the CLup service on the primary node first, then on the standby nodes:

     # 10.198.170.11 is the current primary, so start it first
     systemctl start clup
     # Then start the CLup service on 10.198.170.12 and 10.198.170.13
     systemctl start clup

8. Verify: after a short wait, open the web page at 10.198.170.11:8080 and click Overview. If the high-availability status area appears (circled in red in the original screenshot), the deployment succeeded. A command-line reachability check is sketched after this list.
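As a complement to the step-8 verification, the loop below is a sketch (assuming curl is installed) that only confirms each node answers HTTP on port 8080; the Overview page remains the authoritative view of the HA state:

  # Expect an HTTP status code from each node once its CLup service is up
  for ip in 10.198.170.11 10.198.170.12 10.198.170.13; do
      curl -sS -o /dev/null -w "$ip: %{http_code}\n" "http://$ip:8080"
  done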

High Availability Verification

Single-Node Failure

Shut down 10.198.170.11, the machine hosting the current CLup cluster primary, then log in to either standby node and watch the CLup log:

  cd /opt/clup/logs
  tail -f clupserver.log
  2023-07-20 17:46:57,686 ERROR Find primary clupmdb(10.198.170.11) is down(1 times).
  2023-07-20 17:47:07,698 ERROR Find primary clupmdb(10.198.170.11) is down(2 times).
  2023-07-20 17:47:17,705 ERROR Find primary clupmdb(10.198.170.11) is down(3 times).
  2023-07-20 17:47:27,716 ERROR Find primary clupmdb(10.198.170.11) is down(4 times).
  ...
  2023-07-20 17:48:24,769 INFO Receive message: clupmdb(10.198.170.13) promote to primary.
  2023-07-20 17:48:26,466 INFO close old db pool(10.198.170.11)
  2023-07-20 17:48:26,467 INFO create db pool(10.198.170.13)
  2023-07-20 17:48:27,761 INFO My sr(10.198.170.12) to primary(10.198.170.13) lost, start repairing..
  2023-07-20 17:48:27,769 INFO Check clupmdb(10.198.170.12) to primary(10.198.170.13) sr status ..
  2023-07-20 17:48:37,873 INFO clupmdb(10.198.170.12) to primary(10.198.170.13) sr status is None
  2023-07-20 17:48:37,873 INFO prepare clupmdb(10.198.170.12) standby config.
  2023-07-20 17:48:37,875 INFO Begin restart clupmdb(10.198.170.12).
  2023-07-20 17:48:38,135 INFO Recheck clupmdb(10.198.170.12) to primary(10.198.170.13) sr status
  2023-07-20 17:48:38,531 INFO not need rebuild, set clupmdb(10.198.170.12) upper node to 10.198.170.13 successfully!
  2023-07-20 17:48:38,531 INFO Change my clupmdb(10.198.170.12) up primary to 10.198.170.13 successfully

CLup detected that the cluster's current primary was down and promoted one of the standby nodes; in the log above, 10.198.170.13 was promoted to primary.

Open 10.198.170.12:8080 and log in, then check the Overview: the CLup primary has switched to 10.198.170.13, while 10.198.170.11 has become a standby node marked as down.
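The role change can also be confirmed from the command line. As the rebuild log further down shows (pg_basebackup, streaming-replication status), clupmdb is managed as a PostgreSQL instance, so pg_is_in_recovery() reports each node's role. The port, user, and host below are illustrative assumptions; take the real connection settings from /opt/clup/conf/clup.conf:

  # Hypothetical connection parameters; read the real ones from clup.conf
  psql -h 10.198.170.13 -p 5432 -U postgres -c "select pg_is_in_recovery();"
  # Returns f on the primary and t on a standby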

Restart the 10.198.170.11 machine and check whether the CLup service has started:

  # On 10.198.170.11
  systemctl status clup
  # If the service did not start normally, start it manually
  systemctl start clup
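Whether the service comes back on its own after a reboot depends on whether it is enabled in systemd; this manual does not say whether the CLup installation enables it, so the following is an optional, assumed extra step:

  # On each node: have systemd start the clup service automatically at boot
  systemctl enable clup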

Checking the Overview again, we find that the CLup status of 10.198.170.11 is still abnormal.

At this point the CLup log shows that the clupmdb on 10.198.170.11 is being rebuilt to become a standby of 10.198.170.13:

  # On any node
  tail -f /opt/clup/logs/clupserver.log
  2023-07-20 18:11:24,698 INFO clupmdb(10.198.170.11) to primary(10.198.170.13) sr status is startup
  2023-07-20 18:11:24,698 INFO clupmdb(10.198.170.11) sr status is not normal, so must rebuild it.
  2023-07-20 18:11:24,699 INFO Begin rebuild standby clupmdb(10.198.170.11) from primary(10.198.170.13)..
  2023-07-20 18:11:24,699 INFO Stop clupmdb(10.198.170.11)
  2023-07-20 18:11:24,848 INFO clupmdb(10.198.170.11) is stopped
  2023-07-20 18:11:25,116 INFO Waiting for determine who is primary, please wait..
  2023-07-20 18:11:25,152 INFO Use pg_basebackup rebuild clupmdb(10.198.170.11)..
  2023-07-20 18:11:25,209 INFO waiting for checkpoint
  2023-07-20 18:11:30,121 INFO Waiting for determine who is primary, please wait.
  2023-07-20 18:11:30,956 INFO 0/727280 kB (0%), 0/1 tablespace
  ...
  2023-07-20 18:12:11,440 INFO 727312/727312 kB (100%), 0/1 tablespace
  2023-07-20 18:12:13,342 INFO Start clupmdb(10.198.170.11)
  2023-07-20 18:12:13,695 INFO Recheck clupmdb(10.198.170.11) to primary(10.198.170.13) sr status
  2023-07-20 18:12:13,707 INFO clupmdb(10.198.170.11) to primary(10.198.170.13) sr status is streaming
  2023-07-20 18:12:13,708 INFO Rebuild standby clupmdb(10.198.170.11) from upper node(10.198.170.13) successfully

Open 10.198.170.11:8080 again and click Overview: the CLup cluster status of 10.198.170.11 has returned to normal as well.
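The rebuild above is fully automatic; nothing needs to be run by hand. Purely to illustrate what CLup does internally, the following is a minimal manual equivalent of a pg_basebackup standby rebuild; the replication user and data directory are assumptions for illustration, not values from this manual:

  # Illustration only: CLup performs this rebuild itself.
  # Take a fresh base backup from the new primary, stream WAL during the
  # copy (-X stream), and write standby connection settings (-R).
  pg_basebackup -h 10.198.170.13 -U repl -D /opt/clup/clupmdb/data -X stream -R -P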

Multi-Node Failure: One Primary and One Standby

Now stop both the 10.198.170.11 and 10.198.170.13 (primary) machines, then check the log on 10.198.170.12:

  cd /opt/clup/logs
  tail -f clupserver.log
  2023-07-21 09:20:21,814 ERROR Find primary clupmdb(10.198.170.13) is down(1 times).
  2023-07-21 09:20:21,814 ERROR Find primary clupmdb(10.198.170.13) is down(2 times).
  2023-07-21 09:20:34,823 ERROR Find primary clupmdb(10.198.170.13) is down(3 times).
  2023-07-21 09:20:45,569 ERROR Find primary clupmdb(10.198.170.13) is down(4 times)
  2023-07-21 09:20:58,581 ERROR Find primary clupmdb(10.198.170.13) is down(5 times).
  2023-07-21 09:21:10,586 ERROR Find primary clupmdb(10.198.170.13) is down(6 times)
  2023-07-21 09:21:23,597 ERROR Find primary clupmdb(10.198.170.13) is down(7 times)
  2023-07-21 09:21:35,603 ERROR Find primary clupmdb(10.198.170.13) is down(8 times)
  2023-07-21 09:21:48,615 ERROR Find primary clupmdb(10.198.170.13) is down(9 times).
  2023-07-21 09:22:00,617 ERROR Find primary clupmdb(10.198.170.13) is down(10 times)
  2023-07-21 09:22:00,617 ERROR Clupmdb(10.198.170.13) is down
  2023-07-21 09:22:00,619 ERROR Can not connect clup(10.198.170.11) for get elect lock(try again later): Can not connect 10.198.170.11: [Errno 11] Connection refused

At this point two of the CLup cluster's three nodes, including the primary, are down. As the last log line shows, the surviving standby cannot reach a majority of nodes to obtain the elect lock, so it will not promote itself and CLup cannot work normally. This situation must be repaired promptly.

Start the 10.198.170.13 (primary) machine and check whether the CLup service has started:

  systemctl status clup
  # If the service did not start normally, start it manually
  systemctl start clup

After a short wait, log in to 10.198.170.13:8080 and check the current status of the CLup cluster.

You can now see that both the .12 and .13 machines are normal again and their management pages are accessible. If the .13 machine (the primary) cannot be repaired, contact us for emergency repair.

Start the .11 machine again; after a short wait, the Overview shows its status returning to normal as well.

Multi-Node Failure: Two Standby Nodes

Now stop the two standby machines, 10.198.170.11 and 10.198.170.12, and check the CLup log on 10.198.170.13:

  cd /opt/clup/logs
  tail -f clupserver.log
  2023-07-21 09:39:13,817 ERROR Find two clupmdb is down(1 times), recheck..
  2023-07-21 09:39:27,841 ERROR Find two clupmdb is down(2 times), recheck.
  2023-07-21 09:39:40,857 ERROR Find two clupmdb is down(3 times), recheck..
  2023-07-21 09:39:53,871 ERROR Find two clupmdb is down(4 times), recheck
  ...
  7. ...

Log in to 10.198.170.13:8080 and click Overview to check CLup's status.

Both standby nodes, .11 and .12, are now shown as down, but the management page is still accessible and CLup remains usable.

Power the .11 and .12 machines back on and check whether the CLup service has started:

  systemctl status clup
  # If the service did not start normally, start it manually
  systemctl start clup

After a short wait, all three machines are back to normal.
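As a final sanity check, and assuming key-based root SSH between the nodes (see the optional setup at the start of the deployment section), a short loop confirms the service state on all three machines:

  # Prints active/inactive/failed for the clup service on each node
  for ip in 10.198.170.11 10.198.170.12 10.198.170.13; do
      echo -n "$ip: "
      ssh root@$ip systemctl is-active clup
  done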
