
Using CLup's MMN Architecture

1. CLup's MMN Architecture

For an introduction to CLup's MMN architecture and installation instructions, see: Deploying a CLup MMN Cluster.

This article focuses on testing it.

The demo environment is as follows:

No.  IP             Role
1    10.198.170.11  primary
2    10.198.170.12  standby
3    10.198.170.13  standby

2. MMN Architecture Verification

2.1 Single-Node Failure

Shut down 10.198.170.11, the machine hosting the current primary of the CLup cluster, then log in to any standby node and check the CLup log:

  cd /opt/clup/logs
  tail -f clupserver.log
  2023-07-20 17:46:57,686 ERROR Find primary clupmdb(10.198.170.11) is down(1 times).
  2023-07-20 17:47:07,698 ERROR Find primary clupmdb(10.198.170.11) is down(2 times).
  2023-07-20 17:47:17,705 ERROR Find primary clupmdb(10.198.170.11) is down(3 times).
  2023-07-20 17:47:27,716 ERROR Find primary clupmdb(10.198.170.11) is down(4 times).
  ...
  2023-07-20 17:48:24,769 INFO Receive message: clupmdb(10.198.170.13) promote to primary.
  2023-07-20 17:48:26,466 INFO close old db pool(10.198.170.11)
  2023-07-20 17:48:26,467 INFO create db pool(10.198.170.13)
  2023-07-20 17:48:27,761 INFO My sr(10.198.170.12) to primary(10.198.170.13) lost, start repairing..
  2023-07-20 17:48:27,769 INFO Check clupmdb(10.198.170.12) to primary(10.198.170.13) sr status ..
  2023-07-20 17:48:37,873 INFO clupmdb(10.198.170.12) to primary(10.198.170.13) sr status is None
  2023-07-20 17:48:37,873 INFO prepare clupmdb(10.198.170.12) standby config.
  2023-07-20 17:48:37,875 INFO Begin restart clupmdb(10.198.170.12).
  2023-07-20 17:48:38,135 INFO Recheck clupmdb(10.198.170.12) to primary(10.198.170.13) sr status
  2023-07-20 17:48:38,531 INFO not need rebuild, set clupmdb(10.198.170.12) upper node to 10.198.170.13 successfully!
  2023-07-20 17:48:38,531 INFO Change my clupmdb(10.198.170.12) up primary to 10.198.170.13 successfully

CLup detected that the current primary of the CLup cluster was abnormal and then promoted one of the standby nodes to primary; in the log above, 10.198.170.13 was promoted.
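Judging from the log output, the failover follows a simple pattern: the standbys repeatedly probe the primary, count consecutive misses, and trigger a promotion once the count reaches a threshold. A minimal sketch of that pattern (an illustration only, not CLup's actual code; the `probe`/`promote` callables and the threshold value are assumptions inferred from the log):

```python
# Illustrative sketch (NOT CLup's implementation): count consecutive
# failed health checks of the primary clupmdb and trigger a promotion
# once a threshold is reached, mirroring the "down(N times)" log lines.

def monitor_primary(probe, promote, threshold=10):
    """Probe the primary; after `threshold` consecutive misses, promote.

    probe()   -- hypothetical health check, returns True if primary is up
    promote() -- hypothetical callback that elects/promotes a standby
    """
    failures = 0
    while True:
        if probe():
            failures = 0          # primary answered, reset the counter
        else:
            failures += 1
            print(f"Find primary clupmdb is down({failures} times).")
            if failures >= threshold:
                promote()         # hand over to a standby
                return failures
```

In section 2.2 below, the log shows exactly ten "down(N times)" lines before the cluster declares the primary dead, which is where the default threshold in this sketch comes from.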

Open 10.198.170.12:8080 and log in. On the Overview page you can see that the CLup primary has switched to 10.198.170.13, while 10.198.170.11 has become a standby node marked as down.

Restart the machine 10.198.170.11 and check whether the CLup service has started:

  # on 10.198.170.11
  systemctl status clup
  # if the service did not start automatically, start it manually
  systemctl start clup

Checking the Overview page again, we find that the CLup status of 10.198.170.11 is still abnormal.

At this point, checking the CLup log shows that the clupmdb on 10.198.170.11 has been rebuilt and is now a standby of 10.198.170.13:

  # on any node
  tail -f /opt/clup/logs/clupserver.log
  2023-07-20 18:11:24,698 INFO clupmdb(10.198.170.11) to primary(10.198.170.13) sr status is startup
  2023-07-20 18:11:24,698 INFO clupmdb(10.198.170.11) sr status is not normal, so must rebuild it.
  2023-07-20 18:11:24,699 INFO Begin rebuild standby clupmdb(10.198.170.11) from primary(10.198.170.13)..
  2023-07-20 18:11:24,699 INFO Stop clupmdb(10.198.170.11)
  2023-07-20 18:11:24,848 INFO clupmdb(10.198.170.11) is stopped
  2023-07-20 18:11:25,116 INFO Waiting for determine who is primary, please wait..
  2023-07-20 18:11:25,152 INFO Use pg_basebackup rebuild clupmdb(10.198.170.11)..
  2023-07-20 18:11:25,209 INFO waiting for checkpoint
  2023-07-20 18:11:30,121 INFO Waiting for determine who is primary, please wait.
  2023-07-20 18:11:30,956 INFO 0/727280 kB (0%), 0/1 tablespace
  ...
  2023-07-20 18:12:11,440 INFO 727312/727312 kB (100%), 0/1 tablespace
  2023-07-20 18:12:13,342 INFO Start clupmdb(10.198.170.11)
  2023-07-20 18:12:13,695 INFO Recheck clupmdb(10.198.170.11) to primary(10.198.170.13) sr status
  2023-07-20 18:12:13,707 INFO clupmdb(10.198.170.11) to primary(10.198.170.13) sr status is streaming
  2023-07-20 18:12:13,708 INFO Rebuild standby clupmdb(10.198.170.11) from upper node(10.198.170.13) successfully
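The rebuild logic visible in this log can be summarized as: if a returning node's streaming-replication (sr) status is anything other than "streaming", rebuild it from the current primary with pg_basebackup. A minimal sketch, using hypothetical helper names (not CLup's API; the data directory and replication user are assumptions):

```python
# Illustrative sketch (assumption, not CLup's code): decide whether a
# returning clupmdb node needs a full rebuild from its sr status, and
# build the pg_basebackup command line that would recreate it.

def needs_rebuild(sr_status):
    # A healthy standby reports "streaming"; anything else
    # (e.g. "startup", None) means replication is broken.
    return sr_status != "streaming"

def rebuild_command(primary_host, data_dir, user="clup"):
    # pg_basebackup -R writes the replication settings so the fresh
    # copy starts as a standby of primary_host; -P shows progress,
    # -X stream ships WAL during the base backup.
    return ["pg_basebackup", "-h", primary_host, "-U", user,
            "-D", data_dir, "-R", "-P", "-X", "stream"]
```

Here "startup" (the status logged for 10.198.170.11) fails the check, so the node is stopped, copied from 10.198.170.13, and restarted as its standby.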

Open 10.198.170.11:8080 again and click Overview; the clup cluster status of 10.198.170.11 has also returned to normal.

2.2 Multi-Node Failure: One Primary and One Standby

Now shut down the machines 10.198.170.11 and 10.198.170.13 (primary), and check the log on 10.198.170.12:

  tail -f /opt/clup/logs/clupserver.log
  2023-07-21 09:20:21,814 ERROR Find primary clupmdb(10.198.170.13) is down(1 times).
  2023-07-21 09:20:21,814 ERROR Find primary clupmdb(10.198.170.13) is down(2 times).
  2023-07-21 09:20:34,823 ERROR Find primary clupmdb(10.198.170.13) is down(3 times).
  2023-07-21 09:20:45,569 ERROR Find primary clupmdb(10.198.170.13) is down(4 times)
  2023-07-21 09:20:58,581 ERROR Find primary clupmdb(10.198.170.13) is down(5 times).
  2023-07-21 09:21:10,586 ERROR Find primary clupmdb(10.198.170.13) is down(6 times)
  2023-07-21 09:21:23,597 ERROR Find primary clupmdb(10.198.170.13) is down(7 times)
  2023-07-21 09:21:35,603 ERROR Find primary clupmdb(10.198.170.13) is down(8 times)
  2023-07-21 09:21:48,615 ERROR Find primary clupmdb(10.198.170.13) is down(9 times).
  2023-07-21 09:22:00,617 ERROR Find primary clupmdb(10.198.170.13) is down(10 times)
  2023-07-21 09:22:00,617 ERROR Clupmdb(10.198.170.13) is down
  2023-07-21 09:22:00,619 ERROR Can not connect clup(10.198.170.11) for get elect lock(try again later): Can not connect 10.198.170.11: [Errno 11] Connection refused

At this point two nodes of the CLup cluster, including the primary, are down, so CLup cannot work normally. This situation should be repaired promptly.
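One plausible reason the surviving standby cannot take over here (an inference from the elect-lock error in the log above, not documented CLup behavior) is that electing a new primary requires a majority of nodes, and with one node alive out of three there is no majority:

```python
# Illustrative sketch (assumption): a strict-majority quorum check of
# the kind commonly used to decide whether a failover election is safe.

def has_quorum(alive, total):
    """True if a strict majority of the cluster's nodes is reachable."""
    return alive > total // 2

# Section 2.1: 2 of 3 nodes alive -> quorum, failover can proceed.
# Section 2.2: 1 of 3 nodes alive -> no quorum, the survivor must wait.
```

This matches the observed behavior: with two of three nodes alive the cluster promoted a new primary on its own, while with only one alive it kept retrying and required manual intervention.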

Start the machine 10.198.170.13 (primary) and check whether the CLup service has started:

  systemctl status clup
  # if the service did not start automatically, start it manually
  systemctl start clup

After a short wait, log in to 10.198.170.13:8080 and check the current status of the CLup cluster.

Now machines 12 and 13 are both normal again, and both management pages are accessible. If 13 (the primary) cannot be repaired, you can contact us for an emergency repair.

Start machine 11 again; after a short wait, the Overview page shows that machine 11 has also returned to normal.

2.3 Multi-Node Failure: Two Standby Nodes

Now stop the two standby machines 11 and 12, and check the CLup log on machine 13:

  cd /opt/clup/logs
  tail -f clupserver.log
  2023-07-21 09:39:13,817 ERROR Find two clupmdb is down(1 times), recheck..
  2023-07-21 09:39:27,841 ERROR Find two clupmdb is down(2 times), recheck.
  2023-07-21 09:39:40,857 ERROR Find two clupmdb is down(3 times), recheck..
  2023-07-21 09:39:53,871 ERROR Find two clupmdb is down(4 times), recheck
  7. ...

Log in to 10.198.170.13:8080 and click Overview to check CLup's status.

The status of the two standby nodes 11 and 12 is down, but the management page is still accessible and the cluster remains usable.

Boot machines 11 and 12 again and check whether the CLup service has started:

  systemctl status clup
  # if the service did not start automatically, start it manually
  systemctl start clup

After a short wait, all three machines return to normal.
