Using CLup's MMN Architecture
1. CLup's MMN Architecture
For an introduction to CLup's MMN architecture and how to install it, see: Deploying a CLup MMN Cluster
Here we mainly run a few tests against it.
The demo environment is as follows:
No. | IP | Role
---|---|---
1 | 10.198.170.11 | primary |
2 | 10.198.170.12 | standby |
3 | 10.198.170.13 | standby |
2. Validating the MMN Architecture
2.1 Single-node failure
Shut down 10.198.170.11, the machine hosting the current CLup cluster primary, then log in to any standby node and watch the CLup log:
cd /opt/clup/logs
tail -f clupserver.log
2023-07-20 17:46:57,686 ERROR Find primary clupmdb(10.198.170.11) is down(1 times).
2023-07-20 17:47:07,698 ERROR Find primary clupmdb(10.198.170.11) is down(2 times).
2023-07-20 17:47:17,705 ERROR Find primary clupmdb(10.198.170.11) is down(3 times).
2023-07-20 17:47:27,716 ERROR Find primary clupmdb(10.198.170.11) is down(4 times).
...
2023-07-20 17:48:24,769 INFO Receive message: clupmdb(10.198.170.13) promote to primary.
2023-07-20 17:48:26,466 INFO close old db pool(10.198.170.11)
2023-07-20 17:48:26,467 INFO create db pool(10.198.170.13)
2023-07-20 17:48:27,761 INFO My sr(10.198.170.12) to primary(10.198.170.13) lost, start repairing..
2023-07-20 17:48:27,769 INFO Check clupmdb(10.198.170.12) to primary(10.198.170.13) sr status ..
2023-07-20 17:48:37,873 INFO clupmdb(10.198.170.12) to primary(10.198.170.13) sr status is None
2023-07-20 17:48:37,873 INFO prepare clupmdb(10.198.170.12) standby config.
2023-07-20 17:48:37,875 INFO Begin restart clupmdb(10.198.170.12).
2023-07-20 17:48:38,135 INFO Recheck clupmdb(10.198.170.12) to primary(10.198.170.13) sr status
2023-07-20 17:48:38,531 INFO not need rebuild, set clupmdb(10.198.170.12) upper node to 10.198.170.13 successfully!
2023-07-20 17:48:38,531 INFO Change my clupmdb(10.198.170.12) up primary to 10.198.170.13 successfully
CLup detected that the cluster's current primary was down and promoted one of the standby nodes; in the log above, 10.198.170.13 was promoted to primary.
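The repeated `down(N times)` lines suggest a probe-and-count loop: the primary is checked on an interval and declared dead only after a run of consecutive failures. The sketch below is an assumption about that logic, not CLup's actual code; the threshold of 10 matches the count the log shows before `Clupmdb(...) is down`.

```shell
#!/bin/sh
# Hypothetical sketch of the failure-detection loop implied by the log output.
# This is an assumption, not CLup's actual implementation.
MAX_FAILS=10

probe_primary() {
    check=$1   # a command that exits 0 when the primary answers
    fails=0
    while [ "$fails" -lt "$MAX_FAILS" ]; do
        if sh -c "$check" >/dev/null 2>&1; then
            echo "primary is alive"
            return 0
        fi
        fails=$((fails + 1))
        echo "Find primary clupmdb is down($fails times)."
    done
    echo "primary declared down, promoting a standby"
    return 1
}

# e.g. probe_primary "pg_isready -h 10.198.170.11"
# (using pg_isready as the probe is itself an assumption)
```

The real agent additionally has to coordinate with the other surviving node via an elect lock before promoting, as the log in the multi-node failure test below shows.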
Open 10.198.170.12:8080, log in, and check the overview page: the CLup primary has switched to 10.198.170.13, while 10.198.170.11 is now a standby marked as down.
Restart machine 10.198.170.11 and check whether the CLup service has started:
# on 10.198.170.11
systemctl status clup
# if the service did not start on its own, start it manually
systemctl start clup
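Since the machine was just rebooted, the service may take a moment to come up; a small polling helper (hypothetical, not part of CLup) saves re-running `systemctl status` by hand:

```shell
#!/bin/sh
# Hypothetical polling helper: retry a check command until it succeeds
# or the attempts run out. Not part of CLup itself.
wait_for() {
    cmd=$1; attempts=$2; interval=$3
    i=1
    while [ "$i" -le "$attempts" ]; do
        if sh -c "$cmd" >/dev/null 2>&1; then
            echo "ok after $i attempt(s)"
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "gave up after $attempts attempts"
    return 1
}

# On the rebooted node one might run:
#   wait_for "systemctl is-active clup" 30 2 || systemctl start clup
```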
Checking the overview again, we see that the CLup status of 10.198.170.11 is still abnormal.
Looking at the CLup log, we find that the clupmdb on 10.198.170.11 has been rebuilt and is now a standby of 10.198.170.13:
# on any node
tail -f /opt/clup/logs/clupserver.log
2023-07-20 18:11:24,698 INFO clupmdb(10.198.170.11) to primary(10.198.170.13) sr status is startup
2023-07-20 18:11:24,698 INFO clupmdb(10.198.170.11) sr status is not normal, so must rebuild it.
2023-07-20 18:11:24,699 INFO Begin rebuild standby clupmdb(10.198.170.11) from primary(10.198.170.13)..
2023-07-20 18:11:24,699 INFO Stop clupmdb(10.198.170.11)
2023-07-20 18:11:24,848 INFO clupmdb(10.198.170.11) is stopped
2023-07-20 18:11:25,116 INFO Waiting for determine who is primary, please wait..
2023-07-20 18:11:25,152 INFO Use pg_basebackup rebuild clupmdb(10.198.170.11)..
2023-07-20 18:11:25,209 INFO waiting for checkpoint
2023-07-20 18:11:30,121 INFO Waiting for determine who is primary, please wait.
2023-07-20 18:11:30,956 INFO 0/727280 kB (0%), 0/1 tablespace
...
2023-07-20 18:12:11,440 INFO 727312/727312 kB (100%), 0/1 tablespace
2023-07-20 18:12:13,342 INFO Start clupmdb(10.198.170.11)
2023-07-20 18:12:13,695 INFO Recheck clupmdb(10.198.170.11) to primary(10.198.170.13) sr status
2023-07-20 18:12:13,707 INFO clupmdb(10.198.170.11) to primary(10.198.170.13) sr status is streaming
2023-07-20 18:12:13,708 INFO Rebuild standby clupmdb(10.198.170.11) from upper node(10.198.170.13) successfully
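The rebuild in the log uses `pg_basebackup`, so clupmdb is a PostgreSQL instance, and the standby's state can be cross-checked on the primary through the standard `pg_stat_replication` view. The `psql` invocation in the comment is a sketch (the user and database name are assumptions); the small checker below consumes its `addr|state` output and is what we can exercise offline.

```shell
#!/bin/sh
# On the current primary (10.198.170.13) one could run something like
# (user/db names are assumptions):
#   psql -U postgres -d clupmdb -Atc \
#     "SELECT client_addr, state FROM pg_stat_replication;"
# and pipe the "addr|state" rows into this checker, which fails unless
# every standby reports state=streaming.
check_sr() {
    bad=0
    while IFS='|' read -r addr state; do
        [ -z "$addr" ] && continue
        if [ "$state" = "streaming" ]; then
            echo "$addr: ok"
        else
            echo "$addr: NOT streaming ($state)"
            bad=1
        fi
    done
    return "$bad"
}
```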
Open 10.198.170.11:8080 again and check the overview: the clup cluster status of 10.198.170.11 is back to normal as well.
2.2 Multi-node failure: primary plus one standby
Now shut down both the 10.198.170.11 and 10.198.170.13 (primary) machines, then check the log on 10.198.170.12:
tail -f /opt/clup/logs/clupserver.log
2023-07-21 09:20:21,814 ERROR Find primary clupmdb(10.198.170.13) is down(1 times).
2023-07-21 09:20:21,814 ERROR Find primary clupmdb(10.198.170.13) is down(2 times).
2023-07-21 09:20:34,823 ERROR Find primary clupmdb(10.198.170.13) is down(3 times).
2023-07-21 09:20:45,569 ERROR Find primary clupmdb(10.198.170.13) is down(4 times)
2023-07-21 09:20:58,581 ERROR Find primary clupmdb(10.198.170.13) is down(5 times).
2023-07-21 09:21:10,586 ERROR Find primary clupmdb(10.198.170.13) is down(6 times)
2023-07-21 09:21:23,597 ERROR Find primary clupmdb(10.198.170.13) is down(7 times)
2023-07-21 09:21:35,603 ERROR Find primary clupmdb(10.198.170.13) is down(8 times)
2023-07-21 09:21:48,615 ERROR Find primary clupmdb(10.198.170.13) is down(9 times).
2023-07-21 09:22:00,617 ERROR Find primary clupmdb(10.198.170.13) is down(10 times)
2023-07-21 09:22:00,617 ERROR Clupmdb(10.198.170.13) is down
2023-07-21 09:22:00,619 ERROR Can not connect clup(10.198.170.11) for get elect lock(try again later): Can not connect 10.198.170.11: [Errno 11] Connection refused
Because two nodes are down, one of them the primary, the CLup cluster can no longer operate; this situation must be repaired promptly.
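The `get elect lock` error hints at why the lone survivor cannot take over: promotion needs agreement from a second node, i.e. a majority of the three. The arithmetic below is an assumption about the election rule, not CLup's documented algorithm; note that a surviving primary keeps serving with a minority (see the two-standby test below) — the majority would only be needed to elect a new primary.

```shell
#!/bin/sh
# Assumed majority rule for electing a new primary (an assumption, not
# CLup's documented behavior): survivors must outnumber half the cluster.
has_quorum() {
    alive=$1; total=$2
    [ $((alive * 2)) -gt "$total" ]
}

if has_quorum 1 3; then
    echo "can elect a new primary"
else
    echo "cannot elect: no majority"   # the situation in this test
fi
```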
Start the 10.198.170.13 (primary) machine and check whether the CLup service has started:
systemctl status clup
# if the service did not start on its own, start it manually
systemctl start clup
After a short wait, log in to 10.198.170.13:8080 and check the current state of the CLup cluster.
Both 10.198.170.12 and 10.198.170.13 are now healthy, and their management pages are reachable again. If 10.198.170.13 (the primary) cannot be repaired, contact us for emergency recovery.
Start the 10.198.170.11 machine again; after a short wait, the overview shows its status back to normal as well.
2.3 Multi-node failure: both standbys
Now stop the two standby machines, 10.198.170.11 and 10.198.170.12, and check the CLup log on 10.198.170.13:
cd /opt/clup/logs
tail -f clupserver.log
2023-07-21 09:39:13,817 ERROR Find two clupmdb is down(1 times), recheck..
2023-07-21 09:39:27,841 ERROR Find two clupmdb is down(2 times), recheck.
2023-07-21 09:39:40,857 ERROR Find two clupmdb is down(3 times), recheck..
2023-07-21 09:39:53,871 ERROR Find two clupmdb is down(4 times), recheck
...
Log in to 10.198.170.13:8080 and open the overview to check CLup's status.
Both standbys, 10.198.170.11 and 10.198.170.12, show as down, but the management page is still reachable and fully usable.
Power the 10.198.170.11 and 10.198.170.12 machines back on and check whether the CLup service has started:
systemctl status clup
# if the service did not start on its own, start it manually
systemctl start clup
After a short wait, all three machines are back to normal.