greenplum_fdw在postgres_fdw的基础上,针对MPP的场景,做了一些针对性的优化,是postgres_fdw的MPP版本。greenplum_fdw使用GP数据库并行cursor来直接从远程GP集群的数据节点直接并行取数据到本地的GP集群,在GP7之后,支持了谓词下推,对于JOINS
,aggregate functions(count(),max(),avg()等)
,ORDER BY
,LIMIT
等操作可以在适当情况下直接在远程GP服务器执行。
准备了2套GP数据库集群环境,1套PG数据库环境用于后续的测试,如下所示:
编号 | 环境 | 位置 |
---|---|---|
1 | GP7 | 远程 |
2 | GP7 | 本地 |
3 | PG12 | 本地 |
在远程GP7环境创建测试数据库和测试表
CREATE USER fdw PASSWORD 'fdw';
CREATE DATABASE fdw WITH OWNER fdw;
造一些测试数据
CREATE TABLE fdw_test_aggregate (id int, name text);
insert into fdw_test_aggregate select seq, md5(seq::text) from generate_series(1, 10000000) as seq;
在本地PG12的环境中安装postgres_fdw
CREATE EXTENSION postgres_fdw;
创建本地PG到远程GP的服务器和用户映射
CREATE SERVER server_remote_gp FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '10.197.166.177', port '5432', dbname 'fdw');
CREATE USER MAPPING FOR PUBLIC SERVER server_remote_gp OPTIONS (user 'fdw', password 'fdw');
创建外部表映射到远程GP的fdw_test_aggregate
CREATE FOREIGN TABLE f_test_aggregate(id int, name text) SERVER server_remote_gp OPTIONS (schema_name 'public', table_name 'fdw_test_aggregate');
查看PG查询这个外部表的执行计划,这里的返回行数是1行,返回的是一个聚合的值,总执行时间是394ms。
postgres=# EXPLAIN ANALYZE SELECT count(*) FROM f_test_aggregate ;
QUERY PLAN
---------------------------------------------------------------------------
-----------------------
Foreign Scan (cost=108.53..152.69 rows=1 width=8) (actual time=281.823..2
81.826 rows=1 loops=1)
Relations: Aggregate on (public.f_test_aggregate)
Planning Time: 0.184 ms
Execution Time: 394.695 ms
(4 rows)
打开timing后的执行时间是287ms
postgres=# \timing on
Timing is on.
postgres=# SELECT count(*) FROM f_test_aggregate ;
count
----------
10000000
(1 row)
Time: 287.962 ms
在本地的GP环境安装greenplum_fdw
插件
CREATE EXTENSION greenplum_fdw;
创建本地GP到远程GP的服务器和用户映射
CREATE SERVER server_remote_gp FOREIGN DATA WRAPPER greenplum_fdw OPTIONS (host '10.197.166.177', port '5432', dbname 'fdw', mpp_execute 'all segments', num_segments '4');
CREATE USER MAPPING FOR PUBLIC SERVER server_remote_gp OPTIONS (user 'fdw', password 'fdw');
创建外部表映射到远程GP的fdw_test_aggregate
CREATE FOREIGN TABLE f_test_aggregate(id int, name text) SERVER server_remote_gp OPTIONS (schema_name 'public', table_name 'fdw_test_aggregate');
在/etc/hosts
里配置远程GP的segments别名
10.197.166.177 mdw
10.197.166.178 sdw1
10.197.166.179 sdw2
查看本地GP查询这个外部表的执行计划,可以看到开了4个并行去远程GP的Segments节点拉取数据,将数据拉取到本地后聚合。
postgres=# EXPLAIN ANALYZE SELECT count(*) FROM f_test_aggregate ;
QUERY PLAN
---------------------------------------------------------------------------
--------------------------------------------------------------
Aggregate (cost=0.00..431.00 rows=1 width=8) (actual time=4840.883..4840.
891 rows=1 loops=1)
-> Gather Motion 4:1 (slice1; segments: 4) (cost=0.00..431.00 rows=1
width=1) (actual time=61.096..4204.340 rows=10000000 loops=1)
-> Foreign Scan on f_test_aggregate (cost=0.00..431.00 rows=1 wi
dth=1) (actual time=0.685..4489.799 rows=2501355 loops=1)
Relations:
Optimizer: GPORCA
Planning Time: 11.424 ms
(slice0) Executor memory: 119K bytes.
(slice1) Executor memory: 43K bytes avg x 4 workers, 43K bytes max (s
eg0).
Memory used: 128000kB
Execution Time: 5033.470 ms
(10 rows)
打开timing后的执行时间是4631ms
postgres=# SELECT count(*) FROM f_test_aggregate ;
count
----------
10000000
(1 row)
Time: 4631.092 ms (00:04.631)
可以看到,在执行count这类聚合操作时,使用postgres_fdw可以直接从远程GP节点获取聚合后的数据,而greenplum_fdw则会从远程GP的segments节点(当创建server时指定mpp_execute 'all segments'
时)拉取数据到本地后进行聚合。
在PG数据库中COPY远程GP的外部表,可以看到,用时21.56秒
postgres=# \timing on
Timing is on.
postgres=# COPY (select * from f_test_aggregate) TO '/tmp/t_copy.csv' WITH CSV HEADER;
COPY 10000000
Time: 21562.885 ms (00:21.563)
在本地GP数据库中COPY远程GP的外部表,开4个并行,可以看到,仅用时6.4秒
postgres=# \timing on
Timing is on.
postgres=# COPY (select * from f_test_aggregate) TO '/tmp/t_copy.csv' WITH CSV HEADER;
COPY 10000000
Time: 6419.414 ms (00:06.419)
greenplum_fdw在postgres_fdw的基础上增加了并行的能力,在需要拉取大量数据的场景下,会有显著优势。由于greenplum_fdw会拉取数据到本地再聚合,所以对于count, sum等聚合操作性能不佳。