物化视图是一种预先计算并存储查询结果的数据结构,它可以显著加速查询响应时间。在某些场景下,物化视图需要定期更新以反映源表中的最新数据变化。CloudberryDB的增量物化视图提供了一种高效的方式来处理这种更新需求,尤其适用于那些写入较少但读取频繁且需要实时数据的场景。
增量物化视图的主要优势在于它能够在数据发生变化时仅更新受影响的部分,而不是重新计算整个视图。这有助于保持查询性能的同时,确保视图的实时性。但是这种设计也带来了一定的局限性,尤其是在写入性能方面。由于每次基表更新都会触发视图的更新,因此在写入操作较多的场景下,写入性能可能会有所下降。
官方的TPC-H测试结果如下:
| Data volume | Normal inserts | Inserts with incremental materialized view | Queries with regular view | Queries with incremental materialized view |
|---|---|---|---|---|
| 1 GB | 2712 ms | 3777 ms | 813 ms | 43 ms |
| 5 GB | 15681 ms | 29363 ms | 3040 ms | 63 ms |
| 10 GB | 43011 ms | 73188 ms | 7057 ms | 102 ms |
可以看到,增量物化视图相比于普通的视图,显著提升了查询性能,但代价是插入性能严重下滑。基于增量物化视图的实现原理,随着增量物化视图数量的增加,基表的插入性能退化倍数和增量物化视图个数成正比。
官方并没有给出随着增量物化视图数量增加,插入性能退化的相关测试数据,下面让我们来进行简单的测试。
使用下面的python脚本来测试增量物化视图数量为0个、1个、2个、3个的情况下,插入性能情况:
import psycopg2from faker import Fakerimport randomimport uuidfrom datetime import datetimeimport timeimport threading# 测试参数num_car_models = 100num_sales = 1*1000materialized_view_names = ['brand_sales', 'brand_model_sales', 'color_sales']materialized_view_sqls = ["""CREATE INCREMENTAL MATERIALIZED VIEW brand_sales ASSELECTbrand,COUNT(*) AS salesFROMcar_modelJOIN sale ON car_model.id = sale.car_modelGROUP BYbrand;""","""CREATE INCREMENTAL MATERIALIZED VIEW brand_model_sales ASSELECTbrand,model,COUNT(*) AS salesFROMcar_modelJOIN sale ON car_model.id = sale.car_modelGROUP BYbrand,model;""","""CREATE INCREMENTAL MATERIALIZED VIEW color_sales ASSELECTcolor,COUNT(*) AS salesFROMcar_modelJOIN sale ON car_model.id = sale.car_modelGROUP BYcolor;"""]# 初始化 Fakerfake = Faker()# 生成汽车品牌数据brands = ['Toyota', 'Honda', 'Ford', 'Chevrolet', 'Nissan', 'Volkswagen', 'Mercedes-Benz', 'BMW', 'Audi', 'Hyundai', 'Kia', 'Subaru', 'Mazda', 'Jeep', 'Lexus', 'Ram', 'Buick', 'GMC', 'Cadillac', 'Acura', 'Infiniti', 'Porsche', 'Volvo', 'Lincoln', 'Land Rover', 'Chrysler', 'Mitsubishi', 'Jaguar', 'Mini', 'Fiat', 'Bentley', 'Scion', 'Smart', 'Maserati', 'Ferrari', 'Tesla', 'Aston Martin', 'Rolls-Royce', 'Lamborghini', 'Lotus', 'McLaren', 'Bugatti', 'Alfa Romeo', 'Genesis', 'Karma', 'Rivian', 'Lucid', 'Polestar', 'Rivian', 'Byton', 'Faraday Future']def generate_car_model_data(num_records):global brandsfor _ in range(num_records):yield {'id': str(uuid.uuid1()),'brand': random.choice(brands),'model': fake.word(),'color': fake.color_name(),'year': random.randint(2000, 2023)}def generate_sale_data(car_models, num_records):car_models_ids = [cm['id'] for cm in car_models]for _ in range(num_records):yield {'id': str(uuid.uuid4()),'car_model': random.choice(car_models_ids),'date': fake.date_this_decade()}def recreate_table(conn, view_num):cur = conn.cursor()for name in materialized_view_names:cur.execute(f"""DROP MATERIALIZED VIEW IF EXISTS {name};""")cur.execute("""DROP TABLE IF EXISTS sale;""")cur.execute("""DROP TABLE IF EXISTS car_model;""")cur.execute("""CREATE TABLE IF NOT EXISTS car_model (id UUID PRIMARY KEY,brand TEXT,model TEXT,color TEXT,year INT);""")cur.execute("""CREATE TABLE IF NOT EXISTS sale (id UUID PRIMARY KEY,car_model UUID REFERENCES car_model(id),date DATE);""")if view_num == 0:conn.commit()cur.close()returnfor sql in materialized_view_sqls[:view_num]:cur.execute(sql)conn.commit()cur.close()def get_conn():return psycopg2.connect(dbname="your_database_name",user="your_username",password="your_password",host="your_host",port="your_port")def thread_insert_sale_data(car_models, num_sales):conn = get_conn()# conn.autocommit = Truecur = conn.cursor()sales = []for sale in generate_sale_data(car_models, num_sales):sales.append((sale['id'], sale['car_model'], sale['date']))cur.executemany("""INSERT INTO sale (id, car_model, date)VALUES (%s, %s, %s);""", sales)conn.commit()cur.close()conn.close()if __name__ == '__main__':total_times = []thread_num = 1# 进行四次测试,分别测试物化视图数量为 0, 1, 2, 3 的情况下,插入数据的性能for i in range(4):print(f"Testing with {i} materialized view(s)...")conn = get_conn()insert_time = 0recreate_table(conn, i)# 生成并插入汽车型号数据print("Generating and insert car model data...")cur = conn.cursor()car_models = list(generate_car_model_data(num_car_models))for car_model in car_models:cur.execute("""INSERT INTO car_model (id, brand, model, color, year)VALUES (%s, %s, %s, %s, %s);""", (car_model['id'], car_model['brand'], car_model['model'], car_model['color'], car_model['year']))conn.commit()cur.close()conn.close()# 生成并插入销售数据thread_pool = []print("Generating and insert sale data...")now = datetime.now()for _ in range(thread_num):thread_worker = threading.Thread(target=thread_insert_sale_data, args=(car_models, num_sales//thread_num))thread_worker.start()thread_pool.append(thread_worker)for t in thread_pool:t.join()insert_time = (datetime.now() - now).total_seconds()total_times.append(insert_time)# 打印测试结果for i, time in enumerate(total_times):print(f"Insert time with {i} materialized view(s): {time:.2f} seconds")
测试仅用于探索随着增量物化视图数量的增加,对插入性能的影响比例。测试结果如下,随着增量物化视图数量的增加,插入性能有所降低,降低多少插入性能和增量物化视图的复杂度呈正相关:
| Data Row | Materialized View Num | Insert Time/s |
|---|---|---|
| 1000 | 0 | 5.46 |
| 1000 | 1 | 15.97 |
| 1000 | 2 | 25.86 |
| 1000 | 3 | 35.47 |