0%

使用Clickhouse的materialized view实现By Time Alignment的预聚合

常规的视图(View)可以看作一个预先保存好的查询,而Clickhouse的物化视图(materialized view,MV)功能类似于数据的插入触发器,会在来源数据插入的时候按照定义更新MV对应的数据

场景:统计平台不同时间段(分钟、小时、6小时、天、周、月、年)的下载量。原始数据:(when,platform,count)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#建表
CREATE TABLE download
(
when DateTime,
platform UInt32,
cnt UInt32
) ENGINE = MergeTree
PARTITION BY toYYYYMM(when)
ORDER BY (platform, when);

#生成数据
INSERT INTO download
SELECT
now() + number * 60 as when,
rand() % 10,
rand() % 100000000
FROM system.numbers
LIMIT 50000

查看数据

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
#按天
SELECT
toStartOfDay(when) AS day,
platform,
sum(cnt) AS cnt
FROM download
GROUP BY platform, day
order by day asc,platform
#按周
SELECT
toStartOfWeek(when) AS week,
platform,
sum(cnt) AS cnt
FROM download
GROUP BY platform, week
order by week asc,platform

按天的查询可以建立为MV,其中的POPULATE表示从已有的数据中生成,MV创建完成后原始数据表中的新增数据将自动按照GROUP BY条件更新到MV中。
需要注意MV在创建到数据导入完成会存在一部分时间,实际使用中一般不指定POPULATE,而在MV创建完成后再分段的注入数据,保证数据一致性。

1
2
3
4
5
6
7
8
9
10
11
12
CREATE MATERIALIZED VIEW downloads_by_day
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(day)
ORDER BY (platform, day) POPULATE AS
SELECT
toStartOfDay(when) AS day,
platform,
sum(cnt) AS cnt
FROM download
GROUP BY
platform,
day;

同样的,我们可以按照分析需求建立更多层级的MV,也可以在MV中选择不同的聚合函数。结合输入端给到的时间范围以及展示上最多需要的点数,调整需要查询MV。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
public enum TimeAlign
{
By1Min = 1,
By5Min = 2,
By10Min = 3,
By15Min = 4,
By30Min = 5,
By1Hour = 6,
By6Hour = 7,
By12Hour = 8,
By1Day = 9,
By1Week = 10,
By1Month = 11
}