Apache Kylin™ Technical Blog

RAW measure in Apache Kylin

Introduction

RAW measure function is use to query the detail data on the measure column in Kylin.

Example data:

DT SITE_ID SELLER_ID ITEM_COUNT
2016-05-01 0 SELLER-001 100
2016-05-01 0 SELLER-002 200
2016-05-02 1 SELLER-003 300
2016-05-02 1 SELLER-004 400
2016-05-03 2 SELLER-005 500

We design the cube desc is the DT,SITE_ID columns as dimensions, and SUM(ITEM_COUNT) as measure. So, the base cuboid data will like this:

Rowkey of base cuboid SUM(ITEM_COUNT)
2016-05-01_0 300
2016-05-02_1 700
2016-05-03_2 500

For the first row in the base cuboid data, Kylin can extract the dimension column values 2016-05-01,0 from the HBase Rowkey, and in the measure cell will store the measure function’s aggregated results 300, we can’t get the raw value 100 and 200 which before the aggregation on the ITEM_COUNT column.

The RAW function is use to make the SQL:

SELECT DT,SITE_ID,ITEM_COUNT FROM FACT_TABLE

to return the correct result:

DT SITE_ID ITEM_COUNT
2016-05-01 0 100
2016-05-01 0 200
2016-05-02 1 300
2016-05-02 1 400
2016-05-03 2 500

How to use

  • Choose the Kylin version 1.5.1+.
  • Like the above case, we can make the DT,SITE_ID as dimensions, and RAW(ITEM_COUNT)as measure.
  • After the cube build, you can use the SQL to query the raw data:
SELECT DT,SITE_ID,ITEM_COUNT FROM FACT_TABLE WHERE SITE_ID = 0

Optimize

The column which define RAW measure will be encoded with dictionary by default. So, you must know you data’s cardinality and distribution characteristics.

  • As far as possible to define the value uniform distribution column to dimensions, this will make the measure cell value size more uniform and avoid data skew.
  • If choose the ultra high cardinality column to define RAW measure, you can try the following to avoid the dictionary build error:
    1. Cut a big segment into several segments, if you were trying to build a large data set at once;
    2. Set kylin.dictionary.max.cardinality in conf/kylin.properties to a bigger value (default is 5000000).

To be improved

  • Now, the maximum storage 1M values of RAW measure in one cuboid. If exceed 1M values, it will throw BufferOverflowException in the cube build. This will be optimized in the later release.
  • Only dimension column can use in WHERE condition, RAW measure column is not support.

Implement

  • Custom one aggregation function RAW implement, the function’s return type depends on the column type.
  • Make the RAW aggregation function to save the column raw data in the base cuboid data.
  • The HBase value cell will store the dictionary id of the raw data to save space.
  • The SQL which contains the RAW measure column will be routed to the base cuboid query.
  • Extract the raw data from base cuboid data with dimension values to assemble into a complete row when query.