The Apache Kylin community is pleased to announce the release of Apache Kylin v2.5.0.
Apache Kylin is an open source Distributed Analytics Engine designed to provide SQL interface and multi-dimensional analysis (OLAP) on Big Data supporting extremely large datasets.
This is a major release after 2.4.0, including many enhancements. All of the changes can be found in the release notes. Here just highlight the major ones:
The all-in-Spark cubing engine
Now Kylin’s Spark engine will run all distributed jobs in Spark, including fetch distinct dimension values, converting cuboid files to HBase HFile, merging segments, merging dictionaries, etc. The default configurations are tuned so the user can get an out-of-box experience. The overall performance with the previous version is close, but we assume Spark has more room to improve. The related tasks are KYLIN-3427, KYLIN-3441, KYLIN-3442.
There are also improvements in the job management. Now you can get the job link on the web console once Spark starts to run. If you discard the job, Kylin will kill the Spark job to release the resource in time. If Kylin is restarted, it can resume from the previous job instead of resubmitting a new job.
MySQL as Kylin metastore
In the past, HBase is the only option for Kylin metadata. In some cases, this is not applicable, for example using replicated HBase cluster for Kylin’s HA (the replicated HBase is read only). Now we introduce the MySQL metastore to fulfill such need. This function is in beta now. Check KYLIN-3488 for more.
Hybrid model web GUI
Hybrid is an advanced model for compositing multiple Cubes. It can be used for the Cube schema change issue. This function had no GUI in the past so only a small portion of Kylin users know it. Now we added the web GUI for it so everyone can try it.
Enable Cube planner by default
The Cube planner can greatly optimize the cube structure, save the computing/storage resources and improve the query performance. It was introduced in v2.3 but is disabled by default. In order to let more users seeing and trying it, we enable it by default in v2.5. The algorithm will automatically optimize the cube by your data statistics on the first build.
Advanced segment pruning
Segment (partition) pruning can efficiently reduce the disk and network I/O, so to greatly improve the query performance. In the past, Kylin only prunes segments by the partition column’s value. If the query doesn’t have the partition column as the filtering condition, the pruning won’t work, all segments will be scanned.
Now from v2.5, Kylin will record the min/max value for EVERY dimension at the segment level. Before scanning a segment, it will compare the query’s conditions with the min/max index. If not matched, the segment will be skipped. Check KYLIN-3370 for more.
Merge dictionary on YARN
When segments get merged, their dictionaries also need to be merged. In the past, the merging happens in Kylin’s JVM, which takes a lot of memory and CPU resources. In extreme case (if you have a couple of concurrent jobs) it may crash the Kylin process. Since this, some users have to allocate much more memory to Kylin job node or run multiple job nodes to balance the workload.
Now from v2.5, Kylin will submit this task to Hadoop MR or Spark, so this bottleneck can be solved. Check KYLIN-3471 for more.
Improve building performance for reading Global Dictionary
Global Dictionary is a must for bitmap count distinct. The GD can be very large if the column has a very high cardinality. In the cube building phase, Kylin need to translate the non-integer values to integers by the GD. Although the GD has been split into several slices, the values are often scrambled. Kylin needs swap in/out the slices into memory repeatedly, which causes the building slowly.
The enhancement introduces a new step to build a shrunken dictionary for each data block. Then each task only loads the shrunken dictionary, which is quite small, so there is no swap in/out any more in the cubing step. Then the performance can be 3x faster than before. Check KYLIN-3491 for more.
Improved cube size estimation for TOPN, COUNT DISTINCT
Cube size estimation is used in several steps, such as decides the MR/Spark job partition number, calculates the HBase region number etc. It will affect the build performance much. The estimation can be wild when there is COUNT DISTINCT, TOPN measures because their size is flexible. The incorrect estimation may cause too many data partitions and then too many tasks. In the past, users need to tune several parameters to make the size estimation more close to real size, that is hard to do.
Now Kylin will correct the size estimation automatically based on the collected data statistics. This can make the estimation much closer with the real size than before. Check KYLIN-3453 for more.
Hadoop 3.0/HBase 2.0 support
Hadoop 3 and HBase 2 starts to be adopted by many users. Now we provide new binary packages compiled with the new Hadoop and HBase API. We tested them on Hortonworks HDP 3.0 and Cloudera CDH 6.0.
To download Apache Kylin v2.5.0 source code or binary package, visit the download page.
Follow the upgrade guide.
If you face issue or question, please send mail to Apache Kylin dev or user mailing list: firstname.lastname@example.org , email@example.com; Before sending, please make sure you have subscribed the mailing list by dropping an email to firstname.lastname@example.org or email@example.com.
Great thanks to everyone who contributed!