FAQ
Here are some tips for when you encounter problems with Kylin:
1. Use search engines (Google / Baidu), Kylin's Mailing List Archives, and the Kylin Project on the Apache JIRA to look for a solution.
2. Browse Kylin's official website, especially the Docs page and the FAQ page.
3. Send an email to the Apache Kylin dev or user mailing list: dev@kylin.apache.org, user@kylin.apache.org. Before sending, make sure you have subscribed to the mailing list by sending an email to dev-subscribe@kylin.apache.org or user-subscribe@kylin.apache.org. Your email should include: the version numbers of Kylin and the other components in your environment, the log containing the error message, and the SQL (if a query failed). There is an article about how to ask a question in a smart way.
Is Kylin a generic SQL engine for big data?
- No, Kylin is an OLAP engine with a SQL interface. SQL queries must match the pre-defined OLAP model.
What's a typical scenario for using Apache Kylin?
- Kylin can be the best option if you have a huge table (e.g., >100 million rows) joined with lookup tables, queries need to finish within seconds (dashboards, interactive reports, business intelligence, etc.), and there are dozens or hundreds of concurrent users.
How large a data scale can Kylin support? How about the performance?
- Kylin can answer queries within seconds on datasets at the TB to PB scale. This has been verified by users like eBay, Meituan, and Toutiao. Take Meituan's case as an example (as of 2018-08): 973 cubes, 3.8 million queries per day, raw data 8.9 trillion, total cube size 971 TB (original data is bigger), 50% of the queries finished in < 0.5 seconds, and 90% of the queries in < 1.2 seconds.
Who is using Apache Kylin?
- You can find a list on Kylin's Powered By page. If you want to be added, please email dev@kylin.apache.org with your use case.
What's the expansion rate of a cube (compared with raw data)?
- It depends on several factors, for example, the number of dimensions and measures, dimension cardinality, the number of cuboids, and the compression algorithm. You can optimize the cube in many ways to control its size.
How does Kylin compare with other SQL engines like Hive, Presto, Spark SQL, and Impala?
- They answer a query in different ways. Kylin is not a replacement for them, but a supplement (query accelerator). Many users run Kylin together with other SQL engines. For frequently used query patterns, building cubes can greatly improve performance and also offload cluster workloads. For less frequent query patterns or ad-hoc queries, other MPP engines are more flexible.
How does Kylin compare with Druid?
- Druid is more suitable for real-time analysis. Kylin is more focused on OLAP cases. Druid integrates well with Kafka for real-time streaming; Kylin fetches data from Hive or Kafka in batches. The real-time capability of Kylin is still under development.
- Many internet service providers host both Druid and Kylin, serving different purposes (real-time and historical).
- Some other highlights of Kylin: support for star and snowflake schemas, ANSI SQL support, and JDBC/ODBC for BI integration. Kylin also has a Web GUI with LDAP/SSO user authentication.
- For more information, please do a search or check this mail thread.
How to get started quickly with Kylin?
- To get started quickly, you can run Kylin in a Hadoop sandbox VM or in the cloud. For example, start a small AWS EMR or Azure HDInsight cluster and then install Kylin on one of the nodes.
How many Hadoop nodes are needed to run Kylin?
- Kylin can run on a Hadoop cluster of anywhere from a couple of nodes to thousands of nodes, depending on how much data you have. The architecture is horizontally scalable.
- Because most of the computation happens in Hadoop (MapReduce/Spark/HBase), you usually only need to install Kylin on a couple of nodes.
How many dimensions can be in a cube?
- The maximum number of physical dimensions (excluding derived columns in lookup tables) in a cube is 63. If you can normalize some dimensions into lookup tables and use derived dimensions, you can create a cube with more than 100 dimensions.
- However, a cube with > 30 physical dimensions is not recommended. You may not even be able to save such a cube in Kylin unless you optimize the aggregation groups. Please search for "curse of dimensionality".
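To see why a large number of physical dimensions is discouraged, here is a minimal sketch that counts cuboids. It assumes a cube with no aggregation-group optimization, where every combination of dimensions is a cuboid, giving 2^n cuboids for n dimensions. The function name `full_cuboid_count` is hypothetical and not part of Kylin.

```python
def full_cuboid_count(num_dimensions: int) -> int:
    """Number of dimension combinations (cuboids) in an unpruned cube: 2**n."""
    if num_dimensions < 0:
        raise ValueError("num_dimensions must be non-negative")
    return 2 ** num_dimensions


if __name__ == "__main__":
    # Each additional dimension doubles the number of cuboids.
    for n in (10, 20, 30):
        print(f"{n} dimensions -> {full_cuboid_count(n):,} cuboids")
```

Under this assumption, 30 dimensions already yield more than a billion cuboids, which is why pruning the combinations with aggregation groups matters.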
Why do I get an error when running a "select *" query?
- The cube contains only aggregated data, so all your queries should be aggregated queries ("GROUP BY"). You can write a query that groups by all dimensions to get a result as close as possible to the detailed data, but that is still not the raw data.
- To allow connections from some BI tools, Kylin tries to answer "select *" queries, but please be aware that the result might not be what you expect. Please make sure each query sent to Kylin is aggregated.
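The difference between a fully grouped result and raw data can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module purely as a stand-in SQL engine, not Kylin itself, and the `sales` table, its columns, and the function name `raw_vs_grouped` are made up for illustration. Grouping by every dimension collapses rows that share the same dimension values, so the individual rows cannot be recovered.

```python
import sqlite3


def raw_vs_grouped():
    """Return (raw rows, rows grouped by all dimensions) for a toy table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [
            ("east", "book", 10.0),
            ("east", "book", 12.0),  # same dimension values as the row above
            ("west", "pen", 2.0),
        ],
    )
    # Raw detail rows: this is what a cube of aggregated data does not store.
    raw = conn.execute("SELECT * FROM sales").fetchall()
    # Aggregated query grouped by all dimensions (region, product).
    grouped = conn.execute(
        "SELECT region, product, SUM(price), COUNT(*) "
        "FROM sales GROUP BY region, product ORDER BY region, product"
    ).fetchall()
    conn.close()
    return raw, grouped


if __name__ == "__main__":
    raw, grouped = raw_vs_grouped()
    print(len(raw), "raw rows;", len(grouped), "grouped rows")
```

The two "east"/"book" rows are merged into a single grouped row, so the grouped result is the closest available view of the detail but not the detail itself.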