Metrics Monitoring

By default, the system collects metric data every minute, covering storage, queries, jobs, metadata, and cleanup mechanisms. The monitoring data is stored in the specified InfluxDB and displayed through Grafana, which helps administrators understand the health of the system and take necessary action.

Note: Grafana depends on InfluxDB. Before using Grafana, make sure that InfluxDB is correctly configured and started according to Use InfluxDB as Time-Series Database, and that Grafana has been downloaded.

# Download Grafana
$KYLIN_HOME/sbin/download-grafana.sh

Note: Grafana is installed in the grafana directory under the Kylin installation directory.

Grafana

Working Directory: $KYLIN_HOME/grafana
Configuration Directory: $KYLIN_HOME/grafana/conf
Start Grafana Command: $KYLIN_HOME/bin/grafana.sh start
Stop Grafana Command: $KYLIN_HOME/bin/grafana.sh stop

To change the Grafana configuration, please refer to Configuration.

After Grafana starts successfully, you can access it through a web browser. The default port is 3000, the default username is admin, and the default password is admin.
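To confirm that Grafana is reachable after startup, one option is to call its health endpoint. The following is a minimal sketch, assuming Grafana listens on localhost with the default port 3000 mentioned above; the host name and the helper names `grafana_health_url` and `check_grafana` are illustrative and not part of Kylin.

```python
import json
import urllib.request


def grafana_health_url(host: str = "localhost", port: int = 3000) -> str:
    """Build the URL of Grafana's health endpoint (host is an assumption)."""
    return f"http://{host}:{port}/api/health"


def check_grafana(host: str = "localhost", port: int = 3000, timeout: float = 5.0) -> bool:
    """Return True if Grafana answers and reports a healthy database."""
    try:
        with urllib.request.urlopen(grafana_health_url(host, port), timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON answer:
        # treat all of these as "not healthy".
        return False
    return isinstance(body, dict) and body.get("database") == "ok"
```

Calling `check_grafana()` on the Kylin node returns `False` when Grafana is not running, which may be useful in a startup script before opening the dashboard.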

[Figure: metrics dashboard]

Dashboard

Default Dashboard: Kylin 5.0

The dashboard consists of the following modules: Cluster, Summaries, Models, Queries, Favorites, Jobs, Cleanings, Metadata Operations, and Transactions. Of these, the Summaries module is expanded by default. For details about the modules, please refer to Metrics Explanation. To customize the dashboard, please refer to Provisioning Grafana in the official Grafana manual.

Panel

Each indicator monitor corresponds to a specific panel.

Time Range

Choose the time range in the upper right corner of the dashboard. The time range is the time interval over which the indicators are observed.

Data Granularity

The data granularity selector is located in the upper left corner of the dashboard. Available values: auto, 1m, 5m, 10m, 30m, 1h, 6h, 12h, 1d, 7d, 14d, 30d. With 'auto', the granularity is adjusted automatically according to the time range; for example, a time range of 30min corresponds to a granularity of 5min, and a time range of 24h corresponds to a granularity of 4h.
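The interplay between time range and granularity comes down to simple arithmetic: the range divided by the granularity gives the number of data points drawn per series. The sketch below illustrates this for the fixed granularities listed above. It is not Grafana's 'auto' logic, and the helper names `to_seconds` and `points_per_series` are hypothetical.

```python
import re

# Seconds per unit for durations written like "30m", "6h" or "7d".
_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def to_seconds(duration: str) -> int:
    """Convert a duration such as '5m', '12h' or '30d' to seconds."""
    match = re.fullmatch(r"(\d+)([mhd])", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def points_per_series(time_range: str, granularity: str) -> int:
    """Number of data points one series shows for a range at a fixed granularity."""
    return to_seconds(time_range) // to_seconds(granularity)
```

For instance, `points_per_series("24h", "1h")` yields 24 points, while the same range at `1m` yields 1440, which is why coarser granularities suit longer time ranges.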

Metrics Explanation

Cluster: Cluster overview
Summaries: Global overview
Models: Model related metrics
Queries: Query related metrics
Favorites: Favorite Query related metrics
Jobs: Job related metrics
Cleanings: Cleanup mechanisms related metrics
Metadata Operations: Metadata operations related metrics
Transactions: Transaction mechanisms related metrics

Tip: In the following tables, "Project related" indicates whether the metric is tied to a project: "Y" means it is, and "N" means it is not. "Host related" indicates whether the metric is tied to Kylin nodes: "Y" means it is, and "N" means it is not. "all", "job", and "query" are the server modes of Kylin nodes.

Cluster: Cluster overview

Name	Meaning	Project related
build_unavailable_duration	Duration for which building is unavailable	N
query_unavailable_duration	Duration for which querying is unavailable	N

Summaries: Global overview

Name	Meaning	Project related	Host related	Remark
summary_exec_total_times	Total times all indicators were collected	N	Y(all, job, query)	The cost of collecting indicators
summary_exec_total_duration	Total duration of collecting all indicators	N	Y(all, job, query)	The cost of collecting indicators
num_of_projects	Total project number	N	N	-
storage_size_gauge	Storage used of the system	Y	N	-
num_of_users	Total user number	N	N	-
num_of_hive_tables	Total data table number	Y	N	-
num_of_hive_databases	Total database number	Y	N	-
summary_of_heap	The heap size of Kylin	N	Y(all, job, query)	-
usage_of_heap	The heap usage ratio of Kylin	N	Y(all, job, query)	-
count_of_garbage_collection	The count of garbage collection	N	Y(all, job, query)	-
time_of_garbage_collection	The total time of garbage collection	N	Y(all, job, query)	-
garbage_size_gauge	Storage used of garbage	Y	N	Refer to the definition of "Garbage"
sparder_restart_total_times	"Sparder" restart times	N	Y(all, job, query)	"Sparder" is the internal query engine
query_load	Spark SQL load	N	Y(all, query)	-
cpu_cores	The number of CPU cores for query configured in kylin.properties	N	Y(all, query)	Refer to "Spark-related Configuration"

Models: Model related metrics

Name	Meaning	Project related	Host related
model_num_gauge	"Model number" curve with time	Y	N
non_broken_model_num_gauge	"Healthy model number" curve with time	Y	N
last_query_time_of_models	The last query time of models	Y	N
hit_count_of_models	The query hit count of models	Y	N
storage_of_models	The storage of models	Y	N
segments_num_of_models	The num of segments of models	Y	N
model_build_duration	Total build time of models	Y	N
model_wait_duration	Total wait time of models	Y	N
number_of_indexes	The number of indexes of models	Y	N
expansion_rate_of_models	Expansion rate of models	Y	N
model_build_duration (avg)	Avg build time of models	Y	N

Queries: Query related metrics

Name	Meaning	Project related	Host related	Remark
count_of_queries	Total count of queries	Y	Y(all, query)	-
num_of_query_per_host	The num of query per host	N	Y(all, query)	-
count_of_queries_hitting_agg_index	The count of queries hitting agg index	Y	Y(all, query)	-
ratio_of_queries_hitting_agg_index	The ratio of queries hitting agg index	Y	Y(all, query)	-
count_of_queries_hitting_table_index	The count of queries hitting table index	Y	Y(all, query)	-
ratio_of_queries_hitting_table_index	The ratio of queries hitting table index	Y	Y(all, query)	-
count_of_pushdown_queries	The count of pushdown queries	Y	Y(all, query)	-
ratio_of_pushdown_queries	The ratio of pushdown queries	Y	Y(all, query)	-
count_of_queries_hitting_cache	The count of queries hitting cache	Y	Y(all, query)	-
ratio_of_queries_hitting_cache	The ratio of queries hitting cache	Y	Y(all, query)	-
count_of_queries_less_than_1s	Total count of queries when duration is less than 1 second	Y	Y(all, query)	-
ratio_of_queries_less_than_1s	The ratio of queries when duration is less than 1 second	Y	Y(all, query)	-
count_of_queries_between_1s_and_3s	Total count of queries when duration is between 1 second and 3 seconds	Y	Y(all, query)	-
ratio_of_queries_between_1s_and_3s	The ratio of queries when duration is between 1 second and 3 seconds	Y	Y(all, query)	-
count_of_queries_between_3s_and_5s	Total count of queries when duration is between 3 seconds and 5 seconds	Y	Y(all, query)	-
ratio_of_queries_between_3s_and_5s	The ratio of queries when duration is between 3 seconds and 5 seconds	Y	Y(all, query)	-
count_of_queries_between_5s_and_10s	Total count of queries when duration is between 5 seconds and 10 seconds	Y	Y(all, query)	-
ratio_of_queries_between_5s_and_10s	The ratio of queries when duration is between 5 seconds and 10 seconds	Y	Y(all, query)	-
count_of_queries_greater_than_10s	Total count of queries when duration exceeds 10 seconds	Y	Y(all, query)	-
ratio_of_queries_greater_than_10s	The ratio of queries when duration exceeds 10 seconds	Y	Y(all, query)	-
count_of_timeout_queries	The count of timeout queries	Y	Y(all, query)	-
count_of_failed_queries	The count of failed queries	Y	Y(all, query)	-
mean_time_of_query_per_host	The mean time of queries per host	N	Y(all, query)	-
99%_of_query_latency	Query duration 99-percentile	Y	Y(all, query)	-
gt10s_query_rate_5-minute	Queries with duration exceeding 10s per second over 5 minutes	Y	Y(all, query)	-
failed_query_rate_5-minute	Failed queries per second over 5 minutes	Y	Y(all, query)	-
pushdown_query_rate_5-minute	Pushdown queries per second over 5 minutes	Y	Y(all, query)	-
scan_bytes_of_99%_queries	Query scan bytes 99-percentile	Y	Y(all, query)	-
query_scan_bytes_of_host	Query scan bytes per host	N	Y(all, query)	-
mean_scan_bytes_of_queries	The mean scan bytes of queries	Y	Y(all, query)	-
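The count and ratio metrics for query durations in the table above can be read as a histogram over fixed buckets. The following sketch illustrates that reading; it is not Kylin's implementation. Which bucket a boundary value such as exactly 3 seconds falls into is an assumption here, and the helper names `bucket_of` and `bucket_stats` are hypothetical.

```python
from collections import Counter

# Upper bounds in seconds and the bucket suffix used in the metric names above.
_BUCKETS = [
    (1.0, "less_than_1s"),
    (3.0, "between_1s_and_3s"),
    (5.0, "between_3s_and_5s"),
    (10.0, "between_5s_and_10s"),
]


def bucket_of(duration_s: float) -> str:
    """Return the bucket suffix for one query duration.

    Assumption: each bucket includes its lower bound and excludes its upper bound.
    """
    for upper, name in _BUCKETS:
        if duration_s < upper:
            return name
    return "greater_than_10s"


def bucket_stats(durations_s):
    """Count queries per bucket and compute each bucket's share of all queries."""
    counts = Counter(bucket_of(d) for d in durations_s)
    total = sum(counts.values())
    return {
        name: {"count": n, "ratio": n / total}
        for name, n in counts.items()
    }
```

Under this reading, each ratio is the bucket's count divided by the total number of queries in the same window, so the ratios of all buckets add up to 1.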

Favorites: Favorite Query related metrics

Name	Meaning	Project related	Host related	Remark
fq_accepted_total_times	Favorite Query user submitted total times	Y	Y(all, job, query)	-
fq_proposed_total_times	Favorite Query system triggered total times	Y	N	-
fq_proposed_total_duration	Favorite Query system triggered total duration	Y	N	-
failed_fq_proposed_total_times	Favorite Query system triggered failed total times	Y	N	Refer to the definition of "pushdown"
fq_adjusted_total_times	Favorite Query system adjusted total times	Y	Y(all, job, query)	-
fq_adjusted_total_duration	Favorite Query system adjusted total duration	Y	Y(all, job, query)	-
fq_update_usage_total_times	Favorite Query usage updated total times	Y	N	-
fq_update_usage_total_duration	Favorite Query usage updated total duration	Y	N	-
failed_fq_update_usage_total_times	Favorite Query usage updated failed total times	Y	N	-
fq_tobeaccelerated_num_gauge	Favorite Query to be accelerated	Y	N	-
fq_accelerated_num_gauge	Favorite Query accelerated	Y	N	-
fq_failed_num_gauge	Favorite Query accelerated failed times	Y	N	-
fq_accelerating_num_gauge	Favorite Query accelerating	Y	N	-
fq_pending_num_gauge	Favorite Query pending	Y	N	Favorite Query lacks necessary conditions, such as missing column names, and requires user intervention
fq_blacklist_num_gauge	Favorite Query in blacklist	Y	N	Refer to the definition of "Blacklist"

Jobs: Job related metrics

Name	Meaning	Project related	Host related
num_of_jobs_created	Jobs created total number	Y	Y(all, job)
num_of_jobs_finished	Jobs finished total number	Y	Y(all, job)
num_of_running_jobs	The num of running jobs currently	Y	N
num_of_pending_jobs	The num of pending jobs currently	Y	N
num_of_error_jobs	The num of error jobs currently	Y	N
count_of_error_jobs	The total count of error jobs	Y	Y(all, job)
finished_jobs_total_duration	Jobs finished total duration	Y	Y(all, job)
job_duration_99p	Jobs duration 99-percentile	Y	Y(all, job)
job_step_attempted_total_times	Jobs step attempted total times	Y	Y(all, job)
failed_job_step_attempted_total_times	Jobs step attempted failed total times	Y	Y(all, job)
job_resumed_total_times	Jobs resumed total times	Y	Y(all, job)
job_discarded_total_times	Jobs discarded total times	Y	Y(all, job)
job_duration	The build duration of job	Y	Y(all, job)
job_wait_duration	The wait duration of job	Y	Y(all, job)
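Several metrics in these tables are 99-percentiles (for example job_duration_99p). The document does not state how the percentile is computed, so the sketch below uses the nearest-rank method purely as an illustration of what a 99-percentile means; the function name `percentile_nearest_rank` is hypothetical.

```python
import math


def percentile_nearest_rank(values, pct: float) -> float:
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

With this definition, a 99-percentile of 40 seconds means that at least 99% of the observed durations were 40 seconds or less, which makes the metric less sensitive to a single outlier than the maximum.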

Cleanings: Cleanup mechanisms related metrics

Name	Meaning	Project related	Host related
storage_clean_total_times	Storage cleanup total times	N	Y(all, job, query)
storage_clean_total_duration	Storage cleanup total duration	N	Y(all, job, query)
failed_storage_clean_total_times	Storage cleanup failed total times	N	Y(all, job, query)

Metadata Operations: Metadata operations related metrics

Name	Meaning	Project related	Host related	Remark
metadata_clean_total_times	Metadata cleanup total times	Y	Y(all, job, query)	-
metadata_backup_total_times	Metadata backup total times	Y	Y(all, job, query)	Differentiate projects and global
metadata_backup_total_duration	Metadata backup total duration	Y	Y(all, job, query)	Differentiate projects and global
failed_metadata_backup_total_times	Metadata backup failed total times	Y	Y(all, job, query)	Differentiate projects and global
metadata_ops_total_times	Metadata daily operations total times	N	Y(all, job, query)	Fixed time per day (configurable): automatically backup metadata; rotate audit_log; cleanup metadata and storage space; adjust FQ; cleanup query histories.
metadata_success_ops_total_times	Metadata daily operations succeeded total times	N	Y(all, job, query)	-

Transactions: Transaction mechanisms related metrics

Name	Meaning	Project related	Host related	Remark
transaction_retry_total_times	Transactions retried total times	Y	Y(all, job, query)	Differentiate projects and global
transaction_latency_99p	Transactions duration 99-percentile	Y	Y(all, job, query)	Differentiate projects and global
