Metrics Monitoring

By default, the system collects metric data every minute, covering storage, queries, jobs, metadata, and the cleanup mechanism. The monitoring data is stored in the specified InfluxDB instance and displayed through Grafana, which helps administrators understand the health of the system and take action when necessary.

Note: Grafana depends on InfluxDB. Before you use Grafana, make sure that InfluxDB is correctly configured and started according to Use InfluxDB as Time-Series Database, and then download Grafana.

# Download Grafana
$KYLIN_HOME/sbin/download-grafana.sh

Note: Grafana is installed in the grafana directory under the Kylin installation directory.

Grafana

  1. Working Directory: $KYLIN_HOME/grafana
  2. Configuration Directory: $KYLIN_HOME/grafana/conf
  3. Start Grafana Command: $KYLIN_HOME/bin/grafana.sh start
  4. Stop Grafana Command: $KYLIN_HOME/bin/grafana.sh stop

To change the Grafana configuration, please refer to Configuration.

After Grafana starts successfully, you can access it through a web browser on the default port 3000, with username admin and password admin.
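
To confirm from a script that Grafana is reachable after startup, one option is to poll Grafana's `/api/health` endpoint. The following is a minimal sketch in Python: the host `localhost` and the helper names `grafana_health_url` and `grafana_is_healthy` are assumptions for illustration, and the port is the default mentioned above.

```python
import json
import urllib.request


def grafana_health_url(host: str = "localhost", port: int = 3000) -> str:
    """Build the URL of Grafana's health endpoint (default port 3000)."""
    return f"http://{host}:{port}/api/health"


def grafana_is_healthy(host: str = "localhost", port: int = 3000,
                       timeout: float = 5.0) -> bool:
    """Return True if Grafana answers and reports its database as "ok"."""
    try:
        url = grafana_health_url(host, port)
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.load(resp)
    except (OSError, ValueError):
        # Not reachable, timed out, or the response was not valid JSON.
        return False
    return isinstance(body, dict) and body.get("database") == "ok"
```

Calling `grafana_is_healthy()` on the Kylin node after running the start command should return `True` once Grafana is up; adjust the host and port if the configuration was changed.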

Dashboard

Default Dashboard: Kylin 5.0

The dashboard consists of nine modules: Cluster, Summaries, Models, Queries, Favorites, Jobs, Cleanings, Metadata Operations, and Transactions. The Summaries module is displayed in detail by default. For details about each module, please refer to Metrics Explanation. To customize the dashboard, please refer to Provisioning Grafana in the official Grafana manual.

Panel

Each monitored indicator corresponds to a specific panel.

Time Range

Choose the time range in the upper right corner of the dashboard. The time range is the time interval over which the indicators are observed.

Data Granularity

The data granularity selector is located in the upper left corner of the dashboard. Available values: auto, 1m, 5m, 10m, 30m, 1h, 6h, 12h, 1d, 7d, 14d, 30d. With 'auto', the granularity is adjusted automatically according to the time range; for example, a time range of 30min corresponds to a granularity of 5min, and a time range of 24h corresponds to a granularity of 4h.
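
The two 'auto' examples above can be checked with a little arithmetic. The sketch below converts labels such as `30m` or `24h` to seconds and counts how many data points a time range holds at a given granularity. The helper names are hypothetical, and the snippet does not reproduce the dashboard's actual 'auto' rule, which is not specified here.

```python
# Seconds per unit for labels in the style of the granularity list (1m, 1h, 1d).
UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def to_seconds(label: str) -> int:
    """Convert a label such as '5m', '6h' or '7d' to seconds."""
    if len(label) < 2:
        raise ValueError(f"unsupported label: {label!r}")
    value, unit = label[:-1], label[-1]
    if unit not in UNIT_SECONDS or not value.isdigit():
        raise ValueError(f"unsupported label: {label!r}")
    return int(value) * UNIT_SECONDS[unit]


def points_in_range(time_range: str, granularity: str) -> int:
    """Number of whole granularity intervals that fit in the time range."""
    return to_seconds(time_range) // to_seconds(granularity)


# The two documented examples both work out to six points per series.
print(points_in_range("30m", "5m"))  # 6
print(points_in_range("24h", "4h"))  # 6
```

For a fixed granularity, the same helper gives a rough idea of how dense a panel will be, for example `points_in_range("7d", "1h")` for a one-week view at hourly granularity.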

Metrics Explanation

Tip: In the following tables, "Project related" indicates whether the metric is related to a project: "Y" means it is, and "N" means it is not. "Host related" indicates whether the metric is related to Kylin nodes: "Y" means it is, and "N" means it is not. The values "all", "job", and "query" shown in parentheses are the server modes of Kylin nodes.

Cluster: Cluster overview

| Name | Meaning | Project related |
| --- | --- | --- |
| build_unavailable_duration | the unavailable time of building | N |
| query_unavailable_duration | the unavailable time of query | N |

Summaries: Global overview

| Name | Meaning | Project related | Host related | Remark |
| --- | --- | --- | --- | --- |
| summary_exec_total_times | Times of all indicators collected | N | Y(all, job, query) | The cost of collecting indicators |
| summary_exec_total_duration | Duration of all indicators collected | N | Y(all, job, query) | The cost of collecting indicators |
| num_of_projects | Total project number | N | N | - |
| storage_size_gauge | Storage used of the system | Y | N | - |
| num_of_users | Total user number | N | N | - |
| num_of_hive_tables | Total data table number | Y | N | - |
| num_of_hive_databases | Total database number | Y | N | - |
| summary_of_heap | The heap size of Kylin | N | Y(all, job, query) | - |
| usage_of_heap | The ratio of heap of Kylin | N | Y(all, job, query) | - |
| count_of_garbage_collection | The count of garbage collection | N | Y(all, job, query) | - |
| time_of_garbage_collection | The total time of garbage collection | N | Y(all, job, query) | - |
| garbage_size_gauge | Storage used of garbage | Y | N | Refer to the definition of "Garbage" |
| sparder_restart_total_times | "Sparder" restart times | N | Y(all, job, query) | "Sparder" is the internal query engine |
| query_load | spark sql load | N | Y(all, query) | - |
| cpu_cores | The number of CPU cores for query configured in kylin.properties | N | Y(all, query) | Refer to "Spark-related Configuration" |

Models

| Name | Meaning | Project related | Host related |
| --- | --- | --- | --- |
| model_num_gauge | "Model number" curve with time | Y | N |
| non_broken_model_num_gauge | "Healthy model number" curve with time | Y | N |
| last_query_time_of_models | The last query time of models | Y | N |
| hit_count_of_models | The query hit count of models | Y | N |
| storage_of_models | The storage of models | Y | N |
| segments_num_of_models | The num of segments of models | Y | N |
| model_build_duration | Total build time of models | Y | N |
| model_wait_duration | Total wait time of models | Y | N |
| number_of_indexes | indexes number of models | Y | N |
| expansion_rate_of_models | Expansion rate of models | Y | N |
| model_build_duration (avg) | Avg build time of models | Y | N |

Queries

| Name | Meaning | Project related | Host related | Remark |
| --- | --- | --- | --- | --- |
| count_of_queries | Total count of queries | Y | Y(all, query) | - |
| num_of_query_per_host | The num of query per host | N | Y(all, query) | - |
| count_of_queries_hitting_agg_index | The count of queries hitting agg index | Y | Y(all, query) | - |
| ratio_of_queries_hitting_agg_index | The ratio of queries hitting agg index | Y | Y(all, query) | - |
| count_of_queries_hitting_table_index | The count of queries hitting table index | Y | Y(all, query) | - |
| ratio_of_queries_hitting_table_index | The ratio of queries hitting table index | Y | Y(all, query) | - |
| count_of_pushdown_queries | The count of pushdown queries | Y | Y(all, query) | - |
| ratio_of_pushdown_queries | The ratio of pushdown queries | Y | Y(all, query) | - |
| count_of_queries_hitting_cache | The count of queries hitting cache | Y | Y(all, query) | - |
| ratio_of_queries_hitting_cache | The ratio of queries hitting cache | Y | Y(all, query) | - |
| count_of_queries_less_than_1s | Total count of queries when duration is less than 1 second | Y | Y(all, query) | - |
| ratio_of_queries_less_than_1s | The ratio of queries when duration is less than 1 second | Y | Y(all, query) | - |
| count_of_queries_between_1s_and_3s | Total count of queries when duration is between 1 second and 3 seconds | Y | Y(all, query) | - |
| ration_of_queries_between_1s_and_3s | The ratio of queries when duration is between 1 second and 3 seconds | Y | Y(all, query) | - |
| count_of_queries_between_3s_and_5s | Total count of queries when duration is between 3 seconds and 5 seconds | Y | Y(all, query) | - |
| ratio_of_queries_between_3s_and_5s | The ratio of queries when duration is between 3 seconds and 5 seconds | Y | Y(all, query) | - |
| count_of_queries_between_5s_and_10s | Total count of queries when duration is between 5 seconds and 10 seconds | Y | Y(all, query) | - |
| ratio_of_queries_between_5s_and_10s | The ratio of queries when duration is between 5 seconds and 10 seconds | Y | Y(all, query) | - |
| count_of_queries_greater_than_10s | Total count of queries when duration exceeding 10 seconds | Y | Y(all, query) | - |
| ratio_of_queries_greater_than_10s | The ratio of queries when duration exceeding 10 seconds | Y | Y(all, query) | - |
| count_of_timeout_queries | The count of timeout queries | Y | Y(all, query) | - |
| count_of_failed_queries | The count of failed queries | Y | Y(all, query) | - |
| mean_time_of_query_per_host | The mean time of queries per host | N | Y(all, query) | - |
| 99%_of_query_latency | Query duration 99-percentile | Y | Y(all, query) | - |
| gt10s_query_rate_5-minute | Query duration exceeding 10s per second over 5 minutes | Y | Y(all, query) | - |
| failed_query_rate_5-minute | Failed queries per second over 5 minutes | Y | Y(all, query) | - |
| pushdown_query_rate_5-minute | Pushdown queries per second over 5 minutes | Y | Y(all, query) | - |
| scan_bytes_of_99%_queries | Query scan bytes 99-percentile | Y | Y(all, query) | - |
| query_scan_bytes_of_host | Query scan bytes per host | N | Y(all, query) | - |
| mean_scan_bytes_of_queries | The mean scan bytes of queries | Y | Y(all, query) | - |
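
To make the relationship between the duration-bucket counts and ratios concrete, the sketch below groups a list of query durations into the five buckets named in the metrics above and derives each ratio as the bucket count divided by the total. It is an illustration only: the sample durations are made up, the function names are hypothetical, and the treatment of boundary values (each bucket includes its lower bound and excludes its upper bound) is an assumption, since the boundaries are not defined here.

```python
# Bucket boundaries in seconds, taken from the metric names.
# Assumption: a bucket includes its lower bound and excludes its upper bound.
BUCKETS = [
    ("less_than_1s", 0.0, 1.0),
    ("between_1s_and_3s", 1.0, 3.0),
    ("between_3s_and_5s", 3.0, 5.0),
    ("between_5s_and_10s", 5.0, 10.0),
    ("greater_than_10s", 10.0, float("inf")),
]


def bucket_counts(durations_s):
    """Count query durations (in seconds) per latency bucket."""
    counts = {name: 0 for name, _, _ in BUCKETS}
    for duration in durations_s:
        for name, low, high in BUCKETS:
            if low <= duration < high:
                counts[name] += 1
                break
    return counts


def bucket_ratios(durations_s):
    """Share of queries per bucket; all zeros when there are no queries."""
    counts = bucket_counts(durations_s)
    total = sum(counts.values())
    return {name: (c / total if total else 0.0) for name, c in counts.items()}


sample = [0.2, 0.8, 1.5, 4.0, 12.0]  # made-up durations in seconds
print(bucket_counts(sample))
print(bucket_ratios(sample))
```

Because every query falls into exactly one bucket, the five ratios sum to 1 whenever at least one query was recorded, which is a quick sanity check when reading the panels side by side.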

Favorites

| Name | Meaning | Project related | Host related | Remark |
| --- | --- | --- | --- | --- |
| fq_accepted_total_times | Favorite Query user submitted total times | Y | Y(all, job, query) | - |
| fq_proposed_total_times | Favorite Query system triggered total times | Y | N | - |
| fq_proposed_total_duration | Favorite Query system triggered total duration | Y | N | - |
| failed_fq_proposed_total_times | Favorite Query system triggered failed total times | Y | N | Refer to the definition of "pushdown" |
| fq_adjusted_total_times | Favorite Query system adjusted total times | Y | Y(all, job, query) | - |
| fq_adjusted_total_duration | Favorite Query system adjusted total duration | Y | Y(all, job, query) | - |
| fq_update_usage_total_times | Favorite Query usage updated total times | Y | N | - |
| fq_update_usage_total_duration | Favorite Query usage updated total duration | Y | N | - |
| failed_fq_update_usage_total_times | Favorite Query usage updated failed total times | Y | N | - |
| fq_tobeaccelerated_num_gauge | Favorite Query to be accelerated | Y | N | - |
| fq_accelerated_num_gauge | Favorite Query accelerated | Y | N | - |
| fq_failed_num_gauge | Favorite Query accelerated failed times | Y | N | - |
| fq_accelerating_num_gauge | Favorite Query accelerating | Y | N | - |
| fq_pending_num_gauge | Favorite Query pending | Y | N | The Favorite Query lacks necessary conditions, such as missing column names, and requires user intervention |
| fq_blacklist_num_gauge | Favorite Query in blacklist | Y | N | Refer to the definition of "Blacklist" |

Jobs

| Name | Meaning | Project related | Host related |
| --- | --- | --- | --- |
| num_of_jobs_created | Jobs created total number | Y | Y(all, job) |
| num_of_jobs_finished | Jobs finished total number | Y | Y(all, job) |
| num_of_running_jobs | The num of running jobs currently | Y | N |
| num_of_pending_jobs | The num of pending jobs currently | Y | N |
| num_of_error_jobs | The num of error jobs currently | Y | N |
| count_of_error_jobs | The total count of error | Y | Y(all, job) |
| finished_jobs_total_duration | Jobs finished total duration | Y | Y(all, job) |
| job_duration_99p | Jobs duration 99-percentile | Y | Y(all, job) |
| job_step_attempted_total_times | Jobs step attempted total times | Y | Y(all, job) |
| failed_job_step_attempted_total_times | Jobs step attempted failed total times | Y | Y(all, job) |
| job_resumed_total_times | Jobs resumed total times | Y | Y(all, job) |
| job_discarded_total_times | Jobs discarded total times | Y | Y(all, job) |
| job_duration | The build duration of job | Y | Y(all, job) |
| job_wait_duration | The wait duration of job | Y | Y(all, job) |
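
Several metrics in these tables, such as job_duration_99p, report a 99-percentile. As a reminder of what that figure means, the sketch below computes a percentile with the nearest-rank definition, which is one common convention. The function name and the sample data are made up, and this is an assumption for illustration; the system may use a different estimator internally.

```python
import math


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data is less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


durations = list(range(1, 201))  # made-up job durations: 1..200 seconds
print(percentile_nearest_rank(durations, 99))  # 198
```

In this sample, 99% of the jobs finished within 198 seconds, so a few very slow jobs would raise the maximum without moving the 99-percentile much.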

Cleanings

| Name | Meaning | Project related | Host related |
| --- | --- | --- | --- |
| storage_clean_total_times | Storage cleanup total times | N | Y(all, job, query) |
| storage_clean_total_duration | Storage cleanup total duration | N | Y(all, job, query) |
| failed_storage_clean_total_times | Storage cleanup failed total times | N | Y(all, job, query) |

Metadata Operations

| Name | Meaning | Project related | Host related | Remark |
| --- | --- | --- | --- | --- |
| metadata_clean_total_times | Metadata cleanup total times | Y | Y(all, job, query) | - |
| metadata_backup_total_times | Metadata backup total times | Y | Y(all, job, query) | Distinguishes between project and global |
| metadata_backup_total_duration | Metadata backup total duration | Y | Y(all, job, query) | Distinguishes between project and global |
| failed_metadata_backup_total_times | Metadata backup failed total times | Y | Y(all, job, query) | Distinguishes between project and global |
| metadata_ops_total_times | Metadata daily operations total times | N | Y(all, job, query) | Runs at a fixed time each day (configurable): automatically back up metadata; rotate audit_log; clean up metadata and storage space; adjust FQ; clean up query histories. |
| metadata_success_ops_total_times | Metadata daily operations succeeded total times | N | Y(all, job, query) | - |

Transactions

| Name | Meaning | Project related | Host related | Remark |
| --- | --- | --- | --- | --- |
| transaction_retry_total_times | Transactions retried total times | Y | Y(all, job, query) | Distinguishes between project and global |
| transaction_latency_99p | Transactions duration 99-percentile | Y | Y(all, job, query) | Distinguishes between project and global |