This is an Aviate-only feature.
Introduction
Aviate Health is a feature provided by the Aviate plugin. It gives users insight into the overall health and performance of a Kill Bill installation. As part of this feature, the Aviate plugin exposes dedicated health-related endpoints that let you monitor and manage various aspects of your Kill Bill setup. Among other things, you can view detailed health metrics for all nodes within a Kill Bill installation, identify and resolve issues such as stuck bus or notification entries, and generate diagnostic reports to aid in troubleshooting and system optimization.
Getting Started with Aviate Health
This section provides a step-by-step approach to start using Aviate Health.
Installing the Plugin
The Aviate plugin can be installed as documented in the How to Install the Aviate Plugin doc.
Enabling Aviate Health
When the Aviate plugin is installed, Aviate Health is enabled by default.
The following configuration property controls this feature:
com.killbill.billing.plugin.aviate.enableHealthReporter=true
Refer to the Kill Bill Configuration Guide to learn more about setting configuration properties.
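As a small illustration, the sketch below checks whether the health reporter is enabled in a properties-style text. Only the property name comes from this guide. The helper name health_reporter_enabled and the parsing rules (one key=value pair per line, # for comments) are assumptions made for the example, and the fallback to True mirrors the enabled-by-default behavior described above.

```python
PROPERTY = "com.killbill.billing.plugin.aviate.enableHealthReporter"


def health_reporter_enabled(properties_text: str) -> bool:
    """Return the flag's value, defaulting to True when the property is absent."""
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines without a key=value pair
        # (assumed properties-file conventions, not taken from this guide).
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == PROPERTY:
            return value.strip().lower() == "true"
    # Property not set: the feature is enabled by default.
    return True
```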
Using Health APIs
As mentioned earlier, Aviate Health exposes endpoints for monitoring the health of a Kill Bill installation and fixing any problems that are found. Once the Aviate plugin is installed, you can start using the Aviate Health APIs. These are documented in our API docs.
Aviate Metrics
The Aviate plugin provides a Retrieve Metrics endpoint that can be used to assess the health of the system.
This section provides some insights into the metrics that can be retrieved via this endpoint.
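The following sketch shows one way a client might assemble a request URL for the Retrieve Metrics endpoint. The nodeName and granularity query parameters are the ones named in this guide. The path in METRICS_PATH, the helper name build_metrics_url, and the example parameter values are assumptions for illustration only, so check the API docs for the actual path and accepted values.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical placeholder: take the real path from the API docs.
METRICS_PATH = "/plugins/aviate-plugin/v1/health/metrics"


def build_metrics_url(base_url: str,
                      node_name: Optional[str] = None,
                      granularity: Optional[str] = None) -> str:
    """Build a Retrieve Metrics URL, omitting query parameters that are not set."""
    params = {}
    if node_name is not None:
        params["nodeName"] = node_name
    if granularity is not None:
        params["granularity"] = granularity
    query = urlencode(params)
    url = base_url.rstrip("/") + METRICS_PATH
    return f"{url}?{query}" if query else url
```

Building the URL separately from sending the request keeps the sketch independent of any particular HTTP client or authentication scheme.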
Metrics Overview
The metrics exposed by the Aviate plugin fall into the following groups (types):
- Gauge - A gauge returns a single numerical value per minute/hour/day (based on the value of the granularity parameter).
- Meter - A meter provides the rate of events over time. It provides different data points for the following sample kinds:
Sample Kind | Description |
---|---|
count | Monotonically increasing value since last reboot. |
{one_minute/five_minute/fifteen_minute}_rate | Rate through a window of time. |
mean_rate | Mean rate since last reboot. |
- Timer - A timer measures the rate of events and the duration of those events. It provides different data points for the following sample kinds:
Sample Kind | Description |
---|---|
mean_rate | Mean rate since last reboot. |
{one_minute/five_minute/fifteen_minute}_rate | Rate through a window of time. |
tp75, tp95, tp98, tp99, tp999 | Percentiles for the metric since last reboot. |
min | Min value since last reboot. |
max | Max value since last reboot. |
count | Monotonically increasing value since last reboot. |
median | Median value since last reboot. |
std_dev | Standard deviation since last reboot. |
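To make the relationship between metric types and sample kinds concrete, here is a minimal sketch that encodes the lists above as a lookup table. The expanded names such as one_minute_rate are one reading of the {one_minute/five_minute/fifteen_minute}_rate notation, and the names SAMPLE_KINDS and is_valid_sample_kind are hypothetical helpers, not part of the plugin.

```python
# Expand the {one_minute/five_minute/fifteen_minute}_rate notation
# (assumed to mean three separate sample kinds).
RATE_WINDOWS = ("one_minute", "five_minute", "fifteen_minute")
WINDOWED_RATES = tuple(f"{window}_rate" for window in RATE_WINDOWS)

METER_KINDS = ("count", "mean_rate") + WINDOWED_RATES
TIMER_KINDS = METER_KINDS + (
    "min", "max", "median", "std_dev",
    "tp75", "tp95", "tp98", "tp99", "tp999",
)

SAMPLE_KINDS = {
    "Gauge": (),  # a gauge returns a single value, so no sample kinds apply
    "Meter": METER_KINDS,
    "Timer": TIMER_KINDS,
}


def is_valid_sample_kind(metric_type: str, sample_kind: str) -> bool:
    """Check whether a sample kind is listed for the given metric type."""
    return sample_kind in SAMPLE_KINDS.get(metric_type, ())
```

For example, is_valid_sample_kind("Timer", "tp99") holds, while the same sample kind is rejected for a Meter because meters do not record durations.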
Queue Metrics
Queue metrics can be used to assess the health of the Kill Bill internal queues.
Kill Bill has its own internal queues used to dispatch events. Events that are dispatched right away as a result of some internal state being created or updated are called bus events - e.g. a subscription_creation event is generated as a result of creating a new subscription. Events that are scheduled to be dispatched in the future are called notifications - e.g. invoice scheduled on a periodic basis matching account settings and plan billing periods. The health of these internal queues is critical to maintaining correct functioning of the system.
Note that the metrics associated with the queues are global (as opposed to computed per node), so the nodeName query parameter will be ignored. Additionally, all the queue metrics are of type Gauge and therefore return a single value.
The following table lists these metrics:
Metric Name | Description |
---|---|
queue.bus.late | This is a counter that shows how many unprocessed bus events we have at time 't'. This number should be close to 0. |
queue.bus.incoming | This is the rate of incoming bus events at time 't'. The default granularity is |
queue.bus.processing | This is an estimate of the time, in milliseconds, that was used to process the bus event. These values may become incorrect once we have late bus entries. |
queue.notifications.late | This is a counter that shows how many unprocessed notifications we have at time 't'. This number should be close to 0. |
queue.notifications.incoming | This is the rate of incoming notifications at time 't'. The default granularity is |
queue.notifications.processing | This is an estimate of the time, in milliseconds, that was used to process the notification. These values may become incorrect once we have late notifications. |
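Because the late metrics should stay close to 0, a monitoring script can flag them when they rise. The sketch below assumes the latest gauge values have already been extracted into a plain dictionary keyed by metric name. That input shape, the function name late_entries_alerts, and the default threshold are illustrative assumptions and do not reflect the endpoint's response format.

```python
def late_entries_alerts(latest_values: dict, threshold: int = 0) -> list:
    """Return the names of the 'late' queue metrics whose value exceeds threshold."""
    late_metrics = ("queue.bus.late", "queue.notifications.late")
    # A missing metric is treated as 0, i.e. nothing is late.
    return [name for name in late_metrics
            if latest_values.get(name, 0) > threshold]
```

A threshold slightly above 0 may be more practical under bursty load, since a handful of entries can briefly be late without indicating a stuck queue.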
Logs
Kill Bill is configured to output its internal logs as specified by the logback.xml configuration (see docs). The Aviate plugin running on each node extracts important information from the logs and computes metrics that highlight potential issues, based on the warn and error logs emitted over time.
These metrics are computed per node. Additionally, the log metrics are all Meter metrics, so they each provide different time series as specified by the Sample Kind listed above.
The following table lists these metrics:
Metric Name | Description |
---|---|
logs.rates.warn | Represents warnings in logs. |
logs.rates.error | Represents errors in logs. |
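Since the count sample kind of a Meter only grows until the next reboot, the number of new warnings or errors in an interval can be derived from two successive readings. The sketch below shows one way to do that while tolerating a counter that restarted. The function name count_delta and the reset-handling rule are assumptions for illustration, not plugin behavior.

```python
def count_delta(previous: int, current: int) -> int:
    """Events between two readings of a count sample, tolerating a reset."""
    if current < previous:
        # The counter restarted from zero (assumed: the node rebooted),
        # so everything counted so far happened after the previous reading.
        return current
    return current - previous
```

Applied to two readings of logs.rates.error taken a few minutes apart, a non-zero result indicates new error logs on that node during the interval.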
Servlet Responses
Servlet metrics provide visibility into any of the endpoints exposed by the system, either from Kill Bill core (/1.0/kb) or any plugins exposing endpoints.
These metrics are computed per node. The servlet metrics are Meter metrics, so they each provide different time series as specified by the Sample Kind listed above.
The following table lists these metrics:
Metric Name | Description |
---|---|
servlets.responses.ok | Represents successful responses. |
servlets.responses.created | Represents created responses. |
servlets.responses.badRequest | Represents bad request responses. |
servlets.responses.noContent | Represents no content responses. |
servlets.responses.notFound | Represents not found responses. |
servlets.responses.serverError | Represents server error responses. |
servlets.responses.other | Represents other responses. |
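One way to summarize these metrics is the share of server errors among all responses on a node. The sketch below assumes the count sample of each servlet metric has already been collected into a dictionary keyed by metric name. That input shape and the function name server_error_ratio are hypothetical.

```python
def server_error_ratio(counts: dict) -> float:
    """Share of serverError responses among all counted responses (0.0 if none)."""
    kinds = ("ok", "created", "badRequest", "noContent",
             "notFound", "serverError", "other")
    # Metrics missing from the input are treated as 0 responses.
    total = sum(counts.get(f"servlets.responses.{kind}", 0) for kind in kinds)
    if total == 0:
        return 0.0
    return counts.get("servlets.responses.serverError", 0) / total
```

Because count accumulates since the last reboot, this ratio describes the whole uptime of the node. The windowed rate sample kinds are a better fit for spotting recent spikes.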
Database Connection Pools
Kill Bill uses 3 different database connection pools: main, shiro, and osgi. The main and shiro pools are internal connection pools within Kill Bill core. The osgi connection pool is used by the plugins running on top of the Kill Bill platform for any database calls.
The connection pool metrics are computed per node. The following metrics are Gauge metrics, so they return a single value:
Metric Name | Description |
---|---|
main.pool.TotalConnections | Total (created) connections at time 't' for the main pool. |
main.pool.ActiveConnections | Active (in use) connections at time 't' for the main pool. |
main.pool.IdleConnections | Idle connections at time 't' for the main pool. |
osgi.pool.TotalConnections | Total (created) connections at time 't' for the osgi pool. |
osgi.pool.ActiveConnections | Active (in use) connections at time 't' for the osgi pool. |
osgi.pool.IdleConnections | Idle connections at time 't' for the osgi pool. |
shiro.pool.TotalConnections | Total (created) connections at time 't' for the shiro pool. |
shiro.pool.ActiveConnections | Active (in use) connections at time 't' for the shiro pool. |
shiro.pool.IdleConnections | Idle connections at time 't' for the shiro pool. |
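The gauges above can be combined into a utilization figure per pool, which makes it easier to see when a pool is close to exhaustion. This is a minimal sketch that assumes the latest gauge values for one node are available as a dictionary keyed by metric name. The function name pool_utilization and the input shape are illustrative assumptions.

```python
def pool_utilization(metrics: dict, pool: str) -> float:
    """Fraction of a pool's created connections that are in use (0.0 if empty)."""
    total = metrics.get(f"{pool}.pool.TotalConnections", 0)
    active = metrics.get(f"{pool}.pool.ActiveConnections", 0)
    # Avoid dividing by zero when no connections have been created yet.
    return active / total if total else 0.0
```

A utilization that stays near 1.0 for the main, shiro, or osgi pool suggests that callers may be waiting for connections.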
The following metrics are Timer metrics and provide different data points for the sample kinds listed above:
Metric Name | Description |
---|---|
main.pool.Wait | Wait time to get a connection from the pool. |
osgi.pool.Wait | Wait time to get a connection from the pool. |
shiro.pool.Wait | Wait time to get a connection from the pool. |