Aviate Health

This is an Aviate-only feature.

Introduction

Aviate Health is a feature provided by the Aviate plugin. It is designed to give users valuable insights into the overall health and performance of a Kill Bill installation. As part of this feature, the aviate plugin exposes dedicated health-related endpoints that allow you to monitor and manage various aspects of your Kill Bill setup. Some of the capabilities offered by this feature include viewing detailed health metrics for all nodes within a Kill Bill installation, identifying and resolving issues such as stuck bus or notification entries, and generating comprehensive diagnostic reports to aid in troubleshooting and system optimization.

Getting Started with Aviate Health

This section provides a step-by-step approach to start using Aviate Health.

Installing the Plugin

The Aviate plugin can be installed as documented in the How to Install the Aviate Plugin doc.

Enabling Aviate Health

When the Aviate plugin is installed, Aviate Heath is enabled by default.

The following configuration property controls this feature:

com.killbill.billing.plugin.aviate.enableHealthReporter=true

Refer to the Kill Bill Configuration Guide to know more about setting configuration properties.

Using Health APIs

As mentioned earlier, Aviate Health exposes endpoints that allow monitoring the health of a KB installation and fixing problems if any. Once the aviate plugin is installed, you can start using the Aviate Health APIs. These are documented in our api docs.

Aviate Metrics

The Aviate plugin provides a Retrieve Metrics endpoint that can be used to assess the health of the system.

This section provides some insights into the metrics that can be retrieved via this endpoint.

Metrics Overview

The metrics exposed by the Aviate plugin can mainly be categorized in the following event groups (types):

Gauge - A gauge return a single numerical value per minute/hour/day (based on the value of the granularity parameter)

Meter - A meter provide the rate over time. They provide different data points for the following sample kinds:

Sample Kind	Description
count	Monotonic increasing value since last reboot.
{one_minute/five_minute/fifteen_minute}_rate	Rate through a window of time.
mean_rate	Mean rate since last reboot.

Sample Kind

Description

count

Monotonic increasing value since last reboot.

{one_minute/five_minute/fifteen_minute}_rate

Rate through a window of time.

mean_rate

Mean rate since last reboot.

Timer - A timer measures the rate of events and the duration of those events. They provide different data points for the following sample kinds:

Sample Kind	Description
mean_rate	Mean rate since last reboot.
{one_minute/five_minute/fifteen_minute}_rate	Rate through a window of time.
tp99, tp999, tp75, tp98, tp95	Percentile for the metrics since last reboot.
min	Min value since last reboot.
max	Max value since last reboot.
count	Monotonic increasing value since last reboot.
median	Median value since last reboot.
std_dev	Standard deviation since last reboot.

Sample Kind

Description

mean_rate

Mean rate since last reboot.

{one_minute/five_minute/fifteen_minute}_rate

Rate through a window of time.

tp99, tp999, tp75, tp98, tp95

Percentile for the metrics since last reboot.

min

Min value since last reboot.

max

Max value since last reboot.

count

Monotonic increasing value since last reboot.

median

Median value since last reboot.

std_dev

Standard deviation since last reboot.

Queue Metrics

Queue metrics can be used to assess the health of the Kill Bill internal queues.

Kill Bill has its own internal queues used to dispatch events. Events that are dispatched right away as a result of some internal state being created or updated are called bus events - e.g. a subscription_creation event is generated as a result of creating a new subscription. Events that are scheduled to be dispatched in the future are called notifications - e.g. invoice scheduled on a periodic basis matching account settings and plan billing periods. The health of these internal queues is critical to maintaining correct functioning of the system.

Note that the metrics associated with the queues are global (as opposed to computed per node) so the nodeName query parameter will be ignored. Additionally, all the queue metrics are of Gauge and therefore return a single value.

The following table lists these metrics:

Metric Name Description

Metric Name	Description
queue.bus.late	This is a counter that shows how many unprocessed bus events we have at time 't'. This number should be close to 0.
queue.bus.ext.late	This is a counter that shows how many unprocessed external bus events we have at time 't'. This number should be close to 0.
queue.bus.incoming	The is rate of incoming bus events at time t. The default granularity is `MINUTE` but results can be aggregated based on the `granularity` query parameter - e.g hourly incoming rate.
queue.bus.ext.incoming	The is rate of incoming external bus events at time t. The default granularity is `MINUTE` but results can be aggregated based on the `granularity` query parameter - e.g hourly incoming rate.
queue.bus.processing	This is an estimation of the time in mSec that was used to process the bus event. These values may become incorrect once we have late bus entries.
queue.bus.ext.processing	This is an estimation of the time in mSec that was used to process the external bus event. These values may become incorrect once we have late bus entries.
queue.notifications.late	This is a counter that shows how many unprocessed notifications we have at time 't'. This number should be close to 0.
queue.notifications.incoming	The is rate of incoming notifications at time t. The default granularity is `MINUTE` but results can be aggregated based on the `granularity` query parameter - e.g hourly incoming rate.
queue.notifications.processing	This is an estimation of the time in mSec that was used to process the notification. These values may become incorrect once we have late notifications.

queue.bus.late

This is a counter that shows how many unprocessed bus events we have at time 't'. This number should be close to 0.

queue.bus.ext.late

This is a counter that shows how many unprocessed external bus events we have at time 't'. This number should be close to 0.

queue.bus.incoming

The is rate of incoming bus events at time t. The default granularity is MINUTE but results can be aggregated based on the granularity query parameter - e.g hourly incoming rate.

queue.bus.ext.incoming

The is rate of incoming external bus events at time t. The default granularity is MINUTE but results can be aggregated based on the granularity query parameter - e.g hourly incoming rate.

queue.bus.processing

This is an estimation of the time in mSec that was used to process the bus event. These values may become incorrect once we have late bus entries.

queue.bus.ext.processing

This is an estimation of the time in mSec that was used to process the external bus event. These values may become incorrect once we have late bus entries.

queue.notifications.late

This is a counter that shows how many unprocessed notifications we have at time 't'. This number should be close to 0.

queue.notifications.incoming

The is rate of incoming notifications at time t. The default granularity is MINUTE but results can be aggregated based on the granularity query parameter - e.g hourly incoming rate.

queue.notifications.processing

This is an estimation of the time in mSec that was used to process the notification. These values may become incorrect once we have late notifications.

Logs

Kill Bill is configured to output its internal logs as specified by the logback.xml configuration (See docs). The aviate plugin running on each node extracts important information from the logs and computes some metrics to highlight potential issues with warn and error logs that have happened through time.

Those metrics are computed per node. Additionally, the log metrics are all Meter metrics, and so they each provide different time series as specified by the Sample Kind listed above.

The following table lists these metrics:

Metric Name	Description
logs.rates.warn	Represents warnings in logs
logs.rates.error	Represents errors in logs

Metric Name

Description

logs.rates.warn

Represents warnings in logs

logs.rates.error

Represents errors in logs

Servlet Responses

Servlet metrics provide visibility into any of the endpoints exposed by the system, either from Kill Bill core (/1.0/kb) or any plugins exposing endpoints.

These metrics are computed per node. The servlet metrics are Meter metrics, so they each provide different time series as specified by the Sample Kind listed above.

The following table lists these metrics:

Metric Name	Description
servlets.responses.ok	Represents successful responses.
servlets.responses.created	Represents created responses.
servlets.responses.badRequest	Represents bad request responses.
servlets.responses.noContent	Represents no content responses.
servlets.responses.notFound	Represents not found responses.
servlets.responses.serverError	Represents server error responses.
servlets.responses.other	Represents other responses.

Metric Name

Description

servlets.responses.ok

Represents successful responses.

servlets.responses.created

Represents created responses.

servlets.responses.badRequest

Represents bad request responses.

servlets.responses.noContent

Represents no content responses.

servlets.responses.notFound

Represents not found responses.

servlets.responses.serverError

Represents server error responses.

servlets.responses.other

Represents other responses.

Database Connection Pools

Kill Bill uses 3 different database connection pools: main, shiro, and osgi. main and shiro are internal connection pools within Kill Bill core. The osgi connection pool is used by the plugins running on top of the Kill Bill platform for any database calls.

The connection pool metrics are computed per node. The following metrics are Gauge metrics and so they return a single value:

Metric Name Description

Metric Name	Description
main.pool.TotalConnections	Total (created) connections at time 't' for `main` pool
main.pool.ActiveConnections	Active (in use) connections at time 't' for `main` pool
main.pool.IdleConnections	Idle connections at time 't' for `main` pool
osgi.pool.TotalConnections	Total (created) connections at time 't' for `osgi` pool
osgi.pool.ActiveConnections	Active (in use) connections at time 't' for `osgi` pool
osgi.pool.IdleConnections	Idle connections at time 't' for `osgi` pool
shiro.pool.TotalConnections	Total (created) connections at time 't' for `shiro` pool
shiro.pool.ActiveConnections	Active (in use) connections at time 't' for `shiro` pool
shiro.pool.IdleConnections	Idle connections at time 't' for `shiro` pool

main.pool.TotalConnections

Total (created) connections at time 't' for main pool

main.pool.ActiveConnections

Active (in use) connections at time 't' for main pool

main.pool.IdleConnections

Idle connections at time 't' for main pool

osgi.pool.TotalConnections

Total (created) connections at time 't' for osgi pool

osgi.pool.ActiveConnections

Active (in use) connections at time 't' for osgi pool

osgi.pool.IdleConnections

Idle connections at time 't' for osgi pool

shiro.pool.TotalConnections

Total (created) connections at time 't' for shiro pool

shiro.pool.ActiveConnections

Active (in use) connections at time 't' for shiro pool

shiro.pool.IdleConnections

Idle connections at time 't' for shiro pool

The following metrics are timer metrics and provide different data points for the sample kinds listed above:

Metric Name	Description
main.pool.Wait	Wait time to get a connection from the pool.
osgi.pool.Wait	Wait time to get a connection from the pool.
shiro.pool.Wait	Wait time to get a connection from the pool.

Metric Name

Description

main.pool.Wait

Wait time to get a connection from the pool.

osgi.pool.Wait

Wait time to get a connection from the pool.

shiro.pool.Wait

Wait time to get a connection from the pool.

Retrieve Diagnostic Report Examples

The Retrieve Diagnostic Report endpoint generates a diagnostic report. Such a report, when shared with the Kill Bill team, allows the team to replicate a user’s Kill Bill environment and diagnose issues.

This section provides examples of generating the diagnostic report with various parameters.

With Account Data Only

In order to include the account data, the accountId parameter needs to be specified:

curl -X GET \
     -H 'X-killbill-apiKey: bob' \
     -H 'X-killbill-apisecret: lazar' \
     -H "Accept: application/zip"  \
     -H "Authorization: Bearer ${ID_TOKEN}" \
     "http://127.0.0.1:8080/plugins/aviate-plugin/v1/health/diagnostic?accountId=88edb5ba-d613-4923-84dc-b3cee1bf0b42" -JO

This command generates a diagnostic file that includes a file called account.data. This has data from all the KB tables for the specified accountId.

With Configuration Data

The configuration data is controlled via the withKillbillConfig, withTenantConfig and the withSystemConfig query parameters:

curl -X GET \
     -H 'X-killbill-apiKey: bob' \
     -H 'X-killbill-apisecret: lazar' \
     -H "Accept: application/zip"  \
	 -H "Authorization: Bearer ${ID_TOKEN}" \
     "http://127.0.0.1:8080/plugins/aviate-plugin/v1/health/diagnostic?accountId=d0b3e0b9-f4bc-43d4-8001-1969b03b4555&withKillbillConfig=true&withTenantConfig=true" \
	 -JO

This command generates a diagnostic file with the killbill_configuration.data (which includes the global KB configuration), and tenant_config.data (which includes the per-tenant configuration properties like catalog, overdue, etc.) files.

With KB Logs

In order to include the logs in the diagnostic file, the following parameters need to be specified:

withLogs=true
logsDir=path of the directory from which to include the log files. If unspecified, this defaults to the value of com.killbill.billing.plugin.aviate.logsPath property. If this property is not specified, it defaults to TOMCAT_HOME/logs where TOMCAT_HOME is determined via the catalina.base system property
logsFilenames=name of the log file that needs to be included. If multiple log files are required, this parameter needs to be included multiple times corresponding to each log file as shown in the examples below. If unspecified, this defaults to catalina.out.

Depending on the type of KB installation, you can use different curl commands to include the KB logs.

Here are some examples.

Tomcat Setup

Example 1

On a Tomcat setup, the KB/Kaui logs are created in the directory specified in the logback.xml as explained here. Thus, to include these logs, you need to specify the logsDir parameter with the path specified in logback.xml as follows:

curl -X GET \
     -H 'X-killbill-apiKey: bob' \
     -H 'X-killbill-apisecret: lazar' \
     -H "Accept: application/zip"  \
     -H "Authorization: Bearer ${ID_TOKEN}" \
     "http://127.0.0.1:8080/plugins/aviate-plugin/v1/health/diagnostic?accountId=88edb5ba-d613-4923-84dc-b3cee1bf0b42&withLogs=true&logsDir=/var/lib/killbill/&logsFilenames=killbill.out&logsFilenames=kaui.out" -JO

This command generates a diagnostic file with the account data and the killbill.out, kaui.out log files from the /var/lib/killbill directory.

Example 2:

As explained earlier, if the logsDir parameter is not specified, it defaults to the value of the com.killbill.billing.plugin.aviate.logsPath property. Assuming that the KB log files are created in the /var/lib/killbill/logs directory and that the com.killbill.billing.plugin.aviate.logsPath=/var/lib/killbill/logs property is set, you can use the following command to include the KB/Kaui log files:

curl -X GET \
     -H 'X-killbill-apiKey: bob' \
     -H 'X-killbill-apisecret: lazar' \
     -H "Accept: application/zip"  \
     -H "Authorization: Bearer ${ID_TOKEN}" \
     "http://127.0.0.1:8080/plugins/aviate-plugin/v1/health/diagnostic?accountId=88edb5ba-d613-4923-84dc-b3cee1bf0b42&withLogs=true&logsFilenames=killbill.out&logsFilenames=kaui.out" -JO

This command generates a diagnostic file with the account data and the killbill.out, kaui.out log files from the /var/lib/killbill/logs directory.

Example 3

If the com.killbill.billing.plugin.aviate.logsPath property is not set and the logsDir parameter is not specified, it defaults to the TOMCAT_HOME/logs directory. Thus, the localhost.2025-01-01.log file from the Tomcat logs directory can be included using the following command:

curl -X GET \
     -H 'X-killbill-apiKey: bob' \
     -H 'X-killbill-apisecret: lazar' \
     -H "Accept: application/zip"  \
     -H "Authorization: Bearer ${ID_TOKEN}" \
     "http://127.0.0.1:8080/plugins/aviate-plugin/v1/health/diagnostic?accountId=88edb5ba-d613-4923-84dc-b3cee1bf0b42&withLogs=true&logsFilenames=localhost.2025-01-01.log" -JO

This command generates a diagnostic file with the account data and the localhost.2025-01-01.log file from the Tomcat home directory.

Example 4

As mentioned earlier, if the logsFilenames parameter is not specified, it defaults to catalina.out. Thus, to include this file, you can use:

curl -X GET \
     -H 'X-killbill-apiKey: bob' \
     -H 'X-killbill-apisecret: lazar' \
     -H "Accept: application/zip"  \
     -H "Authorization: Bearer ${ID_TOKEN}" \
     "http://127.0.0.1:8080/plugins/aviate-plugin/v1/health/diagnostic?accountId=88edb5ba-d613-4923-84dc-b3cee1bf0b42&withLogs=true" -JO

This command generates a diagnostic file with the account data and the catalina.out file from the Tomcat home directory.

Docker/AWS Setup

In case of a Docker/AWS setup, KB/Kaui logs are created in the $TOMCAT_HOME/logs directory (which is /var/lib/tomcat/logs). Thus, there is no need to explicitly specify the logsDir parameter to include these logs. So you can use:

curl -X GET \
-H 'X-killbill-apiKey: bob' \
-H 'X-killbill-apisecret: lazar' \
-H "Accept: application/zip"  \
-H "Authorization: Bearer ${ID_TOKEN}" \
"http://127.0.0.1:8081/plugins/aviate-plugin/v1/health/diagnostic?accountId=0dc4f698-c6cd-49c6-90e7-f745c70f96cb&withLogs=true&logsFilenames=killbill.out" -JO

This command generates a diagnostic file with the account data and the killbill.out log file from the /var/lib/tomcat/logs directory.

PreviousPrev NextNext