This is an Aviate-only feature.

Introduction

Aviate Health is a feature of the Aviate plugin that gives insight into the overall health and performance of a Kill Bill installation. The plugin exposes dedicated health endpoints for monitoring and managing a Kill Bill setup. With these endpoints you can view detailed health metrics for all nodes in an installation, identify and resolve issues such as stuck bus or notification entries, and generate diagnostic reports to aid troubleshooting and system optimization.

Getting Started with Aviate Health

This section walks through the steps needed to start using Aviate Health.

Installing the Plugin

The Aviate plugin can be installed as documented in the How to Install the Aviate Plugin doc.

Enabling Aviate Health

When the Aviate plugin is installed, Aviate Health is enabled by default.

The following configuration property controls this feature:

com.killbill.billing.plugin.aviate.enableHealthReporter=true

Refer to the Kill Bill Configuration Guide to learn more about setting configuration properties.

Using Health APIs

As mentioned earlier, Aviate Health exposes endpoints for monitoring the health of a Kill Bill installation and fixing any problems found. Once the Aviate plugin is installed, you can start using the Aviate Health APIs. These are documented in our API docs.

Aviate Metrics

The Aviate plugin provides a Retrieve Metrics endpoint that can be used to assess the health of the system.

This section provides some insights into the metrics that can be retrieved via this endpoint.
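As a rough illustration of how a call to the Retrieve Metrics endpoint might be prepared, the Python sketch below builds a request URL for one metric. The host, the path, the placement of the metric name in the path, and the HOUR granularity value are assumptions made for illustration; this guide only mentions the granularity and nodeName query parameters and the MINUTE default. Consult the Aviate API docs for the actual endpoint and its authentication requirements.

```python
from urllib.parse import urlencode

# Hypothetical values: replace with the host and path given in the Aviate API docs.
BASE_URL = "http://127.0.0.1:8080"
METRICS_PATH = "/plugins/aviate-plugin/v1/health/metrics"  # assumed path


def build_metrics_url(metric_name, granularity="MINUTE", node_name=None):
    """Build a URL for retrieving one metric.

    granularity and nodeName are the query parameters mentioned in this guide;
    everything else about the URL shape is an assumption.
    """
    params = {"granularity": granularity}
    if node_name is not None:
        # Per-node metrics only; queue metrics are global and ignore nodeName.
        params["nodeName"] = node_name
    return f"{BASE_URL}{METRICS_PATH}/{metric_name}?{urlencode(params)}"


print(build_metrics_url("queue.bus.late", granularity="HOUR"))
print(build_metrics_url("logs.rates.error", node_name="node-1"))
```

Keeping URL construction in one small function makes it easy to swap in the real path once it is confirmed.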

Metrics Overview

The metrics exposed by the Aviate plugin fall into the following types:

  • Gauge - A gauge returns a single numerical value per minute, hour, or day, depending on the value of the granularity parameter.

  • Meter - A meter provides the rate of events over time. It provides different data points for the following sample kinds:

    Sample Kind                                   Description
    count                                         Monotonically increasing value since last reboot.
    {one_minute/five_minute/fifteen_minute}_rate  Rate over the given window of time.
    mean_rate                                     Mean rate since last reboot.

  • Timer - A timer measures the rate of events and the duration of those events. It provides different data points for the following sample kinds:

    Sample Kind                                   Description
    mean_rate                                     Mean rate since last reboot.
    {one_minute/five_minute/fifteen_minute}_rate  Rate over the given window of time.
    tp99, tp999, tp75, tp98, tp95                 Percentiles of the metric since last reboot.
    min                                           Min value since last reboot.
    max                                           Max value since last reboot.
    count                                         Monotonically increasing value since last reboot.
    median                                        Median value since last reboot.
    std_dev                                       Standard deviation since last reboot.
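To make the sample kinds above more concrete, here is a minimal Python sketch that derives similar values from raw data. It is not how the plugin computes them: the nearest-rank percentile, the population standard deviation, and the events-per-second unit for mean_rate are assumptions chosen for simplicity, and both function names are hypothetical.

```python
import math
import statistics


def meter_summary(event_count, uptime_seconds):
    """Approximate a meter's count and mean_rate since the last reboot.

    Assumes the rate is expressed in events per second.
    """
    return {"count": event_count, "mean_rate": event_count / uptime_seconds}


def timer_summary(durations_ms):
    """Approximate a timer's duration-related sample kinds from raw durations."""
    if not durations_ms:
        raise ValueError("at least one duration is required")
    ordered = sorted(durations_ms)

    def percentile(p):
        # Nearest-rank percentile: smallest value covering p percent of samples.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "median": statistics.median(ordered),
        "std_dev": statistics.pstdev(ordered),
        "tp75": percentile(75),
        "tp95": percentile(95),
        "tp99": percentile(99),
    }


print(meter_summary(120, 60.0))
print(timer_summary([10, 20, 30, 40, 100]))
```

Note how a single slow event (100 ms) dominates max and the high percentiles while leaving the median untouched, which is why the percentile sample kinds are usually more informative than the mean for latency.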

Queue Metrics

Queue metrics can be used to assess the health of the Kill Bill internal queues.

Kill Bill has its own internal queues used to dispatch events. Events that are dispatched right away as a result of some internal state being created or updated are called bus events, e.g. a subscription_creation event is generated as a result of creating a new subscription. Events that are scheduled to be dispatched in the future are called notifications, e.g. an invoice scheduled periodically according to account settings and plan billing periods. The health of these internal queues is critical to the correct functioning of the system.

Note that the metrics associated with the queues are global (as opposed to computed per node), so the nodeName query parameter is ignored. Additionally, all the queue metrics are Gauge metrics and therefore return a single value.

The following table lists these metrics:

Metric Name                     Description
queue.bus.late                  Number of unprocessed bus events at time t. This number should be close to 0.
queue.bus.incoming              Rate of incoming bus events at time t. The default granularity is MINUTE, but results can be aggregated using the granularity query parameter, e.g. an hourly incoming rate.
queue.bus.processing            Estimated time, in milliseconds, taken to process a bus event. These values may become inaccurate once there are late bus entries.
queue.notifications.late        Number of unprocessed notifications at time t. This number should be close to 0.
queue.notifications.incoming    Rate of incoming notifications at time t. The default granularity is MINUTE, but results can be aggregated using the granularity query parameter, e.g. an hourly incoming rate.
queue.notifications.processing  Estimated time, in milliseconds, taken to process a notification. These values may become inaccurate once there are late notifications.
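Since the late metrics should stay close to 0, one possible way to act on them is to flag a queue only when the backlog persists across several consecutive data points, so that a brief spike does not raise an alarm. The sketch below assumes the gauge values have already been fetched into a plain list; the function name, the threshold, and the number of consecutive points are hypothetical choices, not plugin defaults.

```python
def has_sustained_backlog(late_counts, threshold=0, min_consecutive=3):
    """Return True if late_counts stays above threshold for
    min_consecutive data points in a row.

    late_counts: values of queue.bus.late or queue.notifications.late,
    ordered by time (one value per granularity interval).
    """
    run = 0
    for value in late_counts:
        # Extend the current run of late data points, or reset it.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# A backlog that keeps growing is flagged; isolated spikes are not.
print(has_sustained_backlog([0, 0, 3, 5, 8]))
print(has_sustained_backlog([0, 4, 0, 5, 0]))
```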

Logs

Kill Bill is configured to output its internal logs as specified by the logback.xml configuration (see docs). The Aviate plugin running on each node extracts important information from the logs and computes metrics that highlight potential issues, based on the warn and error logs emitted over time.

These metrics are computed per node. Additionally, the log metrics are all Meter metrics, so each provides a different time series for every sample kind listed above.

The following table lists these metrics:

Metric Name       Description
logs.rates.warn   Represents warnings in logs.
logs.rates.error  Represents errors in logs.

Servlet Responses

Servlet metrics provide visibility into any of the endpoints exposed by the system, either from Kill Bill core (/1.0/kb) or any plugins exposing endpoints.

These metrics are computed per node. The servlet metrics are Meter metrics, so they each provide different time series as specified by the Sample Kind listed above.

The following table lists these metrics:

Metric Name                     Description
servlets.responses.ok           Represents successful responses.
servlets.responses.created      Represents created responses.
servlets.responses.badRequest   Represents bad request responses.
servlets.responses.noContent    Represents no content responses.
servlets.responses.notFound     Represents not found responses.
servlets.responses.serverError  Represents server error responses.
servlets.responses.other        Represents other responses.
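One way these metrics might be combined is to compute the share of server errors among all responses on a node. The sketch below assumes the count sample kind of each servlet metric has already been collected into a dictionary keyed by metric name; the function name and the sample numbers are purely illustrative.

```python
def server_error_ratio(counts):
    """Return the fraction of responses that were server errors.

    counts: mapping of servlet metric name to its count sample kind.
    """
    total = sum(counts.values())
    if total == 0:
        # No traffic recorded yet, so there is nothing to report.
        return 0.0
    return counts.get("servlets.responses.serverError", 0) / total


# Illustrative numbers only, not real measurements.
sample_counts = {
    "servlets.responses.ok": 90,
    "servlets.responses.created": 5,
    "servlets.responses.serverError": 5,
}
print(server_error_ratio(sample_counts))
```

Because count only resets on reboot, comparing two snapshots taken at different times gives a ratio for that interval rather than for the whole uptime.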

Database Connection Pools

Kill Bill uses three different database connection pools: main, shiro, and osgi. The main and shiro pools are internal connection pools within Kill Bill core. The osgi connection pool is used by the plugins running on top of the Kill Bill platform for any database calls.

The connection pool metrics are computed per node. The following metrics are Gauge metrics and so they return a single value:

Metric Name                   Description
main.pool.TotalConnections    Total (created) connections at time t for the main pool.
main.pool.ActiveConnections   Active (in use) connections at time t for the main pool.
main.pool.IdleConnections     Idle connections at time t for the main pool.
osgi.pool.TotalConnections    Total (created) connections at time t for the osgi pool.
osgi.pool.ActiveConnections   Active (in use) connections at time t for the osgi pool.
osgi.pool.IdleConnections     Idle connections at time t for the osgi pool.
shiro.pool.TotalConnections   Total (created) connections at time t for the shiro pool.
shiro.pool.ActiveConnections  Active (in use) connections at time t for the shiro pool.
shiro.pool.IdleConnections    Idle connections at time t for the shiro pool.

The following metrics are Timer metrics and provide different data points for the sample kinds listed above:

Metric Name      Description
main.pool.Wait   Wait time to get a connection from the main pool.
osgi.pool.Wait   Wait time to get a connection from the osgi pool.
shiro.pool.Wait  Wait time to get a connection from the shiro pool.
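The pool gauges can be combined into a simple utilization figure, for example to spot a pool that is close to exhaustion. The sketch below assumes the latest gauge values for a node have already been gathered into a dictionary keyed by metric name; the function name and the sample values are hypothetical.

```python
def pool_utilization(gauges, pool):
    """Return active connections as a fraction of total connections.

    gauges: mapping of metric name to its latest gauge value.
    pool: one of "main", "osgi", or "shiro".
    """
    total = gauges[f"{pool}.pool.TotalConnections"]
    active = gauges[f"{pool}.pool.ActiveConnections"]
    # An empty pool has no utilization to report.
    return active / total if total else 0.0


# Illustrative numbers only, not real measurements.
sample_gauges = {
    "main.pool.TotalConnections": 20,
    "main.pool.ActiveConnections": 15,
    "main.pool.IdleConnections": 5,
}
print(pool_utilization(sample_gauges, "main"))
```

A utilization that stays high alongside rising values for the matching Wait timer would suggest that callers are queuing for connections.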