Overview

Best practices

Kill Bill is fundamentally a backend system, so the following considerations should apply:

  • Don’t expose the system to the outside world. Instead, put a front-end system or a reverse proxy in front of it (e.g. if you need to handle gateway notifications like PayPal IPN, i.e. Instant Payment Notification).

  • Always deploy at least 2 instances (for reliability purposes) behind a load balancer.

  • Set up your database with a master and a slave instance, and configure it to take regular snapshots.

  • Aggregate your logs across instances (for example using LogStash or GrayLog2).

  • Look at all the existing Kill Bill JMX metrics (using VisualVM, jconsole, …) and set some alerts (for instance by using the following script).

  • Configure your system properly, especially when it comes to the settings of the bus and notification queue.

Pre- and Post-deployment checklist

  • Test, test, test: we work hard to make the core of the platform solid. But your deployment, with your own combination of plugins and configuration (catalog, overdue, etc.) is unique. Write business-level regression tests that you can run before each upgrade.

  • Set the timezone of your servers and databases to UTC, and make sure NTP is properly configured to avoid any drift between instances.

  • Consider encrypting sensitive configuration properties, like database passwords, with Jasypt. See http://www.jasypt.org/cli.html for details on how to generate encrypted values. Example configuration:

-Dorg.killbill.server.enableJasypt=true
-Djasypt.encryptor.password=myTopSecretPassword
-Djasypt.encryptor.algorithm=PBEWithSHA1AndDESede
-Dorg.killbill.billing.osgi.dao.password=ENC(+PeDGcp3DTonUpB3lPuoegack9kb0hJi)

  • Make sure the servers have enough entropy: /proc/sys/kernel/random/entropy_avail should be > 3k (otherwise install haveged / rng-tools). Kill Bill should also be started with -Djava.security.egd=file:/dev/./urandom.

  • Adjust org.killbill.security.shiroNbHashIterations as needed. This setting configures the number of iterations run to hash API secrets and user passwords. The default value is high for security reasons, which can have a significant performance impact, so it can be lowered if required (e.g. for Docker: -e KILLBILL_SECURITY_SHIRO_NB_HASH_ITERATIONS=1). Note that changing the value requires manually re-hashing all tenant secrets and user passwords.

  • Make sure your database and queue configuration is adequate: the bus_events table should almost always be empty and the notifications table should never have any AVAILABLE entry with an effective date in the past. In either case, the system is running late (invoices not generated, etc.). These two metrics should always be monitored in production (potentially as a paging event).

  • Verify the integration with your payment gateway(s): very few payment transactions (if any) should be in an UNKNOWN state. Make sure to fix these manually via the Payment Admin API, if the plugin is unable to do it automatically.

  • Have a monitoring system in place (we recommend Elasticsearch, Logstash, Kibana, InfluxDB and Grafana, which can be easily set up for Kill Bill) and watch your logs constantly: any WARN or ERROR entry should be reviewed, as well as stack traces.

  • Monitor metrics at /1.0/metrics and integrate the healthcheck at /1.0/healthcheck with your load balancer.

  • Join our mailing-list to get notified of new releases or ask questions. For inquiries regarding your specific setup, always attach a KPM diagnostic output:

kpm diagnostic --killbill-api-credentials=bob lazar \
               --killbill-credentials=admin password \
               --killbill-url=http://127.0.0.1:8080 \
               --killbill-web-path=/var/lib/tomcat7/webapps/ROOT/ \
               --kaui-web-path=/var/lib/tomcat7/webapps/ROOT/ \
               --log-dir=/var/lib/tomcat7/logs \
               --account_export=ACCOUNT_ID
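
The two queue conditions from the checklist above (an almost-empty bus_events table, and no AVAILABLE notification with an effective date in the past) lend themselves to a periodic check. The following is a minimal Python sketch, not an official tool. The function name queue_backlog is hypothetical, the column names processing_state and effective_date are assumptions to verify against your schema, and an in-memory SQLite database stands in for the real one (other drivers typically use %s rather than ? as the placeholder).

```python
import sqlite3
from datetime import datetime, timezone


def queue_backlog(conn, now=None):
    """Return (rows in bus_events, AVAILABLE notifications that are overdue)."""
    now = now or datetime.now(timezone.utc)
    now_str = now.strftime("%Y-%m-%d %H:%M:%S")
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM bus_events")
    bus_pending = cur.fetchone()[0]
    # Assumed columns: processing_state and effective_date (UTC timestamps)
    cur.execute(
        "SELECT COUNT(*) FROM notifications "
        "WHERE processing_state = 'AVAILABLE' AND effective_date < ?",
        (now_str,),
    )
    overdue = cur.fetchone()[0]
    return bus_pending, overdue


# Stand-in tables so the sketch runs on its own (not the real Kill Bill DDL)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bus_events (record_id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE notifications ("
    "record_id INTEGER PRIMARY KEY, processing_state TEXT, effective_date TEXT)"
)
conn.execute("INSERT INTO bus_events DEFAULT VALUES")
conn.executemany(
    "INSERT INTO notifications (processing_state, effective_date) VALUES (?, ?)",
    [
        ("AVAILABLE", "2020-01-01 00:00:00"),  # overdue
        ("AVAILABLE", "2999-01-01 00:00:00"),  # scheduled in the future
        ("PROCESSED", "2020-01-01 00:00:00"),  # already handled
    ],
)

bus_pending, overdue = queue_backlog(conn, datetime(2024, 1, 1, tzinfo=timezone.utc))
print(bus_pending, overdue)
```

A non-zero second value means notifications are late. A first value that stays above zero across several runs suggests the bus is not keeping up.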

Deployment options

When deploying Kill Bill, the following pieces will need to be deployed in addition to your OS, VM or container (Docker, …​):

Because a lot of pieces need to be in place, and because a lot can go wrong, users are strongly encouraged to use our pre-built Docker images and Docker Compose recipes. We also maintain recipes to deploy the open-source Elastic stack and the open-source InfluxData stack integrated with Kill Bill.

Experienced users can also opt for our Ansible playbooks (Ansible is an open-source IT Automation tool), to install and configure Apache Tomcat, KPM and Kill Bill individually.

Note that both of these deployment options rely on KPM, the Kill Bill Package Manager. KPM can fetch existing (signed) artifacts for the main killbill war and for each of the plugins, and deploy them at the right place. Using KPM directly gives you the most flexibility without having to re-invent the wheel, but significantly increases the deployment complexity, so direct use is recommended for our most advanced users only.

Docker

Make sure to get familiar with Docker first, before attempting to install Kill Bill. The project has lots of great docs to help you get set up.

Documentation for our images is available here. Our blog also has lots of tips on how to deploy to popular cloud providers. Our Docker Compose recipes work with all Docker Machine supported platforms.

Container configuration can be done by bind mounting a custom killbill.properties file (for configuration) and/or a custom kpm.yml file (to specify plugins to install): -v /path/to/killbill.properties:/var/lib/killbill/killbill.properties -v /path/to/kpm.yml:/var/lib/killbill/kpm.yml. Note that on macOS, /path/to must be under /Users.
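
As an illustration of the two bind mounts, here is a small Python sketch that assembles the corresponding docker run command line. The helper name docker_run_command and the image name killbill/killbill (given without a tag) are assumptions for the example. Substitute the image and tag you actually deploy.

```python
import shlex


def docker_run_command(properties_path, kpm_yml_path, image="killbill/killbill"):
    """Build a `docker run` command that bind mounts both configuration files."""
    args = [
        "docker", "run",
        # Kill Bill configuration
        "-v", f"{properties_path}:/var/lib/killbill/killbill.properties",
        # Plugins to install through KPM
        "-v", f"{kpm_yml_path}:/var/lib/killbill/kpm.yml",
        image,  # assumed image name, adjust as needed
    ]
    return shlex.join(args)


print(docker_run_command("/Users/me/killbill.properties", "/Users/me/kpm.yml"))
```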

Database engine

By default, Kill Bill expects to run against a MySQL or MariaDB database. Our DDL is very simple by design though, so it is very easy to adapt it for other RDBMS.

Specifically, for PostgreSQL, you just need to install the ddl-postgresql bridge, before installing the main DDL.

The Kill Bill core team runs regression tests against both MariaDB 10.3 and PostgreSQL 10, but users have successfully deployed Kill Bill with Oracle MySQL, Percona, Aurora, etc.

Failover

We’ve tested various failover scenarios (Aurora RDS, master/slave MariaDB Docker setup and master/slave Percona Server on real hardware) and confirmed that Kill Bill behaves as expected, i.e. queries in-flight will fail during a failover, but reconnection is automatic.

Specifically for Aurora though, we did notice that:

  • Reconnection is read-only by default after the failover. jdbc:mysql:aurora: must be specified in the JDBC URL for the reconnection to be read-write.

  • Triggering a failover in the RDS UI leads to a pretty short Kill Bill downtime (a few seconds). Terminating the master though ("delete instance") takes a bit longer (a few minutes). This could be mitigated with more aggressive timeouts in the JDBC pool.
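
To make the Aurora URL requirement concrete, the sketch below rewrites a plain jdbc:mysql: URL into the jdbc:mysql:aurora: form mentioned above. The helper name to_aurora_jdbc_url and the host name in the example are hypothetical.

```python
def to_aurora_jdbc_url(url):
    """Return the URL with the jdbc:mysql:aurora: prefix; other URLs are left untouched."""
    plain, aurora = "jdbc:mysql://", "jdbc:mysql:aurora://"
    if url.startswith(aurora):
        return url  # already in the Aurora form
    if url.startswith(plain):
        return aurora + url[len(plain):]
    return url  # not a MySQL URL, leave as-is


print(to_aurora_jdbc_url("jdbc:mysql://my-cluster.example.com:3306/killbill"))
```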

Bus and Notification queues

Bus events

The notifications across Kill Bill core services rely on a proprietary bus event. There are actually 2 buses: the main bus, which is used by core services, and an external bus, which is used by plugins. The main reason for having 2 buses is that the main bus is critical for internal operations to work, so we want to prevent plugin code, which may interact with third-party systems, from blocking on long operations and impacting the rest of the system.

There are 2 sets of two tables to manage those bus events:

  • For the main bus, a bus_events and a bus_events_history table.

  • For the external bus, a bus_ext_events and a bus_ext_events_history table.

Events are moved from the bus_events table to the bus_events_history table as they are processed. This keeps a history of what happened in the system and prevents the bus_events table from growing too much. The bus_events_history table is only there for debugging and is never used by the system.

Bus Event Modes

The bus event can be run in multiple modes (instanceName below is either main or external):

  • POLLING: the bus will poll the database for new available entries and dispatch them across the nodes.

  • STICKY_POLLING: the bus will poll the database for new available entries and dispatch them to the same node that created the entry.

  • STICKY_EVENTS (default mode): in this mode, the bus behaves as a blocking queue where entries are dispatched as soon as they have been committed to disk. This is a much more efficient mechanism both in terms of latency (because entries are picked up right away) and throughput (because there is no time for entries to accumulate).
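
The ownership difference between the modes can be pictured with a small model. This is a conceptual Python sketch only, not how Kill Bill implements its queues: BusEntry and eligible_entries are hypothetical names, and the sketch captures only which node may pick up an entry, not the latency difference between polling and STICKY_EVENTS.

```python
from dataclasses import dataclass


@dataclass
class BusEntry:
    record_id: int
    creating_owner: str  # node that created the entry


def eligible_entries(entries, node, mode):
    """Return the entries that `node` may pick up in the given mode."""
    if mode == "POLLING":
        # Any node may process any available entry
        return list(entries)
    if mode in ("STICKY_POLLING", "STICKY_EVENTS"):
        # Entries stay with the node that created them
        return [e for e in entries if e.creating_owner == node]
    raise ValueError(f"unknown mode: {mode}")


entries = [BusEntry(1, "node-a"), BusEntry(2, "node-b"), BusEntry(3, "node-a")]
print([e.record_id for e in eligible_entries(entries, "node-a", "POLLING")])
print([e.record_id for e in eligible_entries(entries, "node-a", "STICKY_POLLING")])
```

In the sticky modes, entries created by node-b are never picked up by node-a, which is why a node that disappears needs special handling.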

In a cloud environment, where nodes are more prone to appear and disappear, the following choices are available:

  • Use the POLLING mode

  • Use the STICKY_EVENTS (or STICKY_POLLING) mode. In that scenario, you need to be cautious of Kill Bill instances restarting on a different node:

      • Each instance can be started with the system property org.killbill.queue.creator.name=<MY_VIRTUAL_INSTANCE_NAME>, which overrides the creating_owner value associated with each entry (by default, the hostname of the machine). With that property set, an instance that restarts on a different node with the same value will continue processing the same entries.

      • Alternatively, if failovers don’t occur too often, run a query to update the rows associated with the instance that failed over so they get picked up by another node. Events are never lost because they are persistent, but they may linger until updated. The query to update the rows is the following (only shown for the bus_events table, but a similar query needs to be run for notifications):

update bus_events set creating_owner='MY_NEW_NODE_HOSTNAME', processing_available_date=NULL, processing_state = 'AVAILABLE', processing_owner=NULL where creating_owner='MY_INSTANCE_NAME_THAT_FAILED';
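
If this reassignment is scripted rather than run by hand, binding the owner names as parameters avoids quoting mistakes. The following Python sketch runs the same UPDATE through a DB-API connection. The function name reassign_entries is hypothetical, and an in-memory SQLite table with only the columns used by the query stands in for the real bus_events table (other drivers typically use %s rather than ? as the placeholder).

```python
import sqlite3

REASSIGN_SQL = (
    "UPDATE bus_events SET creating_owner = ?, processing_available_date = NULL, "
    "processing_state = 'AVAILABLE', processing_owner = NULL "
    "WHERE creating_owner = ?"
)


def reassign_entries(conn, failed_owner, new_owner):
    """Hand the entries created by `failed_owner` over to `new_owner`; return the row count."""
    cur = conn.execute(REASSIGN_SQL, (new_owner, failed_owner))
    conn.commit()
    return cur.rowcount


# Stand-in table limited to the columns touched by the query
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bus_events (record_id INTEGER PRIMARY KEY, creating_owner TEXT, "
    "processing_owner TEXT, processing_available_date TEXT, processing_state TEXT)"
)
conn.executemany(
    "INSERT INTO bus_events (creating_owner, processing_owner, processing_state) "
    "VALUES (?, ?, ?)",
    [
        ("old-node", "old-node", "IN_PROCESSING"),
        ("old-node", None, "AVAILABLE"),
        ("other-node", None, "AVAILABLE"),
    ],
)

moved = reassign_entries(conn, "old-node", "new-node")
print(moved)
```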

Future Notifications

Overview

In addition to the bus events, which are dispatched immediately, Kill Bill also manages future notifications. The mechanism is very similar to the POLLING mode described earlier; the main difference is that those notifications are only dispatched once their effective_date has been reached. There is no STICKY_EVENTS mode for the future notifications.

The future notifications also rely on two tables: the notifications and notifications_history, and the mechanism to move processed entries is similar to what we described for the bus event.
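
The dispatch rule for future notifications can be summarized in a few lines. This is a conceptual Python model, not Kill Bill code: plain lists stand in for the notifications and notifications_history tables, and dispatch_due is a hypothetical name.

```python
from datetime import datetime, timezone


def dispatch_due(notifications, history, now):
    """Dispatch every notification whose effective_date has been reached.

    Dispatched entries are moved from `notifications` to `history`.
    """
    due = [n for n in notifications if n["effective_date"] <= now]
    for n in due:
        notifications.remove(n)
        history.append(n)
    return due


notifications = [
    {"id": 1, "effective_date": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "effective_date": datetime(2024, 6, 1, tzinfo=timezone.utc)},
]
history = []

dispatched = dispatch_due(notifications, history, datetime(2024, 3, 1, tzinfo=timezone.utc))
print([n["id"] for n in dispatched])
```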

Logging and GDPR

If you are using Tomcat, CATALINA_BASE/logs/catalina.out does not rotate. Make sure to make your main appender ch.qos.logback.core.rolling.RollingFileAppender instead of the default ch.qos.logback.core.ConsoleAppender (STDOUT/STDERR is redirected to CATALINA_BASE/logs/catalina.out).

Also make sure to install both the Felix Log bundle and the Kill Bill Log bundle in your platform directory (/var/tmp/bundles/platform by default), otherwise OSGi logs (including from JRuby plugins) will end up in STDOUT/STDERR (hence in CATALINA_BASE/logs/catalina.out). Both bundles are included in the defaultbundles package.

Mask PANs

Use the converter class org.killbill.billing.server.log.obfuscators.ObfuscatorConverter.

If you are passing PANs via plugin properties, make sure to disable query parameters logging in Tomcat. Use the following org.apache.catalina.valves.AccessLogValve pattern: %h %l %u %t "%m %U" %s %b %D.

Redirect plugin logs to a different file

<configuration debug="true">
    <appender name="MAIN" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <filter class="ch.qos.logback.core.filter.EvaluatorFilter">
            <evaluator name="loggingTaskEval">
                <expression>
                <![CDATA[
                    message!=null &&
                    message.contains("[cybersource-plugin]")
                ]]>
                </expression>
            </evaluator>
            <OnMatch>DENY</OnMatch>
        </filter>
        <file>${LOGS_DIR:-./logs}/killbill.out</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${LOGS_DIR:-./logs}/killbill-%d{yyyy-MM-dd}.%i.out.gz</fileNamePattern>
            <maxHistory>3</maxHistory>
            <cleanHistoryOnStart>true</cleanHistoryOnStart>
            <timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
                <maxFileSize>100MB</maxFileSize>
            </timeBasedFileNamingAndTriggeringPolicy>
        </rollingPolicy>
        <encoder>
            <pattern>%date [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <appender name="CYBERSOURCE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <filter class="ch.qos.logback.core.filter.EvaluatorFilter">
            <evaluator name="loggingTaskEval">
                <expression>
                <![CDATA[
                    message!=null &&
                    message.contains("[cybersource-plugin]")
                ]]>
                </expression>
            </evaluator>
            <OnMismatch>DENY</OnMismatch>
        </filter>
        <file>${LOGS_DIR:-./logs}/cybersource.out</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${LOGS_DIR:-./logs}/cybersource-%d{yyyy-MM-dd}.%i.out.gz</fileNamePattern>
            <maxHistory>3</maxHistory>
            <cleanHistoryOnStart>true</cleanHistoryOnStart>
            <timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
                <maxFileSize>100MB</maxFileSize>
            </timeBasedFileNamingAndTriggeringPolicy>
        </rollingPolicy>
        <encoder>
            <pattern>%date [%thread] %msg%n</pattern>
        </encoder>
    </appender>

    <root level="INFO">
       <appender-ref ref="MAIN" />
       <appender-ref ref="CYBERSOURCE" />
    </root>
</configuration>

Handling plugin logs

In order for plugin logs to be handled by the main logger, make sure to:

  • Install Apache Felix Log under /var/tmp/bundles/platform (provided in the default plugins package)

  • Install killbill-platform-osgi-bundles-logger under /var/tmp/bundles/platform (also provided in the default plugins package)

  • Add org.osgi.service.log to Import-Package in your MANIFEST.MF

  • Add the following dependencies in compile scope in your plugin:

<dependency>
    <groupId>org.kill-bill.billing</groupId>
    <artifactId>killbill-platform-osgi-bundles-lib-killbill</artifactId>
</dependency>
<dependency>
    <groupId>org.kill-bill.billing</groupId>
    <artifactId>killbill-platform-osgi-bundles-lib-slf4j-osgi</artifactId>
</dependency>

Reverse Proxy

We recommend setting up NGINX to forward external notifications to Kill Bill.

Here’s a working example for Adyen:

server {
  listen       443;
  server_name  killbill-public.acme.com;

  location /notifications/killbill-adyen {
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header X-Forwarded-Proto $scheme;

      proxy_set_header Authorization "Basic YWRtaW46cGFzc3dvcmQ=";
      proxy_set_header X-Killbill-ApiKey bob;
      proxy_set_header X-Killbill-ApiSecret lazar;
      proxy_set_header X-Killbill-CreatedBy Adyen;
      proxy_pass http://killbill-internal.acme.com:8080/1.0/kb/paymentGateways/notification/killbill-adyen;

      proxy_hide_header Set-Cookie;
      proxy_hide_header Access-Control-Allow-Origin;
      proxy_hide_header Access-Control-Allow-Methods;
      proxy_hide_header Access-Control-Allow-Headers;
      proxy_hide_header Access-Control-Expose-Headers;
      proxy_hide_header Access-Control-Allow-Credentials;
  }
}
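
The Authorization value in the example above is the standard HTTP Basic encoding of the Kill Bill user and password (admin and password in this example). A short Python sketch, using a hypothetical helper name, shows how to compute the value for your own credentials:

```python
import base64


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


print(basic_auth_header("admin", "password"))  # prints "Basic YWRtaW46cGFzc3dvcmQ="
```

Keep in mind that Base64 is an encoding, not encryption: anyone who can read the NGINX configuration can recover the credentials.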

Service Discovery with Eureka

For easier integration into a microservice architecture, Kill Bill supports client-side service discovery via a Eureka registry. A module (disabled by default) is provided that allows Kill Bill to register with a Eureka server.

To register as a Eureka client, first add the following dependency to your profile:

<dependency>
    <groupId>org.kill-bill.billing</groupId>
    <artifactId>killbill-platform-service-registry</artifactId>
</dependency>

Next, add the Eureka Guice module to the module list in your server module (i.e. KillbillServerModule.java):

 install(new EurekaModule(configSource));

Finally, add the Eureka client config properties to killbill.properties. For example, assuming a Eureka server is running on port 8761 and Kill Bill is on port 8080:

eureka.serviceUrl.default=http://localhost:8761/eureka

eureka.registration.enabled=true
eureka.name=killbill
eureka.port=8080
eureka.port.enabled=true
eureka.securePort.enabled=false

eureka.statusPageUrlPath=/1.0/metrics
eureka.healthCheckUrlPath=/1.0/healthcheck

eureka.decoderName=JacksonJson
eureka.preferSameZone=true
eureka.shouldUseDns=false

Enabling HTTPS

You first need to import your SSL certificate (see docs). For testing, you can just create a self-signed certificate. For example, on Ubuntu or our Docker images:

sudo apt-get update
sudo apt-get install ssl-cert
sudo usermod -a -G ssl-cert tomcat

Then, update Tomcat’s configuration (/var/lib/tomcat/conf/server.xml in our Docker images):

<Connector executor="tomcatThreadPool"
           port="8443"
           connectionTimeout="20000"
           acceptorThreadCount="2"
           SSLEnabled="true"
           SSLCertificateFile="/etc/ssl/certs/ssl-cert-snakeoil.pem"
           SSLCertificateKeyFile="/etc/ssl/private/ssl-cert-snakeoil.key"
           scheme="https"
           secure="true" />

Finally, make sure port 8443 is open (and exported from the Docker containers).

X-Forwarded headers support

When org.killbill.jaxrs.location.full.url=true, Kill Bill will build location headers using a full URL. In a typical load balancer scenario, where the load balancer receives traffic on port 8443 and forwards it to port 8080 on the Kill Bill instances (i.e. SSL is terminated at the load balancer), you probably want the headers to return something like https://killbill-vip.mycompany.net:8443 instead of http://10.1.2.3:8080.

To do so:

  1. Enable the RemoteIpValve in your Tomcat’s configuration (/var/lib/tomcat/conf/server.xml in our Docker images). Make sure to configure correctly at least the internalProxies and trustedProxies attributes depending on your environment, see the docs.

<Valve className="org.apache.catalina.valves.RemoteIpValve"
       protocolHeader="x-forwarded-proto"
       portHeader="x-forwarded-port" />

  2. Set org.killbill.jaxrs.location.host=killbill-vip.mycompany.net

Without any X-Forwarded header, the default Location header will look something like http://killbill-vip.mycompany.net:8080. With X-Forwarded-For: 10.0.0.0, X-Forwarded-Proto: https and X-Forwarded-Port: 8443, the header will become something like https://killbill-vip.mycompany.net:8443.

You may also want to set requestAttributesEnabled="true" on org.apache.catalina.valves.AccessLogValve, to log the IP address from the X-Forwarded-For header in the access logs.
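
The effect of the forwarded headers on the Location header can be modelled in a few lines. The Python sketch below only mirrors the behavior described in this section; it is not the Tomcat or Kill Bill implementation, and location_base is a hypothetical name.

```python
def location_base(host, headers, default_scheme="http", default_port=8080):
    """Compute the scheme://host:port prefix used in Location headers."""
    # HTTP header names are case-insensitive
    h = {name.lower(): value for name, value in headers.items()}
    scheme = h.get("x-forwarded-proto", default_scheme)
    port = int(h.get("x-forwarded-port", default_port))
    return f"{scheme}://{host}:{port}"


host = "killbill-vip.mycompany.net"
print(location_base(host, {}))
print(location_base(host, {
    "X-Forwarded-For": "10.0.0.0",
    "X-Forwarded-Proto": "https",
    "X-Forwarded-Port": "8443",
}))
```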

Nagios integration

To integrate JMX beans with Nagios, download the plugin from https://github.com/killbill/nagios-jmx-plugin:

# Whether the persistent bus is turned on (warns if off)
./check_jmx_ng -v -U service:jmx:rmi:///jndi/rmi://127.0.0.1:8989/jmxrmi -O org.killbill.bus.api:name=PersistentBus -A NotificationProcessingSuspended -w false
# Whether the notification queue is turned on (warns if off)
./check_jmx_ng -v -U service:jmx:rmi:///jndi/rmi://127.0.0.1:8989/jmxrmi -O org.killbill.notificationq.api:name=NotificationQueueService -A NotificationProcessingSuspended -w false
# Generic Kill Bill healthcheck, checks the overall state of the application (warns if unhealthy)
./check_jmx_ng -v -U service:jmx:rmi:///jndi/rmi://127.0.0.1:8989/jmxrmi -O org.killbill.billing.server.healthchecks:name=KillbillHealthcheck -A Healthy -w true
# Monitors the size of the notification queue. Warning and Critical alerts often mean an overload of the system
./check_jmx_ng -v -U service:jmx:rmi:///jndi/rmi://127.0.0.1:8989/jmxrmi -O metrics:name=org.killbill.notificationq.NotificationQueueDispatcher.pending-notifications -A Value -w 50 -c 100

Other interesting metrics (use the -P flag to get Nagios performance data):

./check_jmx_ng -v -U service:jmx:rmi:///jndi/rmi://127.0.0.1:8989/jmxrmi -P -O 'java.lang:type=ClassLoading' -A LoadedClassCount
./check_jmx_ng -v -U service:jmx:rmi:///jndi/rmi://127.0.0.1:8989/jmxrmi -P -O 'java.lang:type=Compilation' -A TotalCompilationTime
./check_jmx_ng -v -U service:jmx:rmi:///jndi/rmi://127.0.0.1:8989/jmxrmi -P -O 'java.lang:type=OperatingSystem' -A SystemCpuLoad
./check_jmx_ng -v -U service:jmx:rmi:///jndi/rmi://127.0.0.1:8989/jmxrmi -P -O 'java.lang:type=Runtime' -A Uptime
./check_jmx_ng -v -U service:jmx:rmi:///jndi/rmi://127.0.0.1:8989/jmxrmi -P -O 'java.lang:type=Threading' -A ThreadCount
./check_jmx_ng -v -U service:jmx:rmi:///jndi/rmi://127.0.0.1:8989/jmxrmi -P -O 'java.nio:type=BufferPool,name=direct' -A MemoryUsed
./check_jmx_ng -v -U service:jmx:rmi:///jndi/rmi://127.0.0.1:8989/jmxrmi -P -O 'java.nio:type=BufferPool,name=mapped' -A MemoryUsed
./check_jmx_ng -v -U service:jmx:rmi:///jndi/rmi://127.0.0.1:8989/jmxrmi -P -O 'metrics:name=org.killbill.bus.dao.PersistentBusSqlDao.getReadyEntries' -A 95thPercentile
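
For reference, the -w 50 -c 100 thresholds used for the pending-notifications check map onto the standard Nagios exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL). The Python sketch below illustrates that mapping, assuming the usual Nagios convention that a value strictly above a threshold triggers the alert. The exact comparison performed by check_jmx_ng may differ, and nagios_status is a hypothetical name.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def nagios_status(value, warn=50, crit=100):
    """Map a metric value to a Nagios exit code (assumes 'above threshold' alerts)."""
    if value > crit:
        return CRITICAL
    if value > warn:
        return WARNING
    return OK


for pending in (10, 75, 250):
    print(pending, nagios_status(pending))
```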