Document Properties
Kbid: 2L8180
Last Modified: 22-Jun-2020
Added to KB: 17-Feb-2017
Public Access: Everyone
Status: Online
Doc Type: Guidelines, Concepts & Cookbooks
Product:
  • IOM 2.2
  • IOM 2.9

Guide - IOM Watchdog 2.2 - 2.11

1 Introduction

This document describes the general concept and usage of the IOM Watchdog.

Watchdog is the IOM tool that monitors and manages (stop/start/restart) the availability of an IOM application server, which is a required characteristic of high-availability systems. Each application server requires its own Watchdog. To fulfill its tasks, Watchdog sends health check requests to its application server and is controlled by several configurable properties.

This guide is mainly addressed to operations staff and developers.

1.1 Glossary

Wording | Description
IOM | The abbreviation for Intershop Order Management
OMS | The abbreviation for Order Management System, the technical name of IOM

1.2 References

2 Installation Properties

$OMS_ETC/installation.properties provides the property WATCHDOG_JAVA_OPTS. This property controls options to be passed to the Java process running the watchdog. Also see Guide - Intershop Order Management - Installation.

3 Watchdog Properties

Watchdog offers two types of properties, both of which are configured in watchdog.properties:

  • watchdog-specific properties to configure the watchdog
  • standard properties to configure the logging

The following table lists watchdog-specific properties only:

Property | Default | Description
watchdog.process.cmd |  | Command line of the process to be started/watched. It is called as an argument to $SHELL. Additional command line parameters are not supported.
watchdog.restart.delay | 30 | If the started/watched process is no longer ready or healthy, the restart of the process is delayed by watchdog.restart.delay * min(<number of failed tries>, watchdog.max.delay.factor) seconds. <number of failed tries> is reset to 0 when the started/watched process becomes healthy again.
watchdog.max.delay.factor | 10 | See watchdog.restart.delay. With watchdog.max.delay.factor set to 10 and watchdog.restart.delay set to 30, the maximum delay between restarts of the IOM application server is 300 seconds (5 minutes).
watchdog.start.timeout | 300 | Number of seconds to wait for the started process to become ready. If the start timeout is reached and the process is still not ready, the process is killed and restarted.
watchdog.stop.timeout | 20 | Number of seconds to wait for the process to stop before trying again.
watchdog.healthcheck.url | - | URL to be requested to check the health of the started/watched process. See Concept - IOM Server Health Check and IOM REST API - Get Server Health Status for the URL of IOM.
watchdog.cycle | 10 | Number of seconds between two health checks and updates of the server's state.
watchdog.healthcheck.timeout | 30 | Number of seconds before failed health checks lead to a restart of the watched process. watchdog.healthcheck.timeout has to be larger than watchdog.cycle.
watchdog.healthcheck.connect.timeout | 5 | Number of seconds before giving up when connecting/receiving the health check.
watchdog.healthcheck.read.timeout | 5 | See watchdog.healthcheck.connect.timeout.
watchdog.failover.enabled | false | If set to true, Watchdog connects to the database to ensure that only the active server is started.
watchdog.failover.serverid | Content of environment variable SERVER_ID (provided by set_env.sh, read from property SERVER_ID of file installation.properties) | SERVER_ID is used to identify the app-server within the database.
watchdog.failover.db.hostlist | Content of environment variable PGHOSTLIST (provided by set_env.sh, read from property is.oms.db.hostlist of file cluster.properties) | The host list is required to connect to the database in order to read/update failover information.
watchdog.failover.db.name | Content of environment variable PGDATABASE (provided by set_env.sh, read from property is.oms.db.name of file cluster.properties) | The database name is required to connect to the database in order to read/update failover information.
watchdog.failover.db.user | Content of environment variable PGUSER (provided by set_env.sh, read from property is.oms.db.user of file cluster.properties) | The database user is required to connect to the database in order to read/update failover information.
watchdog.failover.db.pass | Content of environment variable PGPASSWORD (provided by set_env.sh, read from property is.oms.db.pass of file cluster.properties) | The database password is required to connect to the database in order to read/update failover information.
watchdog.failover.dbconnect.timeout | 10 | Number of seconds before a connect timeout is detected.
watchdog.failover.timeout | 20 | Number of seconds before a registration as active server times out. watchdog.failover.timeout has to be larger than watchdog.cycle.
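The restart delay formula and the two "has to be larger than watchdog.cycle" rules from the table can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class WatchdogTimings and its methods are hypothetical names, and how the real Watchdog parses or validates its properties is not described in this guide. The default values are the ones listed in the table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Illustrative helper, not part of IOM; class and method names are hypothetical. */
final class WatchdogTimings {

    private WatchdogTimings() {}

    /** watchdog.restart.delay * min(failed tries, watchdog.max.delay.factor), in seconds. */
    static long restartDelaySeconds(long restartDelay, long maxDelayFactor, long failedTries) {
        return restartDelay * Math.min(failedTries, maxDelayFactor);
    }

    /** Returns one message per timeout that is not larger than watchdog.cycle. */
    static List<String> violations(Properties props) {
        long cycle = seconds(props, "watchdog.cycle", 10);
        // Property name and documented default of each timeout that must exceed the cycle.
        String[][] rules = {
            {"watchdog.healthcheck.timeout", "30"},
            {"watchdog.failover.timeout", "20"},
        };
        List<String> problems = new ArrayList<>();
        for (String[] rule : rules) {
            long value = seconds(props, rule[0], Long.parseLong(rule[1]));
            if (value <= cycle) {
                problems.add(rule[0] + " (" + value + ") has to be larger than watchdog.cycle (" + cycle + ")");
            }
        }
        return problems;
    }

    /** Reads a property as seconds, falling back to the documented default. */
    private static long seconds(Properties props, String key, long defaultValue) {
        String raw = props.getProperty(key);
        return raw == null ? defaultValue : Long.parseLong(raw.trim());
    }
}
```

With the defaults, restartDelaySeconds(30, 10, 3) yields 90 seconds, and any number of failed tries from 10 upwards yields the 300-second maximum mentioned for watchdog.max.delay.factor.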

Watchdog uses java.util.logging.Logger to write messages.

The logging framework provides the whole infrastructure to control what is logged and where. A basic configuration is part of watchdog.properties; customers can extend it using the standard java.util.logging mechanisms.

Exemplary Logging Configuration
#-------------------------------------------------------------------------------
# Logging configuration
#-------------------------------------------------------------------------------
# The following classes and levels exist and can be configured:
# classes:
#   com.intershop.oms.watchdog.FailoverDatabase
#   com.intershop.oms.watchdog.HealthCheckRequest
#   com.intershop.oms.watchdog.Watchdog
#   com.intershop.oms.watchdog.WatchdogConfig
#   com.intershop.oms.watchdog.WatchedProcess
# levels:
#   FINE - detailed tracing information
#   INFO - information about configuration and state changes
#   WARNING  - information about handled exceptions
#   SEVERE - information about unhandled exceptions
.level = INFO
handlers = java.util.logging.FileHandler, java.util.logging.ConsoleHandler
java.util.logging.SimpleFormatter.format = [%1$td/%1$tm/%1$tY %1$tH:%1$tM:%1$tS %1$tz] %4$s: "%2$s:" "%5$s" %6$s%n
java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter
java.util.logging.ConsoleHandler.level = WARNING
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter
java.util.logging.FileHandler.pattern = watchdog.log
java.util.logging.FileHandler.level = INFO
java.util.logging.FileHandler.append = true
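To see what a log line produced by this format pattern looks like, the pattern can be applied directly with String.format. java.util.logging.SimpleFormatter documents that it passes the arguments date, source, logger, level, message, and thrown, in that order. The sketch below assumes exactly that argument order; the class LogLineFormatDemo and the sample values are hypothetical, and the constant FORMAT is equivalent to the pattern in the configuration above.

```java
import java.time.ZonedDateTime;
import java.util.Locale;

/** Illustrative only: renders one line with the SimpleFormatter pattern from watchdog.properties. */
final class LogLineFormatDemo {

    // Pattern from the logging configuration, escaped for a Java string literal.
    static final String FORMAT =
        "[%1$td/%1$tm/%1$tY %1$tH:%1$tM:%1$tS %1$tz] %4$s: \"%2$s:\" \"%5$s\" %6$s%n";

    private LogLineFormatDemo() {}

    /** Arguments follow the order documented for SimpleFormatter: date, source, logger, level, message, thrown. */
    static String render(ZonedDateTime time, String source, String logger,
                         String level, String message, String thrown) {
        return String.format(Locale.ROOT, FORMAT, time, source, logger, level, message, thrown);
    }
}
```

For a timestamp of 22 June 2020, 10:15:30 at UTC+2, the source "com.intershop.oms.watchdog.Watchdog run", level INFO, the message "started", and no exception text, the rendered line is `[22/06/2020 10:15:30 +0200] INFO: "com.intershop.oms.watchdog.Watchdog run:" "started"` followed by a space and a line break.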

4 Failover

Failover guarantees that only one application server is running. All other application servers are stopped and wait to take over the role of the active server. Since Watchdog is responsible for starting/stopping the application server, it is also responsible for controlling the active-state of the application server. All watchdogs controlling backend servers within an IOM cluster have to communicate with each other in order to negotiate which of them runs the active application server. Also see Guide - Intershop Order Management - Technical Overview, section High Availability.

The required communication uses the IOM database. There are only two operations that the watchdog has to implement:

4.1 BecomeActiveServer

Before Watchdog is allowed to start the application server (watched process), it has to ensure that the server is the active one. BecomeActiveServer sets the current server (property watchdog.failover.serverid) to active-state if all preconditions are satisfied.

If no server is currently active, the current server is made the active one. If the current server is already active, its state is refreshed. Exceptions must be handled internally, since Watchdog needs to keep working even if no database is available.

Returns true only if the current server became active or the active-state of the current server was refreshed.

BecomeActiveServer has to ensure that parallel execution by different watchdogs returns true to only one caller.

4.2 RefreshActiveServer

If Watchdog has already started the application server (the server became active before), its active-state has to be refreshed before it times out. RefreshActiveServer must not refresh the active-state if it has already timed out. Exceptions have to be handled internally. Inconsistencies in the data have to be handled too: if another server is also marked active, the own server must not be refreshed.

Returns true only if the active-state of the server was refreshed.
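The contract of the two operations can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the IOM implementation: the real watchdogs coordinate through the IOM database, whose schema and queries are not described here. The class ActiveServerState is hypothetical, a synchronized method stands in for whatever the database uses to serialize parallel callers, and a registration that has timed out is treated as "no server is currently active".

```java
import java.util.function.LongSupplier;

/** In-memory stand-in for the failover state that IOM keeps in its database (hypothetical). */
final class ActiveServerState {

    private final long timeoutMillis;   // corresponds to watchdog.failover.timeout
    private final LongSupplier clock;   // injected so the behaviour can be tested
    private String activeServerId;      // null: no server registered as active
    private long lastRefreshMillis;

    ActiveServerState(long timeoutSeconds, LongSupplier clockMillis) {
        this.timeoutMillis = timeoutSeconds * 1000L;
        this.clock = clockMillis;
    }

    /** True only if serverId became active or its active-state was refreshed. */
    synchronized boolean becomeActiveServer(String serverId) {
        // Assumption: a timed-out registration counts as "no active server".
        boolean free = activeServerId == null || isTimedOut();
        if (free || activeServerId.equals(serverId)) {
            activeServerId = serverId;
            lastRefreshMillis = clock.getAsLong();
            return true;
        }
        return false;
    }

    /** True only if serverId is the active server and its registration has not timed out. */
    synchronized boolean refreshActiveServer(String serverId) {
        if (!serverId.equals(activeServerId) || isTimedOut()) {
            return false;
        }
        lastRefreshMillis = clock.getAsLong();
        return true;
    }

    private boolean isTimedOut() {
        return clock.getAsLong() - lastRefreshMillis >= timeoutMillis;
    }
}
```

Because the model holds a single active slot, the inconsistency of two servers being marked active at once cannot occur in it. A database-backed implementation has to check for that case explicitly, as described above.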

5 Watchdog Process Watch

The following diagram shows the main aspects of how Watchdog controls the watched process.


[Figure: Watchdog processing]

6 Process Architecture

The integration of Watchdog into IOM requires some additional processes.

Watchdog is started by systemd. Systemd gets all the required information from oms-watchdog.service, which has to be created from $OMS_ETC/oms-watchdog.service.template during installation. It has to be updated whenever the settings of OMS_HOME or OMS_ETC have changed.

Watchdog.sh has a built-in mechanism to work without OMS_HOME and OMS_ETC, too. If these variables are not set, watchdog.sh looks up set_env.sh relative to its own location. Watchdog.sh itself starts the Java-based watchdog program.

The Java-based watchdog is the real watchdog program, starting/ stopping the application server and checking health and readiness.

Watchdog.sh is also responsible for passing additional options to the Java-based watchdog. These options can be defined in the property WATCHDOG_JAVA_OPTS in $OMS_ETC/installation.properties.

Also see Guide - Setup Intershop Order Management 2.2.

6.1 Managing Sub-processes

The Java-based watchdog starts standalone.sh and has to stop it on failure. However, stopping it kills only standalone.sh itself; the Java process started by standalone.sh remains alive. To overcome this problem (without modifying standalone.sh or accessing the internal Java class UNIXProcess), an additional program is needed, which starts standalone.sh and kills all sub-processes. This additional program is named standalone_starter.sh. Standalone_starter.sh only starts $JBOSS_HOME/bin/standalone.sh, but additionally provides a handler that is executed on exit and kills the sub-processes.

[Figure: Watchdog processes]
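The underlying problem, that terminating a parent process does not terminate its children, can also be illustrated in Java. The following sketch is not how IOM solves it: the documented solution is the shell wrapper standalone_starter.sh. It is a swapped-in illustration that assumes Java 9 or later, where the standard ProcessHandle API can enumerate descendants without touching internal classes. The class ProcessTree is a hypothetical name.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative only (Java 9+): requests termination of a process and all of its descendants. */
final class ProcessTree {

    private ProcessTree() {}

    /** Returns the number of processes for which termination was successfully requested. */
    static int destroyTree(ProcessHandle root) {
        // Take the snapshot first: once the parent has exited, its children are
        // typically re-parented and can no longer be found through the parent.
        List<ProcessHandle> descendants = root.descendants().collect(Collectors.toList());
        int requested = 0;
        for (ProcessHandle child : descendants) {
            if (child.destroy()) {
                requested++;
            }
        }
        if (root.destroy()) {
            requested++;
        }
        return requested;
    }
}
```

A caller holding a java.lang.Process would pass process.toHandle(). Note that destroy() only requests termination; waiting for the processes to actually exit is left to the caller.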

Disclaimer

The information provided in the Knowledge Base may not be applicable to all systems and situations. Intershop Communications will not be liable to any party for any direct or indirect damages resulting from the use of the Customer Support section of the Intershop Corporate Web site, including, without limitation, any lost profits, business interruption, loss of programs or other data on your information handling system.
