This document describes the general concept and usage of the IOM Watchdog.
Watchdog is an IOM tool that monitors and manages (stops, starts, restarts) an IOM application server in order to keep it available, which is a required characteristic of high availability systems. Each application server has to run its own Watchdog. To fulfill its tasks, Watchdog sends health check requests to its application server and is controlled by several configurable properties.
This guide is mainly addressed to operations staff and developers.
Wording | Description |
---|---|
IOM | The abbreviation for Intershop Order Management |
OMS | The abbreviation for Order Management System, the technical name of IOM |
$OMS_ETC/installation.properties provides the property WATCHDOG_JAVA_OPTS. This property controls options to be passed to the Java process running the watchdog. Also see Guide - Intershop Order Management - Installation.
Watchdog offers two types of properties, which can be configured in watchdog.properties: watchdog-specific properties and logging properties.
The following table lists watchdog-specific properties only:
Property | Default | Description |
---|---|---|
watchdog.process.cmd | | Command line of the process to be started/ watched. Will be called as argument to $SHELL. |
watchdog.restart.delay | 30 | If the started/ watched process is no longer ready or healthy, the restart of the process is delayed by watchdog.restart.delay * min(<number of failed tries>, watchdog.max.delay.factor) seconds. <number of failed tries> is reset to 0 if the watched/ started process becomes healthy again. |
watchdog.max.delay.factor | 10 | See watchdog.restart.delay. Setting watchdog.max.delay.factor to 10 leads to a maximum delay between restarts of the IOM application server of 300 seconds (5 minutes) if watchdog.restart.delay is set to 30. |
watchdog.start.timeout | 300 | Number of seconds to wait for the started process to become ready. If the start timeout is reached and the process is still not ready, the process is killed and restarted. |
watchdog.stop.timeout | 20 | Number of seconds to wait for process to stop before trying again. |
watchdog.healthcheck.url | - | URL to be requested to check health of started/ watched process. See Concept - IOM Server Health Check and IOM REST API - Get Server Health Status for URL of IOM. |
watchdog.cycle | 10 | Number of seconds between two health checks and updates of the server's state. |
watchdog.healthcheck.timeout | 30 | Number of seconds before failed health checks lead to a restart of the watched process. watchdog.healthcheck.timeout has to be larger than watchdog.cycle. |
watchdog.healthcheck.connect.timeout | 5 | Number of seconds before giving up when connecting to the health check URL. |
watchdog.healthcheck.read.timeout | 5 | Number of seconds before giving up when receiving the health check response. Also see watchdog.healthcheck.connect.timeout. |
watchdog.failover.enabled | false | If set to true, watchdog connects to the database to ensure that only the active server is started. |
watchdog.failover.serverid | Content of environment variable SERVER_ID (provided by set_env.sh, read from property SERVER_ID of file installation.properties) | SERVER_ID is used to identify the app-server within the database. |
watchdog.failover.db.hostlist | Content of environment variable PGHOSTLIST (provided by set_env.sh, read from property is.oms.db.hostlist of file cluster.properties) | The host list is required to connect to the database in order to read/update failover information. |
watchdog.failover.db.name | Content of environment variable PGDATABASE (provided by set_env.sh, read from property is.oms.db.name of file cluster.properties) | The database name is required to connect to the database in order to read/update failover information. |
watchdog.failover.db.user | Content of environment variable PGUSER (provided by set_env.sh, read from property is.oms.db.user of file cluster.properties) | The database user is required to connect to the database in order to read/update failover information. |
watchdog.failover.db.pass | Content of environment variable PGPASSWORD (provided by set_env.sh, read from property is.oms.db.pass of file cluster.properties) | The database password is required to connect to the database in order to read/update failover information. |
watchdog.failover.dbconnect.timeout | 10 | Number of seconds before connect timeout is detected. |
watchdog.failover.timeout | 20 | Number of seconds before a registration as active server times out. watchdog.failover.timeout has to be larger than watchdog.cycle. |
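The back-off rule described for watchdog.restart.delay and watchdog.max.delay.factor, and the requirement that both timeouts are larger than watchdog.cycle, can be sketched as follows. This is only a minimal illustration of the arithmetic from the table, not the watchdog's actual code; the class and method names (RestartDelaySketch, restartDelaySeconds, isConsistent) are hypothetical.

```java
/** Illustrative only: class and method names are assumptions, not part of IOM. */
final class RestartDelaySketch {

    /** watchdog.restart.delay * min(failedTries, watchdog.max.delay.factor), in seconds. */
    static long restartDelaySeconds(long restartDelay, long maxDelayFactor, long failedTries) {
        return restartDelay * Math.min(failedTries, maxDelayFactor);
    }

    /** Both timeouts have to be larger than watchdog.cycle. */
    static boolean isConsistent(long cycle, long healthcheckTimeout, long failoverTimeout) {
        return healthcheckTimeout > cycle && failoverTimeout > cycle;
    }

    public static void main(String[] args) {
        // Defaults from the table: restart.delay = 30, max.delay.factor = 10.
        for (long failedTries : new long[] {1, 2, 5, 10, 25}) {
            System.out.println(failedTries + " failed tries -> "
                    + restartDelaySeconds(30, 10, failedTries) + " s");
        }
        // Defaults from the table: cycle = 10, healthcheck.timeout = 30, failover.timeout = 20.
        System.out.println("defaults consistent: " + isConsistent(10, 30, 20));
    }
}
```

With the default values, the delay grows by 30 seconds per failed try and is capped at 300 seconds from the tenth failed try on.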
Watchdog uses java.util.logging.Logger to write messages.
The logger system provides the whole infrastructure to control what and where to log. A basic configuration is part of watchdog.properties, but can be extended by customers, using standard mechanisms.

    #-------------------------------------------------------------------------------
    # Logging configuration
    #-------------------------------------------------------------------------------
    # following classes and levels exist and can be configured:
    # classes:
    #   com.intershop.oms.watchdog.FailoverDatabase
    #   com.intershop.oms.watchdog.HealthCheckRequest
    #   com.intershop.oms.watchdog.Watchdog
    #   com.intershop.oms.watchdog.WatchdogConfig
    #   com.intershop.oms.watchdog.WatchedProcess
    # levels:
    #   FINE    - detailed tracing information
    #   INFO    - information about configuration and state changes
    #   WARNING - information about handled exceptions
    #   SEVERE  - information about unhandled exceptions
    .level = INFO
    handlers = java.util.logging.FileHandler, java.util.logging.ConsoleHandler
    java.util.logging.SimpleFormatter.format = [%1$td/%1$tm/%1$tY %1$tH:%1$tM:%1$tS %1$tz] %4$s: "%2$s:" "%5$s" %6$s%n
    java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter
    java.util.logging.ConsoleHandler.level = WARNING
    java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter
    java.util.logging.FileHandler.pattern = watchdog.log
    java.util.logging.FileHandler.level = INFO
    java.util.logging.FileHandler.append = true

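Because these are standard java.util.logging settings, the level of a single class can be raised with a `<logger name>.level` entry. The following sketch shows how java.util.logging interprets such a configuration. It is an assumption-laden illustration: how Watchdog itself loads watchdog.properties is not described here, the class LoggingConfigSketch and its methods are hypothetical, and only a ConsoleHandler is configured so that the example does not write a log file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;

/** Illustrative only: not part of IOM. */
final class LoggingConfigSketch {

    /** Loads a java.util.logging configuration given as properties text. */
    static void configure(String propertiesText) {
        byte[] bytes = propertiesText.getBytes(StandardCharsets.ISO_8859_1);
        try {
            LogManager.getLogManager().readConfiguration(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the level that applies to the named logger, walking up to its parents. */
    static Level effectiveLevel(String loggerName) {
        Logger logger = Logger.getLogger(loggerName);
        while (logger.getLevel() == null && logger.getParent() != null) {
            logger = logger.getParent();
        }
        return logger.getLevel();
    }

    public static void main(String[] args) {
        configure(String.join("\n",
                ".level = INFO",
                "handlers = java.util.logging.ConsoleHandler",
                "java.util.logging.ConsoleHandler.level = FINE",
                // Raise a single class to FINE; all other classes stay at INFO.
                "com.intershop.oms.watchdog.HealthCheckRequest.level = FINE"));
        System.out.println(effectiveLevel("com.intershop.oms.watchdog.HealthCheckRequest"));
        System.out.println(effectiveLevel("com.intershop.oms.watchdog.Watchdog"));
    }
}
```

Note that a handler only emits records at or above its own level, so a class set to FINE also needs a handler whose level is FINE or lower.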
Failover guarantees that only one application server is running at a time. All other application servers are stopped and wait to take over the role of the active server. Since Watchdog is responsible for starting and stopping the application server, it is also responsible for controlling the active-state of the application server. All watchdogs that control backend servers within an IOM cluster have to communicate with each other in order to negotiate which of them runs the active application server. Also see Guide - Intershop Order Management - Technical Overview section High Availability.
This communication uses the IOM database. The watchdog has to implement only two operations: BecomeActiveServer and RefreshActiveServer.
Before Watchdog is allowed to start the application server (the watched process), it has to ensure that this server is the active one. BecomeActiveServer sets the current server (property watchdog.failover.serverid) to active-state if all preconditions are satisfied.
If no server is currently active, the current server is made the active one. If the current server is already active, its active-state is refreshed. Exceptions must be handled internally, since watchdog needs to keep working even if no database is available.
Returns true only if the current server became active or the active-state of the current server was refreshed.
BecomeActiveServer has to ensure that parallel execution by different watchdogs returns true to only one caller.
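The contract of BecomeActiveServer can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the actual implementation: the real watchdog keeps this state in the IOM database, whose schema is not described here, and a real implementation would additionally catch database exceptions and return false. The class BecomeActiveServerSketch, its fields and the helper countWinners are hypothetical, and treating a registration older than watchdog.failover.timeout as "no server is active" is an assumption.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** In-memory stand-in for the failover record (assumption: the real one lives in the IOM database). */
final class BecomeActiveServerSketch {
    private final long timeoutMillis;  // corresponds to watchdog.failover.timeout
    private String activeServerId;     // null: no server is registered as active
    private long lastRefreshMillis;

    BecomeActiveServerSketch(long timeoutSeconds) {
        this.timeoutMillis = timeoutSeconds * 1000L;
    }

    /** Returns true only if serverId became active or its active-state was refreshed. */
    synchronized boolean becomeActiveServer(String serverId, long nowMillis) {
        boolean timedOut = activeServerId != null && nowMillis - lastRefreshMillis >= timeoutMillis;
        if (activeServerId == null || timedOut || activeServerId.equals(serverId)) {
            activeServerId = serverId;
            lastRefreshMillis = nowMillis;
            return true;
        }
        return false; // another server is active and has not timed out
    }

    /** Lets several watchdogs call becomeActiveServer in parallel and counts the callers that got true. */
    static int countWinners(int watchdogs) {
        BecomeActiveServerSketch state = new BecomeActiveServerSketch(20);
        AtomicInteger winners = new AtomicInteger();
        Thread[] threads = new Thread[watchdogs];
        for (int i = 0; i < watchdogs; i++) {
            String serverId = "server-" + i;
            threads[i] = new Thread(() -> {
                if (state.becomeActiveServer(serverId, 0L)) {
                    winners.incrementAndGet();
                }
            });
            threads[i].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return winners.get();
    }

    public static void main(String[] args) {
        System.out.println("winners among 8 parallel watchdogs: " + countWinners(8));
    }
}
```

In this model the `synchronized` method makes the check and the update one atomic step, which is what guarantees a single winner. A database-backed implementation needs an equivalent atomic step, for example a transaction or a conditional update.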
If Watchdog has already started the application server (the server became active before), its active-state has to be refreshed before it times out. RefreshActiveServer must not refresh the active-state if it has already timed out. Exceptions have to be handled internally. Inconsistencies in the data have to be handled too: if another server is marked active as well, the own server must not be refreshed.
Returns true only if the active-state of the server was refreshed.
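The rules for RefreshActiveServer can be sketched in the same way. Again this is an in-memory illustration under assumptions, not the actual implementation: the real state is kept in the IOM database, and the class RefreshActiveServerSketch, its map of active servers and the helper markActive are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for the failover data (assumption: the real data lives in the IOM database). */
final class RefreshActiveServerSketch {
    private final long timeoutMillis; // corresponds to watchdog.failover.timeout
    // Server ID -> time of last refresh, for every server currently marked active.
    private final Map<String, Long> activeServers = new HashMap<>();

    RefreshActiveServerSketch(long timeoutSeconds) {
        this.timeoutMillis = timeoutSeconds * 1000L;
    }

    /** Demo helper: marks a server active without any checks, e.g. to simulate inconsistent data. */
    synchronized void markActive(String serverId, long nowMillis) {
        activeServers.put(serverId, nowMillis);
    }

    /** Returns true only if the active-state of serverId was refreshed. */
    synchronized boolean refreshActiveServer(String serverId, long nowMillis) {
        Long lastRefresh = activeServers.get(serverId);
        if (lastRefresh == null) {
            return false; // own server is not marked active
        }
        if (activeServers.size() > 1) {
            return false; // inconsistent data: another server is marked active too
        }
        if (nowMillis - lastRefresh >= timeoutMillis) {
            return false; // active-state has already timed out
        }
        activeServers.put(serverId, nowMillis);
        return true;
    }

    public static void main(String[] args) {
        RefreshActiveServerSketch state = new RefreshActiveServerSketch(20);
        state.markActive("server-1", 0L);
        System.out.println("refresh after 10 s: " + state.refreshActiveServer("server-1", 10_000L));
        System.out.println("refresh after long pause: " + state.refreshActiveServer("server-1", 60_000L));
    }
}
```

When the method returns false, the caller can no longer assume that its server is the active one.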
The following diagram shows the main aspects of how Watchdog controls the watched process.
Integrating Watchdog into IOM requires some additional processes.
Watchdog is started by systemd. Systemd gets all the required information from oms-watchdog.service, which has to be created from $OMS_ETC/oms-watchdog.service.template during installation. It has to be updated whenever the settings of OMS_HOME or OMS_ETC change.
Watchdog.sh also has a built-in mechanism to work without OMS_HOME and OMS_ETC. If the variables are not set, watchdog.sh looks up set_env.sh relative to the location of watchdog.sh. Watchdog.sh itself starts the Java-based watchdog program.
The Java-based watchdog is the actual watchdog program: it starts and stops the application server and checks its health and readiness.
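A single health check request with separate connect and read timeouts, as configured by watchdog.healthcheck.connect.timeout and watchdog.healthcheck.read.timeout, could look like the following sketch. It is not the watchdog's actual code: the class HealthCheckSketch is hypothetical, and treating HTTP status 200 as "healthy" is an assumption (see Concept - IOM Server Health Check for the actual semantics).

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Illustrative only: not part of IOM. */
final class HealthCheckSketch {

    /**
     * Requests the given URL once. Returns true only on HTTP 200 (assumption).
     * Any failure, including a timeout or an invalid URL, counts as "not healthy".
     */
    static boolean isHealthy(String healthCheckUrl, int connectTimeoutSeconds, int readTimeoutSeconds) {
        HttpURLConnection connection = null;
        try {
            connection = (HttpURLConnection) URI.create(healthCheckUrl).toURL().openConnection();
            connection.setConnectTimeout(connectTimeoutSeconds * 1000); // milliseconds
            connection.setReadTimeout(readTimeoutSeconds * 1000);       // milliseconds
            connection.setRequestMethod("GET");
            return connection.getResponseCode() == HttpURLConnection.HTTP_OK;
        } catch (IOException | RuntimeException e) {
            return false;
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        // The URL is passed in; this address is only a placeholder that is expected to fail.
        String url = args.length > 0 ? args[0] : "http://127.0.0.1:1/";
        System.out.println(url + " healthy: " + isHealthy(url, 5, 5));
    }
}
```

Handling every exception inside the check keeps the calling loop simple: a failed request is just another unhealthy result that counts towards watchdog.healthcheck.timeout.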
Watchdog.sh is also responsible for passing additional options to the Java-based watchdog. These options can be defined in the property WATCHDOG_JAVA_OPTS in $OMS_ETC/installation.properties.
Also see Guide - Setup Intershop Order Management 2.2.
The Java-based watchdog starts standalone.sh and has to stop it on failure. However, when it tries to stop it, only standalone.sh is killed, while the Java process started by standalone.sh remains alive. To overcome this problem (without modifying standalone.sh or accessing the internal Java class UNIXProcess), an additional program is needed that starts standalone.sh and kills all sub-processes. This additional program is named standalone_starter.sh. Standalone_starter.sh only starts $JBOSS_HOME/bin/standalone.sh, but additionally provides a handler that is executed on exit and kills the sub-processes.