The Intershop Commerce Management (ICM) import/export framework (in short: ImpEx framework) consists of specific Java classes, pipelets, and pipelines for importing and exporting data. The framework is used by the standard ImpEx wizards that guide ICM users through standard import or export operations. Developers can use the framework to extend its functionality or to customize existing import and export operations. The ImpEx framework is pipeline-oriented. Developers can use the existing pipelines to do the following:
The import functionality is based on the Simple API for XML (SAX) parser combined with JAXB for parsing XML files. For standard database imports, ICM uses the ORM (object relational mapping) layer to bulk mass data into the database. The export functionality uses the template language ISML to serialize persistent objects.
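The following minimal sketch illustrates the streaming idea behind such SAX-based parsing. It uses only standard JDK SAX classes, not the actual ICM parser classes; the element name and the file name are placeholders taken from the examples later in this document.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import java.io.File;

public class StreamingImportSketch {

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();

        // The handler receives SAX events while the file is streamed,
        // so even very large import files never have to fit into memory.
        parser.parse(new File("Products.xml"), new DefaultHandler() {
            private int productCount;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // "product" is a placeholder element name; a real handler would
                // hand the parsed element (e.g., via JAXB) to the next processing step.
                if ("product".equals(localName)) {
                    productCount++;
                }
            }

            @Override
            public void endDocument() {
                System.out.println("Parsed products: " + productCount);
            }
        });
    }
}
```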
The following functionality with corresponding API is provided:
Term | Description |
---|---|
ImpEx | Short for import and export. |
Controller | Central object providing general functionality like logging, statistics, configuration, ... |
A typical import process involves the following steps:
There are three thread groups. The first parses the file, the second validates and complements the parsed objects, and the third thread group is responsible for writing the data into the database. Parsing, validating, and bulking run in parallel. Validating and bulking in many cases can be parallelized using multiple threads to ensure high performance during import.
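The following self-contained sketch illustrates this parse/validate/bulk pipeline idea with plain java.util.concurrent primitives. It does not use the actual ICM classes; the element type, queue sizes, and single thread per stage are illustrative assumptions (the validator and bulker stages could each use several threads).

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ImportPipelineSketch {

    record Element(String id) {}                      // stand-in for a parsed import element
    static final Element POISON = new Element(null);  // end-of-stream marker

    public static void main(String[] args) throws Exception {
        BlockingQueue<Element> parsed = new ArrayBlockingQueue<>(1000);
        BlockingQueue<Element> validated = new ArrayBlockingQueue<>(1000);

        // Thread group 1: parses the source and feeds the first queue.
        Thread parser = new Thread(() -> {
            for (String id : List.of("A", "B", "C")) {     // stands in for SAX parsing
                putQuietly(parsed, new Element(id));
            }
            putQuietly(parsed, POISON);
        });

        // Thread group 2: validates and complements elements (could be several threads).
        Thread validator = new Thread(() -> {
            Element e;
            while ((e = takeQuietly(parsed)) != POISON) {
                putQuietly(validated, e);                  // real validation happens here
            }
            putQuietly(validated, POISON);
        });

        // Thread group 3: bulks validated elements into the database (could be several threads).
        Thread bulker = new Thread(() -> {
            Element e;
            while ((e = takeQuietly(validated)) != POISON) {
                System.out.println("bulked " + e.id());    // batched INSERT/UPDATE here
            }
        });

        parser.start(); validator.start(); bulker.start();
        parser.join(); validator.join(); bulker.join();
    }

    private static void putQuietly(BlockingQueue<Element> q, Element e) {
        try { q.put(e); } catch (InterruptedException ex) { Thread.currentThread().interrupt(); }
    }

    private static Element takeQuietly(BlockingQueue<Element> q) {
        try { return q.take(); } catch (InterruptedException ex) { Thread.currentThread().interrupt(); return POISON; }
    }
}
```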
Some import processes, such as those for products and catalogs, require these three steps to be executed twice: in the first phase, the raw data without relations is imported; afterwards, when all objects exist in the database, the import is executed again to import the relations between the imported objects.
Every export process executes the following steps:
All import/export data files, such as sources, export files, configuration files, templates, etc., are stored in the IS_HOME/share/sites/<site>/unit/<unit>/impex directory. When uploading an import file, ICM transfers it from the source location to the appropriate location in the ImpEx directory. Similarly, when exporting data, you can use the back office wizard to download export target files from the respective ImpEx directory location.
Subdirectory | Description | Example |
---|---|---|
| | To store previous import/export files. | n/a |
| | Configuration files for import/export processes. | |
| | Export target files. | ProductExport.xml |
| | Obsolete directory for Oracle SQL*Loader. | n/a |
| | Logs created by parser, validator, and bulker are stored in this directory. | ProductImport.log |
| | Import data source files. | Products.xml |
| | Temporary files. | n/a |
Schema files (XSD) location:
The complete configuration and controlling of import and export pipelines is handled by the Controller object. The Controller is created within the DetermineConfiguration pipelet, which determines the pipelet configuration and stores the Controller object in the pipeline dictionary. The Controller is the only pipeline dictionary object passed between pipelets; they use it to access configuration-specific and import/export-specific services (for example, logging, progress notification, interaction, and event polling). The first action within an import/export pipeline must be the creation of the Controller object by the DetermineConfiguration pipelet, because all other import/export-specific pipelets depend on it. Calling an import/export pipelet without an existing Controller causes a severe error.
Configuration values are global or local (pipelet-specific) and can be stored in multiple ways:
* Pipelet configuration value: The Controller retrieves the pipelet configuration value for the given key.
* Global Controller property: The Controller retrieves the property from a configuration file when the DetermineConfiguration pipelet is executed. Global means that the looked-up key has no pipelet-specific extension.
* Local Controller property: The Controller retrieves the property from a configuration file when the DetermineConfiguration pipelet is executed. Local means that the looked-up key is extended with the pipelet identifier; for example, if the key is LogFacility and the pipelet name is Import, the key Import.LogFacility is looked up.
* Global pipeline dictionary value: The Controller retrieves the value from the pipeline dictionary. Global means that the looked-up key has no pipelet-specific extension.
* Local pipeline dictionary value: The Controller retrieves the value from the pipeline dictionary. Local means that the dictionary key is extended with the pipelet identifier, e.g., Import.DefaultImportMode.
When the Controller encounters a configuration value more than once, it uses the order of precedence shown above to decide which one to use; values found later in this order supersede earlier ones. Always use the Controller methods to access configuration values; doing so allows the configuration to be changed at runtime. A configuration value can be a pipelet configuration property value, a Controller property value, or a pipeline dictionary value. To distinguish between different pipelets of the same kind, each pipelet must be configured with a unique pipelet identifier within the pipeline descriptor.
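As a rough illustration of this precedence scheme, the following sketch resolves a value by consulting the sources in the documented order and letting later sources override earlier ones. All class and method names below are hypothetical; this is not the actual Controller API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of precedence-based configuration lookup; not the ICM Controller API. */
public class ConfigLookupSketch {

    private final Map<String, String> pipeletConfig = new LinkedHashMap<>();
    private final Map<String, String> controllerProperties = new LinkedHashMap<>();
    private final Map<String, String> pipelineDictionary = new LinkedHashMap<>();

    /**
     * Resolves a configuration value for the given key and pipelet identifier.
     * Sources are checked in the documented order; a value found in a later
     * source supersedes values found in earlier sources.
     */
    public String getConfigValue(String key, String pipeletId) {
        String localKey = pipeletId + "." + key;                       // e.g., "Import.DefaultImportMode"
        String value = pipeletConfig.get(key);                         // 1. pipelet configuration value
        value = override(value, controllerProperties.get(key));        // 2. global Controller property
        value = override(value, controllerProperties.get(localKey));   // 3. local Controller property
        value = override(value, pipelineDictionary.get(key));          // 4. global dictionary value
        value = override(value, pipelineDictionary.get(localKey));     // 5. local dictionary value
        return value;
    }

    private static String override(String current, String candidate) {
        return candidate != null ? candidate : current;
    }

    public static void main(String[] args) {
        ConfigLookupSketch cfg = new ConfigLookupSketch();
        cfg.pipeletConfig.put("DefaultImportMode", "OMIT");
        cfg.pipelineDictionary.put("Import.DefaultImportMode", "UPDATE");
        // The local dictionary value wins over the pipelet configuration default.
        System.out.println(cfg.getConfigValue("DefaultImportMode", "Import")); // UPDATE
    }
}
```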
The specific processes executed during an import are determined by the selected import mode. If no mode or an invalid mode is set, the mode OMIT
is used by default. The following import modes can be set for importing data:
* IGNORE
* INITIAL
* UPDATE
* REPLACE
* OMIT
* DELETE
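A small sketch of how such a mode selection could be represented, including the documented fallback to OMIT when no mode or an invalid mode is given. The helper itself is hypothetical and not part of the framework; only the mode names and the OMIT default come from this document.

```java
/** Hypothetical helper; mode names and the OMIT fallback follow the list above. */
public enum ImportMode {
    IGNORE, INITIAL, UPDATE, REPLACE, OMIT, DELETE;

    /** Returns the matching mode, or OMIT if the value is missing or invalid. */
    public static ImportMode fromString(String value) {
        if (value == null) {
            return OMIT;
        }
        try {
            return valueOf(value.trim());        // modes are expected in uppercase
        } catch (IllegalArgumentException invalidMode) {
            return OMIT;
        }
    }
}
```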
The import mode has a significant impact on overall import performance. When deciding on the import mode, take the following considerations into account:
There are two ways to set the import mode:

* Globally for the import process, for example, as the default import mode in the pipelet configuration or pipeline dictionary (see DefaultImportMode above).
* Individually per element in the import file, using the import-mode attribute.

The attribute import-mode and the modes must be specified in uppercase, for example: import-mode = "IGNORE". Check the respective schema definition for details.

Import/export uses Logger
objects that you create using the CreateLogger
or CreateFileLogger
pipelets (you can create as many Logger
objects as you need). Once created, a logger is registered at the Controller
object with a certain name that you can use to obtain the logger from the controller. A logger also has a log level assigned, e.g., debug, error or warning levels. Log levels can be combined using an OR
operator. The base class for loggers and a ready-to-use NativeLogger (logs to stdout) and a FileLogger (logs to a file) are defined in the core framework.
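To illustrate the idea of named loggers with OR-combinable log levels, here is a self-contained sketch using plain bit flags. The pipelet and class names mentioned in the text refer to the framework itself, whereas everything in this sketch is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a logger registry with bit-flag log levels; not the ICM classes. */
public class LoggerRegistrySketch {

    // Log levels as bit flags so they can be combined with a bitwise OR.
    static final int ERROR = 1, WARN = 2, DEBUG = 4;

    interface Logger {
        void log(int level, String message);
    }

    /** Simple stdout logger that only prints messages whose level is enabled. */
    static Logger stdoutLogger(int enabledLevels) {
        return (level, message) -> {
            if ((enabledLevels & level) != 0) {
                System.out.println(message);
            }
        };
    }

    // Loggers are registered under a name, similar to registering them at the Controller.
    private final Map<String, Logger> loggers = new HashMap<>();

    void register(String name, Logger logger) { loggers.put(name, logger); }

    Logger get(String name) { return loggers.get(name); }

    public static void main(String[] args) {
        LoggerRegistrySketch registry = new LoggerRegistrySketch();
        registry.register("ImportLog", stdoutLogger(ERROR | WARN)); // levels combined via OR
        registry.get("ImportLog").log(WARN, "price list is empty"); // printed
        registry.get("ImportLog").log(DEBUG, "parsed element 42");  // suppressed
    }
}
```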
To gather statistics for progress notification and keep track of the current position within the import process, statistics objects of type XMLStatistics are used. The statistics are accessible through the Controller and, for the "outer world", through the import interactor. In general, the statistics are created by parsing the XML source and storing the number of elements; by incrementing a counter for a certain element, the current count of processed elements can be obtained. The statistics object is created in the CollectXMLStatistics pipelet. Usually, the parse content handler is in charge of incrementing the counters.
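A minimal sketch of this two-step idea (count the elements first, then increment a per-element counter while processing). The class and method names below are hypothetical; the real XMLStatistics class may differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical progress statistics: totals per element name and processed-element counters. */
public class ImportStatisticsSketch {

    private final Map<String, Long> totalPerElement = new ConcurrentHashMap<>();
    private final Map<String, AtomicLong> processedPerElement = new ConcurrentHashMap<>();

    /** Filled while the XML source is parsed the first time (counting pass). */
    public void setTotal(String elementName, long count) {
        totalPerElement.put(elementName, count);
    }

    /** Called by the parse content handler for every processed element. */
    public void increment(String elementName) {
        processedPerElement.computeIfAbsent(elementName, k -> new AtomicLong()).incrementAndGet();
    }

    /** Progress in percent for one element type, e.g., for a progress display. */
    public double progress(String elementName) {
        long total = totalPerElement.getOrDefault(elementName, 0L);
        long done = processedPerElement.getOrDefault(elementName, new AtomicLong()).get();
        return total == 0 ? 0.0 : 100.0 * done / total;
    }

    public static void main(String[] args) {
        ImportStatisticsSketch stats = new ImportStatisticsSketch();
        stats.setTotal("product", 200);        // determined by the counting pass
        for (int i = 0; i < 50; i++) {
            stats.increment("product");        // incremented while importing
        }
        System.out.printf("product import: %.1f%%%n", stats.progress("product")); // 25.0%
    }
}
```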
To prevent multiple imports of related objects (e.g., products) into a single unit, an import process can be locked by using the LockImport
pipelet. The LockImport
pipelet uses the locking framework, see Concept - Locking Framework.
Several import resources exist that can be used to lock a certain import. The main import resource is named Import; all other import resources use it as their parent resource.
The resources are locked in a unit-specific way by configuring the LockImport
pipelet accordingly. The following resources are available for import:
* UserImport
* CategoryImport
* ProductImport
* DiscountImport
* PriceImport
* ProductTypeImport
* VariationTypeImport
* OrderImport
Some parts of import pipelines change the database content (e.g., the Import pipelet). Those parts must be locked to prevent concurrent database changes on the same tables (e.g., data replication of products vs. product import, or product imports in different domains). The LockImport pipelet locks the matching database resources for those tasks (that is, children of the Database resource). For example, the product import locks the Products resource before running the Import pipelet for the product tables. Since the Products resource is also used by replication processes, a replication process and a product import cannot run concurrently.
This pipelet locks the given resources to avoid concurrent operations on the same resources. The resources are specified by the ResourceList parameter, a semicolon-separated list of resource names that must be available in the table RESOURCEPO. If the parameter IsDomainSpecific is set to true, resources are locked only in the current domain, which makes it possible to start the same pipeline in different domains concurrently. If no resources are specified, the pipelet acquires the Database resource (system-wide), so no other import, staging, or other process requiring the Database resource or one of its sub-resources can run concurrently. If one or more required resources cannot be acquired, the pipelet returns with an error. The import process holding the acquisition is read from the pipeline dictionary; if no process is found, a new process is created. The acquisition made is stored in the pipeline dictionary.
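The following sketch mimics the described behavior (semicolon-separated resource list, optional domain-specific locking, failure if any resource is already taken) with a simple in-memory registry. The real pipelet uses the ICM locking framework and the RESOURCEPO table instead, so all names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for resource locking; the real pipelet uses the locking framework. */
public class ImportLockSketch {

    private final Set<String> acquired = ConcurrentHashMap.newKeySet();

    /**
     * Tries to lock all resources from a semicolon-separated list.
     * If domain-specific, the lock key is qualified with the domain so the same
     * pipeline can run in different domains concurrently.
     * Returns false (and releases anything it grabbed) if a resource is already locked.
     */
    public synchronized boolean lock(String resourceList, boolean domainSpecific, String domain) {
        String[] names = (resourceList == null || resourceList.isBlank())
                ? new String[] {"Database"}                  // fallback: system-wide Database resource
                : resourceList.split(";");
        List<String> taken = new ArrayList<>();
        for (String name : names) {
            String key = domainSpecific ? domain + ":" + name.trim() : name.trim();
            if (!acquired.add(key)) {                        // someone else holds this resource
                acquired.removeAll(taken);                   // roll back partial acquisition
                return false;
            }
            taken.add(key);
        }
        return true;
    }

    public synchronized void unlock(String resourceList, boolean domainSpecific, String domain) {
        for (String name : resourceList.split(";")) {
            acquired.remove(domainSpecific ? domain + ":" + name.trim() : name.trim());
        }
    }

    public static void main(String[] args) {
        ImportLockSketch locks = new ImportLockSketch();
        System.out.println(locks.lock("ProductImport;PriceImport", true, "ChannelA")); // true
        System.out.println(locks.lock("ProductImport", true, "ChannelA"));             // false, already locked
        System.out.println(locks.lock("ProductImport", true, "ChannelB"));             // true, other domain
    }
}
```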
The pipelet AcquireFileResource is responsible for virtually locking a file to avoid conflicts with other processes working on the same ImpEx file. The resources to acquire are passed in either as a list of resources from the pipeline dictionary or as a list of resource names from the configuration or dictionary. The pipelet stores the acquisition result and the acquisition itself in the pipeline dictionary.
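As a rough illustration of such a virtual file lock, here is a plain in-memory registry keyed by file name. This is not the actual pipelet implementation, which stores its acquisitions in the pipeline dictionary; all names below are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical virtual file lock: the first caller wins, everyone else is rejected. */
public class FileResourceSketch {

    private final Map<String, String> acquisitions = new ConcurrentHashMap<>();

    /** Returns true if the ImpEx file was acquired for the given process. */
    public boolean acquire(String fileName, String processId) {
        return acquisitions.putIfAbsent(fileName, processId) == null;
    }

    public void release(String fileName, String processId) {
        acquisitions.remove(fileName, processId);   // only the owner can release
    }

    public static void main(String[] args) {
        FileResourceSketch files = new FileResourceSketch();
        System.out.println(files.acquire("src/Products.xml", "import-1")); // true
        System.out.println(files.acquire("src/Products.xml", "import-2")); // false, already in use
        files.release("src/Products.xml", "import-1");
        System.out.println(files.acquire("src/Products.xml", "import-2")); // true
    }
}
```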
Each LockImport
pipelet must have a corresponding UnlockImport
pipelet to release the locked resources.
There are three different implementations to bulk data into the database:
The standard import processes use the ORMImportMgr to set up the import environment, including queues, threads, and so on. The goal is to write a pipeline that calls the particular import pipelets with the corresponding configurations. Each import process has to configure the business object-specific XML parser (XMLParseContentHandler), validator (ElementValidator), and bulker (ElementBulker). Normally, this is done in the corresponding pipeline. The resulting object diagram looks like this:
For every business object that can be imported, a processing pipeline exists. The name of the pipeline follows this pattern Process<BusinessObject>Import, for example ProcessProductImport. Each pipeline has the same basic design:

* Process<BusinessObject>Import-Validate parses the import file.
* Process<BusinessObject>Import-Prepare takes care of the initial data conversion, parsing, and XML validation processes.
* Process<BusinessObject>Import-Import then executes the actual import process.

The following properties can be added to the import property file to tune the process:
Global property key | Import property key | Description | Default value | Location |
---|---|---|---|---|
| | | Number of import elements that should be batched together to the database. | 100 | |
| | | Number of import batches (see above) that should be committed together in the database. | 100 | |
| | | The number of validator threads. | 1 | |
| | | The number of bulker threads. | 4 | |
| | | The size of the JVM. Increasing the JVM size may improve cache efficiency. | | IS_HOME/bin/tomcat.sh |
| | | Tuning of the garbage collector. | | IS_HOME/bin/tomcat.sh |
| | | Defines a percentage (%); if the ratio of imported products/offers to the current products/offers in the import domain exceeds this threshold, additional actions are performed. If the import does not exceed the threshold, only incremental search index updates and cache clearing are done. | 10 | IS_SHARE/system/config/cluster/appserver.properties |
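To make the two batching parameters above concrete, the following standalone JDBC sketch batches rows to the database and commits after a configurable number of batches. The table name, column name, and connection URL are placeholders (any JDBC connection would do), and the real bulker (ElementBulkerORM) works on ORM objects rather than raw JDBC.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

/** Standalone JDBC sketch of batch size and commit interval; not the ICM bulker implementation. */
public class BatchBulkSketch {

    static final int BATCH_SIZE = 100;       // elements per database batch
    static final int COMMIT_INTERVAL = 100;  // batches per commit

    public static void bulk(Connection connection, List<String> productIds) throws Exception {
        connection.setAutoCommit(false);
        int inBatch = 0, batches = 0;
        try (PreparedStatement insert =
                 connection.prepareStatement("INSERT INTO PRODUCT_STAGE (SKU) VALUES (?)")) {
            for (String sku : productIds) {
                insert.setString(1, sku);
                insert.addBatch();
                if (++inBatch == BATCH_SIZE) {            // flush one batch to the database
                    insert.executeBatch();
                    inBatch = 0;
                    if (++batches == COMMIT_INTERVAL) {   // commit a group of batches
                        connection.commit();
                        batches = 0;
                    }
                }
            }
            insert.executeBatch();                        // flush the remainder
            connection.commit();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder in-memory database URL; replace with your own JDBC connection.
        try (Connection connection = DriverManager.getConnection("jdbc:h2:mem:impex")) {
            connection.createStatement().execute("CREATE TABLE PRODUCT_STAGE (SKU VARCHAR(64))");
            bulk(connection, List.of("10001", "10002", "10003"));
        }
    }
}
```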
If the import process needs to bulk mass data into the database, the following aspects need to be considered:

* ElementBulkerORM automatically triggers analyzing database tables during the import process.

All standard export processes use the ImpEx framework to provide basic functionality, including ISML templating, logging, and monitoring. Each export process therefore consists at least of a pipeline with the following start nodes:
* Prepare:
  * DetermineConfiguration
  * CreateFileLogger
  * GetPageable
* RunExport:
  * OpenFile
  * OpenFilter
  * Export
  * CloseFilter
  * CloseFile
* CleanUp:
  * CloseLoggers
Due to serialization into the file system, the multi-threading approach does not make sense, because the shared file system is often the bottleneck.
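The start node sequence above roughly corresponds to a single-threaded flow of open file, open filter, export each object, close filter, close file. The sketch below mirrors that flow with plain Java I/O; the object type and method names are hypothetical, and no ISML templating or ICM pipelets are involved.

```java
import java.io.BufferedWriter;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical single-threaded export flow: open file, open filter, export, close filter, close file. */
public class ExportFlowSketch {

    record Product(String sku, String name) {}   // stand-in for a pageable collection of persistent objects

    public static void main(String[] args) throws Exception {
        List<Product> pageable = List.of(new Product("10001", "Mouse"), new Product("10002", "Keyboard"));

        Path exportFile = Path.of("ProductExport.xml");           // in ICM, export target files end up in the export subdirectory
        try (Writer file = Files.newBufferedWriter(exportFile);
             BufferedWriter filter = new BufferedWriter(file)) {  // a filter could also compress or encrypt the stream
            filter.write("<products>\n");
            for (Product product : pageable) {                    // serialization; ICM would render an ISML template instead
                filter.write("  <product sku=\"" + product.sku() + "\">" + product.name() + "</product>\n");
            }
            filter.write("</products>\n");
        }                                                          // try-with-resources closes the filter, then the file
        System.out.println("Exported " + pageable.size() + " products to " + exportFile);
    }
}
```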