Data Extractor

Stencil Editor

Application
The Stencil Editor is a desktop application developed in C# with Windows Presentation Foundation (WPF). It loads the source document and the stencil side by side in a single user interface, so data extraction rules can be defined and tested against actual sample data. Each extraction rule can be executed individually, or the entire extractor can be run at once. The user interface maintains a complete undo/redo stack for all stencil editing operations. The Stencil Editor setup also includes filters for importing files in PDF, HTML, and XLS formats.

Stencil
Stencils are XML-based documents that contain one or more ruleset elements. Rulesets can be nested, with child rulesets serving as optional parts of the root-level element. A valid ruleset contains one or more items, and the set of items in a ruleset defines the structure of the data that the extractor service extracts from the input document. Each item has one or more matching patterns, with multiple patterns serving as alternative searches.

The ruleset includes a collection of condition elements that define the document type. Multiple conditions are combined with logical operators, and the resulting boolean value determines whether the stencil is applicable to the particular input data.

Conditions and patterns utilize the regular expression language, allowing for flexible data matching. Regular expressions are a well-established method for searching for patterns in textual data, so many people in the IT industry are already familiar with the syntax. The OctoDOC Stencil concept builds upon regular expressions but implements an additional structural layer that makes it possible to define complex extraction rules by combining multiple small and simple search expressions.

Publisher
The OctoDOC Editor includes a Publisher component as part of the user interface. The Publisher connects to the live extractor service through its REST API to manage the list of stencils available on the server. Newly uploaded stencils become available immediately, optionally on multiple service instances backing the single REST API entry point. The Publisher also allows uploading metadata and attachments. For instance, common example files can be kept on the site along with the stencil.

Stencil tools
The OctoDOC Editor includes convenience features such as a regular expression generator and a pattern library, which assist in building new stencils from existing, tested parts. A regular expression cannot, in general, be deduced from a single sample value, so the generator instead lets users build a list of regular expression parts that are tested against the sample data. When a suitable predefined construct is found, it is suggested as a new pattern.

The Windows version of the extractor core is installed along with the OctoDOC Editor to provide a preview of the stencil execution for testing and development. Clicking the “Test” button in the OctoDOC Editor will launch the extractor runtime as a background process and present the JSON data retrieved from the sample content.

Deployment
The Stencil Editor installer can be downloaded for 64-bit Windows and includes the 64-bit Windows runtime (extractor.exe). A separate runtime tools package also includes the command-line version of the extractor.

The extractor runtime is deployed in a Docker container that runs the Linux operating system. A running instance of the extractor can be managed via the REST API or through the built-in management web page. The management interface allows adding stencils and observing execution logs. Each extractor invocation produces a log entry that includes a result code (none, partial, or complete). Input files can optionally be retained, and the collected samples can be used for developing new stencils.

The extractor can also be invoked as a command-line application on Linux and Windows.

Stencil format
A stencil file contains part of the configuration for the extractor application, stored in XML format. When the extractor loads a stencil, it adds the rulesets from the stencil to the runtime configuration. The extraction process tests the rulesets against the input document one by one, in the order in which they appear in the list. When a ruleset matches the document, the extractor applies that ruleset to the input data.

The order in which the extractor uses stencils can be controlled through the Order attribute at the root level of the stencil document. When Order is set to a positive integer, the value is the index of the stencil in the extractor's stencil list (the stencil goes last when the index is larger than the number of pre-existing stencils).

When Order is set to -1 (minus one, the default value), the stencil is added to the end of the list, and any stencils loaded later are added after it. Use -1 when the position of the stencil is not important.

To keep a stencil at the end of the list, so that it is always applied after all other stencils, use the value -2 (minus two). A stencil with Order -2 is always kept at the end of the list. When there are multiple stencils with Order -2, their order relative to each other is implementation-specific.
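
The ordering rules above can be sketched in a few lines of Python. This is only an illustration, not the extractor's actual implementation: the StencilList class and its method names are hypothetical, and treating a positive Order as a zero-based list index is an assumption, since the text does not say whether the index starts at 0 or 1.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StencilList:
    """Hypothetical model of the extractor's ordered stencil list."""
    regular: List[str] = field(default_factory=list)      # Order > 0 or Order == -1
    pinned_last: List[str] = field(default_factory=list)  # Order == -2

    def add(self, name: str, order: int = -1) -> None:
        if order == -2:
            # Always kept behind every regular stencil.
            self.pinned_last.append(name)
        elif order == -1:
            # Default: append after the stencils loaded so far.
            self.regular.append(name)
        elif order > 0:
            # Assumption: Order is a zero-based index. list.insert places
            # the item last when the index exceeds the current length.
            self.regular.insert(order, name)
        else:
            raise ValueError(f"unsupported Order value: {order}")

    def ordered(self) -> List[str]:
        return self.regular + self.pinned_last


stencils = StencilList()
stencils.add("fallback", order=-2)
stencils.add("invoice-a")
stencils.add("invoice-b")
stencils.add("priority", order=1)
stencils.add("late", order=99)
print(stencils.ordered())
```

Keeping the -2 stencils in a separate list is one simple way to guarantee that later additions never end up behind them.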

A stencil contains one or more units called rulesets.

Ruleset
A ruleset defines a number of rules for detecting a certain document type and for extracting data items from the document content. Rulesets can be nested, in which case child rulesets serve as optional parts of the root-level ruleset. To be valid, a ruleset must contain one or more items. The set of items in a ruleset defines the structure of the data that the extractor service extracts from the input document.

Constraints
The ruleset may contain constraint elements that define the match criteria for the ruleset. Multiple constraints are combined with logical operators, and the resulting boolean value determines whether the ruleset is applied or skipped.

Constraint elements are used for detecting the document type. A ruleset without constraints is always applied, to any type of input. When there is a large number of rulesets, a significant amount of processing may be needed to find the best match. Constraints make it possible to narrow down the number of rulesets that are applied to the input content.

Constraints can be nested and combined with the logical operators AND, OR, AND NOT, and OR NOT. This makes it possible to create complex match conditions. A constraint is similar to a pattern, except that it does not extract any data and has no format or data type attributes. The result of a constraint is a Boolean value, true or false. All constraints combined also produce a single Boolean value. When the matching process (applying all constraints) yields false, the extractor moves to the next available ruleset in the list. Otherwise, the items in the ruleset are evaluated.
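
As a rough illustration of how nested constraints might collapse into a single Boolean value, consider the Python sketch below. The Constraint class, its field names, and the strict left-to-right evaluation are assumptions made for this example. The stencil format may store operators and precedence differently, and Python's re module stands in for whatever regular expression engine the extractor uses.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Constraint:
    """Hypothetical constraint: a regex, or a group of child constraints."""
    expression: str = ""   # regular expression tested against the document
    operator: str = "AND"  # AND, OR, AND NOT, OR NOT: how it joins the result so far
    children: List["Constraint"] = field(default_factory=list)


def evaluate(constraint: Constraint, text: str) -> bool:
    # A constraint with children acts as a parenthesized group.
    if constraint.children:
        return evaluate_all(constraint.children, text)
    return re.search(constraint.expression, text) is not None


def evaluate_all(constraints: List[Constraint], text: str) -> bool:
    result = True  # no constraints: the ruleset always applies
    for index, constraint in enumerate(constraints):
        value = evaluate(constraint, text)
        if constraint.operator.endswith("NOT"):
            value = not value
        if index == 0:
            result = value
        elif constraint.operator.startswith("AND"):
            result = result and value
        else:
            result = result or value
    return result


document = "INVOICE No. 42\nTotal: 10.00 EUR"
is_invoice = evaluate_all(
    [Constraint(r"INVOICE"), Constraint(r"CREDIT NOTE", "AND NOT")],
    document,
)
print(is_invoice)
```

A ruleset whose constraints evaluate to false would be skipped, and the extractor would continue with the next ruleset in the list.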

Items
An item is a named entity, similar to an input field in an electronic form. It has a unique name, and its value is extracted from the content according to the match patterns that it contains. An item may have multiple match patterns, and each pattern may extract multiple values from a single document. Every value that pattern evaluation produces is added to the candidate list. When all patterns have been processed, the selection rule is applied to choose the appropriate value. The rule is one of the predefined types: first, last, largest, smallest, or all.
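
The selection step can be pictured with the short Python sketch below. The function name select is hypothetical, and comparing candidates numerically for largest and smallest is an assumption. The text does not state how the extractor orders candidate values.

```python
from typing import List, Optional, Union


def select(candidates: List[str], rule: str) -> Union[str, List[str], None]:
    """Apply a selection rule to the candidate list (illustrative only)."""
    if rule == "all":
        return list(candidates)
    if not candidates:
        return None
    if rule == "first":
        return candidates[0]
    if rule == "last":
        return candidates[-1]
    if rule == "largest":
        # Assumption: candidates are normalized numeric strings.
        return max(candidates, key=float)
    if rule == "smallest":
        return min(candidates, key=float)
    raise ValueError(f"unknown selection rule: {rule}")


totals = ["12.50", "100.00", "7.25"]
chosen: Optional[str] = select(totals, "largest")
print(chosen)
```

A rule such as largest is convenient for an item like an invoice total, where several amounts match the pattern but only the biggest one is the value of interest.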

Patterns
An item may contain one or more pattern elements. The content of the pattern element is the regular expression that extracts the input string from the document. The expression may contain multiple capture groups, in which case the group number attribute selects the group that is treated as the input string. The group number is a 1-based index, and the value 0 means that all text captured by the regular expression is used.
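
The group number behaves like group indexing in most regular expression libraries. The Python sketch below illustrates this with the standard re module, which is used here only as a stand-in for the extractor's own engine. The extract function and the sample text are made up for the example.

```python
import re
from typing import List


def extract(expression: str, text: str, group: int = 0) -> List[str]:
    """Collect the non-empty text of the chosen group for every match."""
    values = []
    for match in re.finditer(expression, text):
        value = match.group(group)  # 0 = whole match, 1..n = capture groups
        if value:
            values.append(value)
    return values


sample = "Invoice No: INV-1001\nInvoice No: INV-1002"
pattern = r"Invoice No:\s*(\S+)"

print(extract(pattern, sample, group=1))
print(extract(pattern, sample, group=0))
```

With group 1 only the invoice numbers are returned, while group 0 returns the full matched text including the label.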

When the evaluation of the regular expression produces a non-empty result, the item records a new candidate value. Before it is recorded, the value may be normalized, depending on the data type and format.

The data type of the extracted value may be generic, text, date, or numeric. The generic type is used for values that have no specific formatting rules but usually represent a single keyword (such as an invoice number or ZIP code). The text type is used for larger free-text items. Neither generic nor text values have any formatting applied.
The date and numeric values are normalized according to their format and language attributes. The date value returned by the extractor is always in the locale-independent format yyyy-MM-dd, for example “2017-12-01”.

For instance, consider the input value “December 01, 2017”. When the pattern specifies the format “MMMM dd, yyyy” and the language “en”, parsing that input yields the standard form “2017-12-01”.

Regular expression
A regular expression is a sequence of characters that defines a search pattern. In the simplest case, the pattern is nothing more than a string to search for in the content. Complex expressions, however, allow creating mini-parsers that can handle very complicated matches. The basic concepts of regular expressions are described in the Wikipedia article.

The Stencil Editor executes regular expressions on the sample document. When editing the expression property of a constraint or pattern, press Enter to execute it. The result, if any, is highlighted in the sample view. When multiple capture groups are defined, the active group is underlined. The active group is selected by specifying a non-zero group number.

NOTE: In the current version, date and numeric formats are not applied in the Stencil Editor preview. The “Test” function, however, runs the extractor command-line application in the background, so the test reflects the actual runtime result.

Date format
The extractor uses the format specification to normalize date values so that they can be stored in a uniform representation. The syntax of the date format is language-independent, but how the format is applied to the input data depends on the language code set on the pattern element.

For example, the format “MMMM dd, yyyy” applied to the input string “March 12, 2020” requires the language code “en”, because month names are localized. To parse German-language input, the language code must be set to “de”. The same format then works with the input “März 12, 2020” but fails to parse “March 12, 2020”.

When the stencil cannot be structured with separate rulesets for English and German input, another option is to define alternative patterns on the same item. Multiple patterns are applied in the order in which they are defined, and each pattern has its own language, format, and extraction expression.
The date format used by the extractor is described here: https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1SimpleDateFormat.html
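
The effect of the language code, and of trying alternative patterns in order, can be illustrated with the Python sketch below. It is a hand-written stand-in that handles only input shaped like “MMMM dd, yyyy” and only the languages “en” and “de”. It does not implement the ICU format syntax, and the function names are hypothetical.

```python
import re
from datetime import date
from typing import List, Optional

# Month names per language code (only the two languages used in the example).
MONTHS = {
    "en": ["January", "February", "March", "April", "May", "June", "July",
           "August", "September", "October", "November", "December"],
    "de": ["Januar", "Februar", "März", "April", "Mai", "Juni", "Juli",
           "August", "September", "Oktober", "November", "Dezember"],
}


def parse_month_day_year(value: str, language: str) -> Optional[str]:
    """Parse 'MMMM dd, yyyy'-shaped input; return yyyy-MM-dd or None."""
    match = re.fullmatch(r"(\S+) (\d{1,2}), (\d{4})", value.strip())
    if match is None:
        return None
    name, day, year = match.groups()
    names = MONTHS.get(language, [])
    if name not in names:
        return None  # month name is not valid in this language
    try:
        return date(int(year), names.index(name) + 1, int(day)).isoformat()
    except ValueError:
        return None  # e.g. day 31 in a 30-day month


def first_parse(value: str, languages: List[str]) -> Optional[str]:
    """Try alternative patterns in definition order; the first success wins."""
    for language in languages:
        result = parse_month_day_year(value, language)
        if result is not None:
            return result
    return None


print(parse_month_day_year("March 12, 2020", "en"))
print(parse_month_day_year("March 12, 2020", "de"))
print(first_parse("März 12, 2020", ["en", "de"]))
```

The German parse of the English input fails because “March” is not a German month name, which is why a second pattern with its own language code is needed on the same item.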

Numeric format
More about numeric format patterns can be found here: https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1DecimalFormat.html

Language code
The language code is an attribute of the pattern element. The value stored in the stencil file is the ISO 639-1 two-letter code for the language. Processing of the date and numeric formats depends on the language.
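
To see why numeric processing depends on the language, consider decimal and grouping separators, which differ between English and German. The Python sketch below is only an illustration: the separator table, the function name, and the plain decimal string it returns are assumptions, not the extractor's documented behavior.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional

# (grouping separator, decimal separator) per ISO 639-1 language code.
SEPARATORS = {"en": (",", "."), "de": (".", ",")}


def normalize_number(value: str, language: str) -> Optional[str]:
    """Convert a localized number string to a plain decimal string."""
    if language not in SEPARATORS:
        return None
    grouping, decimal_point = SEPARATORS[language]
    cleaned = value.strip().replace(grouping, "").replace(decimal_point, ".")
    try:
        return str(Decimal(cleaned))
    except InvalidOperation:
        return None  # not a number after removing separators


print(normalize_number("1.234,56", "de"))
print(normalize_number("1,234.56", "en"))
```

Both calls produce the same normalized value, although the input strings use opposite separators. Parsing the German string with the code “en” would silently give a different number, so the language code on the pattern has to match the input.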
