Document Composition

Document layout
The document layout process starts by loading the document template in its initial state. This usually means that the data DOM is unpopulated, variable fields are blank or contain default values, and repeatable fragments have a single instance.

The data acquisition process retrieves values from a structured source such as XML or JSON and populates the data DOM.

Once the data is loaded, the content (<design>) part gets populated. In this process, the original XML or JSON file that supplied the data is no longer used. Elements in the document design are bound to the data DOM using path statements. An element path is a string that describes the location of an element in the document. Data binding uses the same path syntax to address elements in the data tree, e.g. /document/data/$invoice/$receiver/$mail. When a fragment in the document design is bound to a repeated element in the data, e.g. /document/data/$invoice/$row, the content populator creates multiple instances of that fragment. The populator does not consider spacing, overflow or line breaking. It simply instantiates all child elements and fills the fields with text content. Optional field formatting rules are applied at this stage as well.
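
The binding step can be pictured with a small sketch. It assumes, purely for illustration, that the data DOM is a plain tree of nodes with a name, an optional value and children, and that a $name path segment selects all children with that name. The names resolvePath and populateFragment and the node shape are hypothetical and not part of the product's API.

```javascript
// Hypothetical data tree: each node has a name, an optional value and children.
const dataRoot = {
    name: 'data',
    children: [{
        name: 'invoice',
        children: [
            { name: 'receiver', children: [{ name: 'mail', value: 'jane@example.com', children: [] }] },
            { name: 'row', value: 'Widget', children: [] },
            { name: 'row', value: 'Gadget', children: [] },
        ],
    }],
};

// Resolve a binding path to every matching node.
// Assumed semantics: "$name" selects all children with that name.
const resolvePath = (root, path) => {
    const prefix = '/document/data';
    if (!path.startsWith(prefix)) {
        throw new Error(`unsupported path: ${path}`);
    }
    const segments = path.slice(prefix.length).split('/').filter(Boolean);
    return segments.reduce((nodes, segment) => {
        const name = segment.replace(/^\$/, '');
        return nodes.flatMap((node) => node.children.filter((child) => child.name === name));
    }, [root]);
};

// Create one fragment instance per matched data node and fill in its text.
const populateFragment = (root, path) =>
    resolvePath(root, path).map((node) => ({ boundTo: path, text: node.value }));

const rows = populateFragment(dataRoot, '/document/data/$invoice/$row');
const mail = populateFragment(dataRoot, '/document/data/$invoice/$receiver/$mail');
console.log(rows.length, mail[0].text); // 2 jane@example.com
```

The repeated $row element yields two fragment instances, while the $mail path yields exactly one. No layout decisions are made at this point.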

When the whole design content is populated, the document layout rules are applied. In this stage text formatting and line breaking happen and overflows get resolved. An overflow occurs when a container fragment holds more child elements than fit into the available area with the specified layout method. For example, vertical top-to-bottom stacking places invoice rows into the container one by one, and when it runs out of space, the overflow method is invoked. The overflow method may then duplicate the parent container fragment and transfer the overflow content (the child fragments that did not fit) into the cloned container. In simple terms, a new page is created and layout continues recursively.
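
The overflow behaviour for vertical stacking can be sketched as follows. This is a simplified, single-level illustration under stated assumptions: fragments are plain objects with a fixed height, the container has a fixed height, and layoutVertical is a hypothetical name. The real layout engine works recursively on nested containers.

```javascript
// Stack child fragments top to bottom. When the next child does not fit,
// start a cloned container (a new page) and continue with the rest.
const layoutVertical = (children, containerHeight) => {
    const pages = [[]];
    let used = 0;
    for (const child of children) {
        const current = pages[pages.length - 1];
        // Overflow only when the container already holds something, so a
        // child taller than the container cannot trigger endless cloning.
        if (used + child.height > containerHeight && current.length > 0) {
            pages.push([]);
            used = 0;
        }
        pages[pages.length - 1].push(child);
        used += child.height;
    }
    return pages;
};

// Seven invoice rows of height 30 in a container of height 100.
const invoiceRows = Array.from({ length: 7 }, (_, i) => ({ id: i + 1, height: 30 }));
const pages = layoutVertical(invoiceRows, 100);
console.log(pages.map((page) => page.map((row) => row.id)));
// [ [ 1, 2, 3 ], [ 4, 5, 6 ], [ 7 ] ]
```

Three rows fit into each container, so the seven rows end up on three pages.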

A completely formatted document has all elements in their final state, ready for output. The renderer loads a driver module and translates the content into graphical primitives such as lines, rectangles, polygons, text and images. Each primitive ends up as a call to the abstract rendering interface, so at this level the renderer works the same way for every concrete output format. This also allows the renderer output to be recorded and replayed at a later stage for multi-format output or for diagnostic purposes.
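
A minimal sketch of the record-and-replay idea, assuming a tiny rendering interface with only line, rectangle and text methods. The class names RecordingDriver and SvgDriver and the method signatures are hypothetical; the actual driver interface is not described in this document.

```javascript
// Hypothetical abstract rendering interface: every primitive is one method call.
// This driver produces no output; it only remembers the calls it receives.
class RecordingDriver {
    constructor() {
        this.calls = [];
    }
    line(x1, y1, x2, y2) { this.calls.push(['line', [x1, y1, x2, y2]]); }
    rectangle(x, y, width, height) { this.calls.push(['rectangle', [x, y, width, height]]); }
    text(x, y, content) { this.calls.push(['text', [x, y, content]]); }
    // Replay the recorded calls against any driver with the same methods.
    replay(target) {
        for (const [method, args] of this.calls) {
            target[method](...args);
        }
    }
}

// A toy concrete driver that emits SVG elements (text is not escaped here).
class SvgDriver {
    constructor() {
        this.elements = [];
    }
    line(x1, y1, x2, y2) {
        this.elements.push(`<line x1="${x1}" y1="${y1}" x2="${x2}" y2="${y2}" stroke="black"/>`);
    }
    rectangle(x, y, width, height) {
        this.elements.push(`<rect x="${x}" y="${y}" width="${width}" height="${height}" fill="none" stroke="black"/>`);
    }
    text(x, y, content) {
        this.elements.push(`<text x="${x}" y="${y}">${content}</text>`);
    }
}

// The renderer only talks to the interface, so it is format independent.
const render = (driver) => {
    driver.rectangle(10, 10, 200, 50);
    driver.text(20, 40, 'Invoice 123456');
    driver.line(10, 60, 210, 60);
};

const recorder = new RecordingDriver();
render(recorder);          // record once
const svg = new SvgDriver();
recorder.replay(svg);      // replay later into a concrete format
console.log(svg.elements.join('\n'));
```

The same recording could be replayed into several drivers in turn, which is what makes multi-format output from a single layout pass possible.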

Scripting API
At each of the above steps, events are published to the JavaScript engine, which runs the document scripts. This makes it possible for script code to modify the document and implement custom logic. The most common scenario is the calculation of additional values after the document data DOM is populated.

The document script is a JavaScript file that in its minimal form looks like this:

((merge) => {});

The merge context object provides logging facilities and access to the current document, and can be used to subscribe to events. The following script issues an information message (level 2) and subscribes to the “data populated” event when the script is loaded.

((merge) => {
    merge.log.info('extension loaded');
    merge.subscribe('document:data:populated', () => {
        merge.log.info('data populated');
    });
});

The merge process triggers the following events:

  • document:loaded
  • document:data:loaded
  • document:data:populated
  • document:layout:processed
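
As a sketch, a script can subscribe to all four events and log each one as it arrives. The script body uses only merge.subscribe and merge.log.info as shown earlier. The createFakeMerge helper and its publish method are stand-ins invented here so the example runs outside the engine, and the publishing order in the usage part is an assumption based on the order of the processing steps.

```javascript
// The document script; in the product it would be written as ((merge) => { ... });
const script = (merge) => {
    const events = [
        'document:loaded',
        'document:data:loaded',
        'document:data:populated',
        'document:layout:processed',
    ];
    events.forEach((name) => {
        merge.subscribe(name, () => merge.log.info(`event: ${name}`));
    });
};

// Stand-in for the merge context (assumption, only for running this sketch).
const createFakeMerge = () => {
    const handlers = new Map();
    const messages = [];
    return {
        messages,
        log: { info: (text) => messages.push(text) },
        subscribe: (name, handler) => {
            handlers.set(name, [...(handlers.get(name) || []), handler]);
        },
        publish: (name) => (handlers.get(name) || []).forEach((handler) => handler()),
    };
};

const fake = createFakeMerge();
script(fake);
const assumedOrder = [
    'document:loaded',
    'document:data:loaded',
    'document:data:populated',
    'document:layout:processed',
];
assumedOrder.forEach((name) => fake.publish(name));
console.log(fake.messages);
```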

The rendering phase is not accessible to scripts in this version. A rendering step may be added here later, or it may eventually appear as part of another scripting model for process management. The current scripting model deals with the document itself and does not consider processing steps outside the DOM. Examples of such external events are retrieval of the data and document files, sending the resulting PDF by email, or posting it to a REST API. They deserve attention, but we do not want to mix apples with oranges.

The following script traverses the document data DOM:

((merge) => {
    // Log the name of every item in the given collection.
    const reportAll = (items) => {
        merge.log.info('data items:');
        items.forEach((item) => {
            merge.log.info(` item.name: ${item.name}`);
        });
    };

    // Run once the data DOM has been populated.
    merge.subscribe('document:data:populated', () => {
        reportAll(merge.document.data.children);
    });
});

Consider this simple data DOM:

<data>
    <item name="InvoiceNumber" />
</data>

To modify data DOM values from the script after the XML data is loaded:

((merge) => {
    merge.subscribe('document:data:loaded', () => {
        merge.log.info(`InvoiceNumber = ${merge.document.data.$InvoiceNumber}`);
        merge.document.data.$InvoiceNumber = '123456';
        merge.log.info(`InvoiceNumber = ${merge.document.data.$InvoiceNumber}`);
    });
});
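
The same mechanism can be used to calculate additional values. The sketch below mirrors the example above and assumes that items are readable and writable as string properties named $Name. The item names $Net, $VatRate and $Total are hypothetical, and the fake object is a stand-in so the example runs outside the engine.

```javascript
// The document script; in the product it would be written as ((merge) => { ... });
const script = (merge) => {
    merge.subscribe('document:data:loaded', () => {
        const data = merge.document.data;
        // $Net, $VatRate and $Total are hypothetical item names.
        const net = Number(data.$Net);
        const rate = Number(data.$VatRate);
        // Store the calculated value as text with two decimals.
        data.$Total = (net * (1 + rate / 100)).toFixed(2);
        merge.log.info(`Total = ${data.$Total}`);
    });
};

// Stand-in for the merge context (assumption, only for running this sketch).
const fake = {
    handlers: {},
    document: { data: { $Net: '200.00', $VatRate: '20' } },
    log: { info: (text) => console.log(text) },
    subscribe(name, handler) { this.handlers[name] = handler; },
};

script(fake);
fake.handlers['document:data:loaded'](); // prints "Total = 240.00"
```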

Output formats
The output drivers implemented in the current version include PDF, PostScript, SVG, Windows printing and the internal document format. The internal format can be used to serialize the formatted output as an XML document, which can later be opened in UI tools, archived, rendered into PDF or PostScript, or printed.

Command Line Tools
While we are working on an integrated process setup and management system, the document composition tools can be run from the command line.

Extract
The extract command line application runs the input file through a filter pipeline to obtain Unicode text and processes the result with the extractor. Extracted data is written to the output as XML or JSON. Multiple stencil files can be used to configure the extractor.

Example:

extract.exe -in input.pdf -out output.xml -config .\stencils\*.stencil

Additional parameters:

    -in         [filename] input file
    -out        [filename] output file
    -mime       [mime] input mime type 
    -outmime    [mime] output mime type (application/xml | application/json)
    -config     [path] stencil file path, use pattern * for multiple files
    -log        [filename] name of log file
    -logappend  causes the log to be appended to existing file
    -loglevel   [1..5] where lower level shows more messages
    -bin        [folder] location of program data files
    -filters    [folder] location of filter files
    -wd         [folder] set working directory
    -dump       [folder] write input filter output (Unicode text) to file
    -version    displays version info

Merge
The merge command line application loads a document template and a data file. It populates the document's data DOM with values from XML or JSON, runs the layout process and renders the result using one of the drivers: PDF, PostScript, Windows printing or XML.

Example:

merge.exe -in input.shape -data data.xml -out output.ps

Additional parameters:

    -in         [filename] input file
    -out        [filename] output file
    -print      [name] name of Windows printer, not supported on Linux
    -mime       [mime] output mime type 
    -split      splits pages into separate files
    -data       [filename] data file
    -datamime   [mime] data file mime type (application/xml | application/json)
    -fonts      [folder] font folder
    -log        [filename] log file
    -logappend  causes the log to be appended to existing file
    -loglevel   [1..5] where lower level shows more messages
    -filters    [folder] location of filter files
    -bin        [folder] location of program data files
    -wd         [folder] set working directory
    -version    displays version info
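
When merge is driven from a Node.js script, the argument list can be assembled programmatically. The sketch below uses only flags from the list above. The helper name buildMergeArgs and the file names are hypothetical, and the example prints the command line instead of running it.

```javascript
// Build the argument list for merge.exe from an options object.
// Only flags listed in the parameter table are used.
const buildMergeArgs = ({ template, data, out, logLevel }) => {
    const args = ['-in', template, '-data', data, '-out', out];
    if (logLevel !== undefined) {
        args.push('-loglevel', String(logLevel));
    }
    return args;
};

const args = buildMergeArgs({
    template: 'input.shape',
    data: 'data.json',
    out: 'output.pdf',
    logLevel: 2,
});
console.log(['merge.exe', ...args].join(' '));
// merge.exe -in input.shape -data data.json -out output.pdf -loglevel 2

// To actually run the tool from Node.js (requires merge.exe on the PATH):
// require('child_process').spawnSync('merge.exe', args, { stdio: 'inherit' });
```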

The MIME type parameters may be omitted. In that case the content type is detected automatically from the filename extension and the content header. Detection is not always possible, especially when the filename has no extension or the file format lacks a fixed header signature.
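
The general technique can be sketched as follows. This is not the tool's actual detection code: the tables cover only a handful of types, the function name detectMime is hypothetical, and checking the header before the extension is an assumption. The byte signatures themselves are the well-known ones for these formats.

```javascript
// Well-known header signatures (magic bytes) for a few file formats.
const signatures = [
    { mime: 'application/pdf', bytes: [0x25, 0x50, 0x44, 0x46, 0x2d] },          // "%PDF-"
    { mime: 'application/postscript', bytes: [0x25, 0x21, 0x50, 0x53] },         // "%!PS"
    { mime: 'image/png', bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
    { mime: 'image/jpeg', bytes: [0xff, 0xd8, 0xff] },
    { mime: 'image/gif', bytes: [0x47, 0x49, 0x46, 0x38] },                      // "GIF8"
];

// Fallback table for formats without a fixed signature.
const extensions = {
    '.txt': 'text/plain',
    '.xml': 'application/xml',
    '.json': 'application/json',
    '.pdf': 'application/pdf',
    '.ps': 'application/postscript',
    '.png': 'image/png',
};

// Try the header first, then fall back to the extension. A result of
// undefined means detection failed and the MIME type must be given explicitly.
const detectMime = (filename, header) => {
    const bySignature = signatures.find(({ bytes }) =>
        bytes.every((byte, index) => header[index] === byte));
    if (bySignature) {
        return bySignature.mime;
    }
    const dot = filename.lastIndexOf('.');
    return dot >= 0 ? extensions[filename.slice(dot).toLowerCase()] : undefined;
};

console.log(detectMime('invoice', Buffer.from('%PDF-1.7')));  // application/pdf
console.log(detectMime('data.json', Buffer.from('{"a":1}'))); // application/json
console.log(detectMime('notes', Buffer.from('hello')));       // undefined
```

The last call shows the failure case described above: no extension and no recognizable header, so the caller has to supply the MIME type.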

The following MIME types are currently recognized by content type detection. Note that detection here means the ability to recognize a file type, not the ability to use the file as input or output in any concrete scenario.

text/plain
text/html 
application/xml 
application/json 
application/pdf 
application/postscript 
application/raw 
application/rtf 
application/vnd.ms-excel 
image/wmf 
image/emf 
image/tiff 
image/jpeg 
image/png 
image/gif 
image/bmp 
image/svg+xml 
image/jp2

The following types are used for internal formats and are specific to the application:

application/vnd.ws-doc
application/vnd.ws-stencil 
application/vnd.ws-extractor 
application/vnd.ws-settings 
application/vnd.ws-filter 
application/vnd.ws-flow
application/tcml