
PDF to XML Conversion

FormTrap's XML conversion is often used to convert PDF documents into XML files for input to computer systems. This was outside the original specification, but is simple to do in five steps:

    Convert to Text using program pdf2txt

    Identify the sender using (V7) Split Rules

    Remove rubbish using (V7) Repaginator

    Use the standard Version 8 Text to XML functions to generate the XML file

    Drop the resulting files into a folder for inspection

Note that this applies ONLY to PDFs that carry text rather than graphics. You may first convert all-graphic PDFs to searchable PDFs, which inserts the required text using OCR. OCR results are NOT 100% accurate in all cases.
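A quick way to tell whether a PDF carried usable text is to inspect the converted output: a graphics-only PDF typically yields an empty or near-empty text file. The Python sketch below is an illustration only and is not part of FormTrap; the function name and the threshold of 20 characters are assumptions.

```python
def has_usable_text(converted_text: str, min_chars: int = 20) -> bool:
    """Return True if the converted text holds at least min_chars
    non-whitespace characters. The threshold is an arbitrary assumption;
    tune it against your own documents."""
    # Spaces, newlines and form feeds all count as whitespace here.
    visible = sum(1 for ch in converted_text if not ch.isspace())
    return visible >= min_chars
```

A file that fails this check is a candidate for OCR before it goes through the remaining steps.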


Conversion of PDF to Text File

You may either drag and drop the PDF onto a Spooler Queue that runs the PDF to Text filter and then save the filter's output, or run the program via a shortcut (or CMD prompt) from your FormTrap development system. The shortcut parameters are shown below. Running it delivers the current "Text from PDF.txt" file into your working folder. Check the alignment of the output and, if necessary, vary the pitch (the digits) in the command line.

This is the remainder of the "Target:" line (following program name):

   -fixed 10 -layout %1 "path\Text from PDF.txt"

where path is the full folder name.

Note: In the FormTrap Spooler, run PDF to Text conversion ahead of the Western filter.
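The command line above can also be assembled from a script, which makes it easier to try several pitch values while checking alignment. This is a minimal Python sketch, not a FormTrap facility: the function names and the program location passed in are assumptions, and only the parameters shown in the "Target:" line are used.

```python
import subprocess

def build_pdf2txt_command(program, pdf_path, out_path, pitch=10):
    """Assemble the argument list in the same order as the Target line:
    -fixed <pitch> -layout <pdf> <output text file>."""
    return [program, "-fixed", str(pitch), "-layout", pdf_path, out_path]

def run_conversion(program, pdf_path, out_path, pitch=10):
    """Run the converter; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_pdf2txt_command(program, pdf_path, out_path, pitch),
                   check=True)
```

Passing the arguments as a list avoids quoting problems with the spaces in "Text from PDF.txt".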


Identifying the PDF Document

Document identification quickly identifies the other party and document type. Update the Rule File used to distribute documents to their respective queues for processing.

Do this with Version 7 Rules files, program shortcut SplitDef as shown.

Rules normally comprise the party name and document type. Below are two entries (parties) in a rule file. Include all parties as individual entries in the same rule file. Rules (highlights) are all "equal to" rules.
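The idea behind "equal to" rules can be illustrated in Python. This sketch does not reproduce the SplitDef rule file format; the class, the positions, the party names and the queue names are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EqualRule:
    line: int    # 1-based line number on the page
    column: int  # 1-based starting column
    text: str    # text that must appear exactly at that position

def rule_matches(rule, page_lines):
    """True if the page carries rule.text at the given line and column."""
    if rule.line > len(page_lines):
        return False
    row = page_lines[rule.line - 1]
    start = rule.column - 1
    return row[start:start + len(rule.text)] == rule.text

def identify(page_lines, entries):
    """entries maps a queue name to its list of rules. Return the first
    queue whose rules ALL match, or None if no entry matches."""
    for queue, rules in entries.items():
        if all(rule_matches(r, page_lines) for r in rules):
            return queue
    return None
```

Each entry pairs a party name with a document type, so two parties that both send a "TAX INVOICE" are still routed to different queues.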

Copy the modified rules file to the FormTrap Spooler, set up a new queue to handle this party, and change the Process tab of the queue that holds the split rule so that this party's documents are sent to the correct queue.

See here for information on the Version 7 Split Rules program.


Removing Rubbish

This and the following processes take place in the party Queue in FormTrap Spooler. This process may not be required. However, if the detail for a product can split across pages, or the pages carry a lot of redundant inter-page information, use the Repagination process to remove rubbish from the file ahead of XML conversion.

Repagination is defined using program shortcut RePageDef as shown.

Five elements are normally defined, as shown below, in this order:

   Header - first page (down to and including detail line headers)

   Trailer - document total (include everything, this is to avoid removing redundant lines later)

   Detail - page footer (with property, Suppressed ticked)

   Detail - page header for second and subsequent pages (also Suppressed)

   Detail - required lines (define as non-blank, which keeps all non-blank lines)

Additional Details may be defined ahead of the required Detail line for specific inter-page connectors such as "... continues".
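The effect of these definitions can be sketched in Python. This is an illustration of the logic only, not the RePageDef format or FormTrap's implementation; the function name and the two marker parameters are assumptions. It keeps the first-page header, suppresses later page headers and all page footers, and keeps every other non-blank line (detail and trailer).

```python
def repaginate(pages, heading_text, footer_text):
    """pages is a list of pages, each a list of text lines.
    heading_text marks the detail column-heading line that ends a page header.
    footer_text marks a page-footer line to suppress."""
    out = []
    for page_no, lines in enumerate(pages):
        # Find the last line of the page header (-1 if no heading is found).
        header_end = next(
            (i for i, ln in enumerate(lines) if heading_text in ln), -1)
        if page_no == 0:
            # Header: keep the first page's header, including the headings.
            out.extend(lines[:header_end + 1])
        # Later page headers are skipped (suppressed).
        for ln in lines[header_end + 1:]:
            if not ln.strip():
                continue  # drop blank lines
            if footer_text in ln:
                continue  # suppressed page footer
            out.append(ln)  # required detail or trailer line
    return out
```

For example, a two-page invoice whose pages each end in a "Page n of 2" line collapses into one header, one continuous run of detail lines, and the document total.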

Copy the modified, tested repagination ".rpg" file to the spooler: place it in the folder "Repagination Rules" within the %fthome folder, creating that folder if it is not present. The address of the %fthome folder is shown in FormTrap Spooler under Setup, Core Components.

Use FormTrap Spooler's Setup, Filters to define this repagination, using the above rules file.

Run this filter first in the party queue, set up on the Filters tab as a Pre-identification filter.

Run the standard text to XML conversion in the party queue to generate your XML file, and ensure the XML names follow the standards required by the systems that will receive the file.

See here for information on the Version 7 Re-Pagination program.
