AutoCorrection

The most powerful feature of the TableExtractor is AutoCorrection. You can specify a text template for each column you want to extract. If the OCR process has made a mistake (e.g. by misreading a 0 (zero) as an O (letter)), the TableExtractor can correct that character.

Text patterns are described in QRX notation (similar to regular expressions). The following manual gives you an overview of QRX.

The QRX Manual

QRX offers an error-tolerant way to match noisy or corrupted text against a specified text template. To do so, QRX uses its own way of specifying text structures: the QRX notation. Like other forms of templates (e.g. regular expressions in Perl), it can be used to express what you know about a certain piece of text.

The following example illustrates this:

Assume we have an OCR-generated text: My phone mumber is 030 2323. If we search for “phone number” with a conventional tool, we won’t find anything, because the text contains an m where the search text has an n. With QRX we find this match and can decide whether the error is acceptable.

Getting started: Matching basic strings

In the introduction we matched two basic strings against each other. In QRX, the search text itself serves as the template:

QRX             Text to match    Error
phone number    phone mumber     1 character
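
The error count in this table can be reproduced with an ordinary edit distance. The sketch below is an assumption for illustration only: the manual does not say which metric QRX uses internally, and `edit_distance` is a hypothetical helper, not part of QRX.

```python
def edit_distance(template, text):
    """Levenshtein distance: the minimum number of single-character
    substitutions, insertions and deletions that turn template into text."""
    prev = list(range(len(text) + 1))
    for i, t_ch in enumerate(template, 1):
        cur = [i]
        for j, x_ch in enumerate(text, 1):
            cur.append(min(
                prev[j] + 1,                   # template character missing in text
                cur[j - 1] + 1,                # extra character in text
                prev[j - 1] + (t_ch != x_ch),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


print(edit_distance("phone number", "phone mumber"))  # 1
```

A caller could then accept a match only when the distance stays below a threshold of its own choosing.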


The common character classes

In many cases we know which type of character (digit, uppercase letter, lowercase letter, special character) appears at a given position in the text. QRX offers a simple way to express this. There are 4 common classes:

Class    Usage
N        Digits (0-9)
T        Uppercase letters (A-Z)
t        Lowercase letters (a-z)
S        Special characters

Example: A US ZIP code consists of 5 digits (e.g. “88123”).

The QRX syntax for a single digit is “N”, so a ZIP code can be expressed as “NNNNN”.

QRX      Text to match    Error
NNNNN    88123            0 characters

Example: A date (German style) is written like “12.02.1976”. We combine literal characters (here “.”) with character classes:

QRX           Text to match    Error
NN.NN.NNNN    12.02.1976       0 characters
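
To make the class semantics concrete, here is a minimal sketch that counts mismatches position by position for a template and a text of equal length. The function names `in_class` and `count_errors` are hypothetical, the ranges for S are taken from the Notation table of this manual, and real QRX matching is not restricted to equal lengths.

```python
SPECIAL_RANGES = [(1, 31), (33, 47), (58, 64), (91, 96), (123, 255)]


def in_class(symbol, ch):
    """True if ch belongs to the QRX class named by symbol (N, T, t or S)."""
    if symbol == "N":
        return "0" <= ch <= "9"
    if symbol == "T":
        return "A" <= ch <= "Z"
    if symbol == "t":
        return "a" <= ch <= "z"
    if symbol == "S":
        return any(lo <= ord(ch) <= hi for lo, hi in SPECIAL_RANGES)
    raise ValueError("not a class symbol: " + symbol)


def count_errors(template, text):
    """Count positions where text violates the template (equal lengths only)."""
    if len(template) != len(text):
        raise ValueError("this sketch only compares equal lengths")
    errors = 0
    for symbol, ch in zip(template, text):
        ok = in_class(symbol, ch) if symbol in "NTtS" else symbol == ch
        errors += not ok
    return errors


print(count_errors("NNNNN", "88123"))            # 0
print(count_errors("NNNNN", "88l23"))            # 1 (lowercase L read instead of 1)
print(count_errors("NN.NN.NNNN", "12.02.1976"))  # 0
```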


Another example, combining literal text with alternatives (described in the next section):

QRX                             Text to match     Error
My (first|second|third) beer    My second beEr    1 character


Alternatives

In QRX we can give several alternatives within a template. The alternatives are separated by the character “|”.

QRX                Text to match    Error
Jan|Feb|Mar|Apr    Fed              1 character

Sometimes we have to make sure that the OR separator applies to the right terms. Brackets can be used for this. Example: We have dates where the year consists of either 2 or 4 digits. Without brackets, the alternatives are the whole of “NN.NN.NN” and the whole of “NNNN”, so we get:

QRX              Text to match    Error
NN.NN.NN|NNNN    12.02.1976       2 characters

Using brackets we get:

QRX                Text to match    Error
NN.NN.(NN|NNNN)    12.02.1976       0 characters
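
The effect of the brackets can be checked with a small sketch that expands a template into the plain templates it stands for and keeps the smallest distance. This is an illustration under stated assumptions, not the QRX implementation: `expand`, `distance` and `best_match` are hypothetical names, the expansion ignores “@”, intervals and square brackets, and the distance is a Levenshtein distance in which a class symbol matches any character of its class.

```python
def expand(qrx):
    """Expand '|' and '( )' into the list of plain templates they describe."""
    def selection(i):
        results = []
        while True:
            seqs, i = concatenation(i)
            results.extend(seqs)
            if i < len(qrx) and qrx[i] == "|":
                i += 1                      # another alternative follows
            else:
                return results, i

    def concatenation(i):
        seqs = [""]
        while i < len(qrx) and qrx[i] not in "|)":
            if qrx[i] == "(":
                inner, i = selection(i + 1)
                i += 1                      # skip the closing bracket
            else:
                inner, i = [qrx[i]], i + 1
            seqs = [s + t for s in seqs for t in inner]
        return seqs, i

    return selection(0)[0]


def symbol_matches(symbol, ch):
    if symbol == "N":
        return "0" <= ch <= "9"
    if symbol == "T":
        return "A" <= ch <= "Z"
    if symbol == "t":
        return "a" <= ch <= "z"
    return symbol == ch                     # literal character


def distance(template, text):
    """Levenshtein distance where class symbols match their whole class."""
    prev = list(range(len(text) + 1))
    for i, symbol in enumerate(template, 1):
        cur = [i]
        for j, ch in enumerate(text, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (not symbol_matches(symbol, ch))))
        prev = cur
    return prev[-1]


def best_match(qrx, text):
    return min(distance(t, text) for t in expand(qrx))


print(expand("NN.NN.(NN|NNNN)"))                      # ['NN.NN.NN', 'NN.NN.NNNN']
print(best_match("Jan|Feb|Mar|Apr", "Fed"))           # 1
print(best_match("NN.NN.NN|NNNN", "12.02.1976"))      # 2
print(best_match("NN.NN.(NN|NNNN)", "12.02.1976"))    # 0
```

Expanding every alternative is only practical for short templates; it is used here because it keeps the sketch easy to follow.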


Length Intervals

Up to this point all templates had a fixed length. In many cases we have to match information of varying length but known content (e.g. 5 to 10 uppercase letters or 3 to 5 digits). With QRX we can put an interval after a character class to express this: the minimum and maximum length are given in curly brackets.

QRX                Text to match    Error
My t{5,12} beer    My ninth beer    0 characters

 
If the interval should accept more than one kind of character, we can list the allowed characters in square brackets.

QRX                        Text to match            Error
My number is [01]{1,10}    My number is 001x0001    1 character
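
One way to picture an interval X{n,m} is as n required copies of X followed by m - n optional copies that may be left unused at no cost. The sketch below applies this idea with an edit-distance table. It is an assumption about how such matching could work, not a description of QRX internals, and `lit`, `rep` and `errors` are hypothetical helpers that build the template by hand instead of parsing QRX syntax.

```python
def lit(s):
    """One required element per literal character of s."""
    return [((lambda ch, want=c: ch == want), False) for c in s]


def rep(accepts, n, m):
    """X{n,m}: n required copies of X, then m - n optional copies."""
    return [(accepts, False)] * n + [(accepts, True)] * (m - n)


def errors(elements, text):
    """Edit distance between the element list and text.
    Leaving an optional element unused is free; leaving a required one costs 1."""
    prev = list(range(len(text) + 1))
    for accepts, optional in elements:
        skip_cost = 0 if optional else 1
        cur = [prev[0] + skip_cost]
        for j, ch in enumerate(text, 1):
            cur.append(min(
                prev[j] + skip_cost,                    # element not used
                cur[j - 1] + 1,                         # extra character in text
                prev[j - 1] + (0 if accepts(ch) else 1),  # element consumes ch
            ))
        prev = cur
    return prev[-1]


def lower(ch):
    return "a" <= ch <= "z"


def zero_or_one(ch):
    return ch in "01"


beer = lit("My ") + rep(lower, 5, 12) + lit(" beer")      # My t{5,12} beer
number = lit("My number is ") + rep(zero_or_one, 1, 10)   # My number is [01]{1,10}

print(errors(beer, "My ninth beer"))             # 0
print(errors(number, "My number is 001x0001"))   # 1
```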


The De-reference operator

As we have seen, several characters have a special meaning in the QRX syntax (“N”, “t”, “T”, “S”, “[”, “|”, ...). To use one of these characters as a literal character, insert the “@” character directly before it.

Example: The uppercase “N” in “myNumber” is interpreted as the digit class, which results in an error distance of 1. To avoid this, write “@N”.

Note: This situation is uncommon; in almost every case there is no need to look out for such constellations.

QRX         Text to match    Error
myNumber    myNumber         1 character (N is the symbol for digits)

With the @ operator we get:

QRX          Text to match    Error
my@Number    myNumber         0 characters
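
The difference between the two templates becomes visible once the template is split into tokens. The following sketch uses hypothetical helpers (`tokenize`, `count_errors`) and handles only literals, the four class symbols and “@”; brackets and intervals are left out to keep it short. The ranges for S are taken from the Notation table.

```python
CLASS_TESTS = {
    "N": lambda ch: "0" <= ch <= "9",
    "T": lambda ch: "A" <= ch <= "Z",
    "t": lambda ch: "a" <= ch <= "z",
    "S": lambda ch: any(lo <= ord(ch) <= hi for lo, hi in
                        [(1, 31), (33, 47), (58, 64), (91, 96), (123, 255)]),
}


def tokenize(qrx):
    """Split a template into (kind, value) tokens.
    '@x' yields the literal x, even when x is a class symbol."""
    tokens, i = [], 0
    while i < len(qrx):
        if qrx[i] == "@" and i + 1 < len(qrx):
            tokens.append(("literal", qrx[i + 1]))
            i += 2
        elif qrx[i] in CLASS_TESTS:
            tokens.append(("class", qrx[i]))
            i += 1
        else:
            tokens.append(("literal", qrx[i]))
            i += 1
    return tokens


def count_errors(qrx, text):
    """Position-wise mismatches; template and text must have equal length."""
    tokens = tokenize(qrx)
    if len(tokens) != len(text):
        raise ValueError("this sketch only compares equal lengths")
    errors = 0
    for (kind, value), ch in zip(tokens, text):
        ok = CLASS_TESTS[value](ch) if kind == "class" else value == ch
        errors += not ok
    return errors


print(tokenize("my@Number")[2])               # ('literal', 'N')
print(count_errors("myNumber", "myNumber"))   # 1
print(count_errors("my@Number", "myNumber"))  # 0
```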


Notation

Expression    Match
x             The character x, if x is not a special character
@x            The character x (de-reference)
X|Y           X or Y
(X)           Group
X{n,m}        X with minimum n, maximum m repetitions (0 < n <= m)
[abc]         Characters a, b or c
N             0 to 9 (digits)
T             A to Z (uppercase letters)
t             a to z (lowercase letters)
S             Special characters: ASCII 1-31, 33-47, 58-64, 91-96, 123-255.
              Equals Perl notation: [^0-9A-Za-z\x7B-\x80]


Reserved Characters

QRX Version    Reserved Characters
1.0            (){}[]|,NTtS@;


Comparison to regular expressions in PERL

QRX supports only a subset of the regular expression syntax. The following expressions are not available in QRX:

Unsupported Perl term    Description
X*                       X any number of times
X+                       X at least once
X?                       X once or not at all
^                        Start of line
$                        End of line
.                        Any character
[a-z]                    Characters 'a' to 'z'. Note: in QRX notation: t
\n                       New line
\t                       Tab
\d                       [0-9]. Note: in QRX notation: N

 

Examples

QRX Expression                                        Example match
Hello?                                                Hello?
@Telephone: N{3,5}/N{3,20}                            Telephone: 0123/45678
Da@te: (N{1,2}.N{1,2}.N{2,4}|N{4,4}/N{2,2}/N{2,2})    Date: 2004/03/08
Vorname: [Tt]{2,20}, @Nachname: [Tt]{2,20}.           Vorname: Fritz, Nachname: Mueller.
Produc@t code: TTT-N{5,5}-[ABCDEFN]                   Product code: IBN-12345-3
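
The example matches can be checked by translating each QRX expression into an ordinary Python regular expression. This is a sketch under assumptions: `qrx_to_regex` and `matches_exactly` are hypothetical helpers, the S ranges are taken from the Notation table, and the result only tests for an exact (zero-error) match, so it shows the syntax but not the error tolerance of QRX.

```python
import re

# Class ranges written so that they can sit inside a Python character class.
RANGES = {
    "N": "0-9",
    "T": "A-Z",
    "t": "a-z",
    "S": r"\x01-\x1f\x21-\x2f\x3a-\x40\x5b-\x60\x7b-\xff",
}


def qrx_to_regex(qrx):
    """Translate a QRX expression into an exact-match Python regex."""
    out, i, in_class = [], 0, False
    while i < len(qrx):
        c = qrx[i]
        if c == "@":                        # de-reference: next char is literal
            if i + 1 >= len(qrx):
                raise ValueError("'@' at end of expression")
            i += 1
            out.append(re.escape(qrx[i]))
        elif c in RANGES:                   # N, T, t, S
            out.append(RANGES[c] if in_class else "[" + RANGES[c] + "]")
        elif c in "[]":
            in_class = (c == "[")
            out.append(c)
        elif c == "{":                      # {n,m} has the same meaning in Python
            end = qrx.index("}", i)
            out.append(qrx[i:end + 1])
            i = end
        elif c in "()|":
            out.append(c)
        else:                               # everything else is a literal
            out.append(re.escape(c))
        i += 1
    return "".join(out)


def matches_exactly(qrx, text):
    return re.fullmatch(qrx_to_regex(qrx), text) is not None


EXAMPLES = [
    ("Hello?", "Hello?"),
    ("@Telephone: N{3,5}/N{3,20}", "Telephone: 0123/45678"),
    ("Da@te: (N{1,2}.N{1,2}.N{2,4}|N{4,4}/N{2,2}/N{2,2})", "Date: 2004/03/08"),
    ("Vorname: [Tt]{2,20}, @Nachname: [Tt]{2,20}.", "Vorname: Fritz, Nachname: Mueller."),
    ("Produc@t code: TTT-N{5,5}-[ABCDEFN]", "Product code: IBN-12345-3"),
]
for qrx, text in EXAMPLES:
    print(matches_exactly(qrx, text), qrx)  # True for each row
```

Note that “?” and “.” are translated as literal characters, in line with the comparison table above, which lists them as Perl operators that QRX does not support.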

QRX expressions in Backus-Naur

Term                  Definition
QRX                   Selection
Selection             Concatenation | Concatenation '|' Selection
Concatenation         '(' Selection ')' | Concatenation Concatenation | CharacterSpec
CharacterSpec         ( '@' AnyCharacter | 'N' | 'T' | 't' | 'S' | CharacterClass | NonSpecialCharacter ) ( AreaSpec )?
CharacterClass        '[' ( CharacterClassItem )+ ']'
CharacterClassItem    '@' AnyCharacter | 'N' | 'T' | 't' | 'S' | NonSpecialCharacter
AreaSpec              '{' PositiveInteger ',' PositiveInteger '}'
PositiveInteger       [0-9][0-9]*