Glossary of Imaging Terms
A
Accuracy percent: Measures the proportion of characters correctly interpreted
by a recognition engine. Can be misleading, because the engine only reports
the characters that it fails to identify or the errors that are caught
through post-processing; characters that it misreads with confidence are
not counted (see also substitutions).
Anchor Points: Refers to crosses or other marks placed in the corners of
documents to allow them to be consistently lined up within a computer
system's memory. This enables accurate finding of data and lining up of
templates.
Audit Trail: A printed report identifying where in the scanning process
each document is located.
Autofeeder: A device which is either integral to or added on to a paper
scanner to accept a stack of paper and automatically feed pages. Autofeeders
vary in their ability to accept differing thicknesses, sizes and qualities
of paper. As paper transitions from thick to thin, double feeds can occur.
B
Barcode:
Consists of a series of thin and thick black lines that, when placed in
defined patterns, represent a numeric or alphabetic character. Various
symbologies define these patterns. Barcodes can be one dimensional, like
the ones found on retail packages, or two dimensional (known as 2D). 2D
barcodes, which consist of a matrix of black and white blocks, can contain
large amounts of information. The most popular is PDF-417, developed by
Symbol Technologies.
Barcode Recognition: Interpreting a barcode from its scanned image.
Bitonal:
A term used to mean black and white images with no grayscale. Traditionally
the main way to capture and store images of documents in document management
systems.
Batching:
Collecting multiple pages together and separating with batch separators.
Batches are either fixed quantities of single pages which can be counted
to identify double feeds (see autofeeders), or consist of multiple levels
often based on three levels of index. Recently there has been some interest
in using color coded bars scanned with a color scanner to identify batches.
Batch Control Sheets: Coded pages, usually with barcodes or OCRable
characters, that automatically separate pages within a batch or separate
batches.
Book Scanning: Requires either specialized scanners or for the spine to
be cut off. Flatbed scanners damage the spine and provide a fuzzy image
at the edges.
C
Check Digit:
An extra digit, computed by a mathematical formula, that is added onto a
field. When the field is captured, the check digit can be recomputed and
compared to verify that the data was converted correctly.
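The glossary does not name a particular formula. As an illustration only, the sketch below uses the Luhn (mod 10) scheme, one widely used check-digit formula; the function names are hypothetical.

```python
def luhn_check_digit(payload: str) -> str:
    """Return the Luhn (mod 10) check digit for a string of digits."""
    total = 0
    # Walk the payload right to left and double every other digit, starting
    # with the rightmost one (the check digit will sit to its right).
    for position, char in enumerate(reversed(payload)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return str((10 - total % 10) % 10)


def is_valid(field: str) -> bool:
    """Check a captured field whose last character is its check digit."""
    if not field.isdigit() or len(field) < 2:
        return False
    return luhn_check_digit(field[:-1]) == field[-1]
```

A field that was misread in a single digit position fails `is_valid`, so it can be routed to repair instead of being accepted silently.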
Collection of Mail: A service offered by some outsource vendors where mail
is received on behalf of a customer in a PO box and routed directly to
the outsource vendor.
Confidence Factors: Scores used by recognition engines to estimate the
likelihood that an answer is accurate.
D
Data Color:
Refers to the color of the data that must be extracted and converted.
Carbonless paper can often produce a very faint image.
Data Prep: A term covering one or all of the following manual actions:
the opening of envelopes, unfolding of paper, removal of staples, repair
of tears.
Double Feed:
The feeding of two sheets of paper at once. Sometimes on roller-based
scanners this can occur so cleanly that it cannot be detected.
DPI:
Dots per Inch. A measurement of the resolution of the scanned image. Normally
200 dpi is adequate to represent a mainly textual document. Much OCR works
better at 300 dpi, but this does NOT mean that it works even better
at even higher resolutions; depending on the algorithms, it can work
less well.
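To make the resolution trade-off concrete, the following sketch computes the pixel dimensions and uncompressed size of a scanned page. The 8.5 x 11 inch page, the 1 bit per pixel (bitonal) depth, and the function name are assumptions chosen for illustration; real systems normally compress the image.

```python
def scanned_image_size(width_in, height_in, dpi, bits_per_pixel=1):
    """Return (pixel_width, pixel_height, uncompressed_bytes) for a page."""
    pixel_width = round(width_in * dpi)
    pixel_height = round(height_in * dpi)
    total_bits = pixel_width * pixel_height * bits_per_pixel
    # Round up to whole bytes.
    return pixel_width, pixel_height, (total_bits + 7) // 8


# An 8.5 x 11 inch page, bitonal:
print(scanned_image_size(8.5, 11, 200))  # (1700, 2200, 467500)
print(scanned_image_size(8.5, 11, 300))  # (2550, 3300, 1051875)
```

Going from 200 to 300 dpi multiplies the pixel count by 2.25, which is why resolution is usually raised only as far as recognition actually benefits.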
Drop-Out Ink: Inks that are not visible to the light spectrum of the
scanner. Can be either pastels, particularly in the yellow/green range,
or specific color inks that match the color of the light source. New color
scanners often include the ability to remove, or drop out, specific colors.
Users want to drop out background colors in order to capture the foreground
information so that OCR or some other recognition can be applied to it.
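The software form of color drop-out can be sketched roughly as follows. This is a simplified illustration, not any scanner's actual algorithm: the color-distance measure, the tolerance, and the darkness threshold are all assumed values.

```python
def drop_out(pixels, drop_color, tolerance=60, dark_threshold=128):
    """Convert RGB pixels to bitonal (0 = black, 1 = white), dropping one color.

    `pixels` is a list of (r, g, b) tuples; `drop_color` is the form ink to hide.
    """
    result = []
    for r, g, b in pixels:
        distance = (abs(r - drop_color[0])
                    + abs(g - drop_color[1])
                    + abs(b - drop_color[2]))
        if distance <= tolerance:
            result.append(1)  # close to the drop-out ink: treat as blank paper
        else:
            brightness = (r + g + b) / 3
            result.append(0 if brightness < dark_threshold else 1)
    return result
```

With a red drop-out color, red form lines come out white while black handwriting stays black, leaving only the data for recognition.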
Duplex Scanning:
The ability to scan both sides of a piece of paper in one pass.
E
Edit Checks:
Refers to the validation of types of fields. For example, a field can be
numeric only, alphabetic only, or a specific pattern.
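The three kinds of edit check mentioned above might be expressed as patterns like these. The field names and the phone-number pattern are hypothetical examples, not part of any particular product.

```python
import re

# Hypothetical field rules: each field name maps to the pattern it must match.
EDIT_CHECKS = {
    "quantity": re.compile(r"\d+"),             # numeric only
    "last_name": re.compile(r"[A-Za-z]+"),      # alphabetic only
    "phone": re.compile(r"\d{3}-\d{3}-\d{4}"),  # specific pattern
}


def failed_edit_checks(record):
    """Return the names of fields that are missing or fail their edit check."""
    return [
        name
        for name, pattern in EDIT_CHECKS.items()
        if not pattern.fullmatch(record.get(name, ""))
    ]
```

A typical OCR confusion such as the letter O in place of a zero in `quantity` would be caught by the numeric-only rule.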
Endorser:
Usually provided with a programmable ink-jet. Provides a method of printing
on scanned documents to ensure that all the pages are scanned. Also provides
a method to find specific pages.
F
False Positives: A term used in OCR to denote those characters which the
conversion engine thought were wrong but were in fact correct. False
positives tend to rise if the engine accuracy requirements are set too
high (see also substitutions).
Fire Damage: Causes charring of paper and can cause degradation or
destruction of the image. The image can sometimes be reconstructed
electronically. Can also make paper very brittle, which means that a
straight-through scanner should be used to create the images.
Fire Protection:
In paper-intensive environments, dry extinguishers should be installed at
the outsource vendor. Standard sprinklers cause paper damage (see water
damage).
Flatbed Scanners:
Scanners that contain an autofeeder and a piece of glass where the paper
can be placed and scanned. Can be useful for certain non-standard papers,
but are slow and not good for production scanning (see transport).
Form Colors:
Normally refers to the overall color of the form which can have an impact
on image quality. For example a black or blue image placed on a dark pink
or red background will not provide adequate contrast on a black and white
scanner. Form colors can also refer to the color of the background form
(see drop-out ink), or to the color of the data image (see data color).
Form Redesign:
Refers to the ability to improve the automated processing of the form
through redesigning. Should be carried out in conjunction with the service
provider.
I
ICR:
Literally, Intelligent Character Recognition. Initially used as a term
to differentiate Kurzweil's OCR from other vendors' products. Has recently
come to mean hand print recognition. Usually related to neural net
technologies; can also be used to identify marks such as check-off boxes
or stylized pattern fonts such as OCR-A, OCR-B or MICR.
IDR:
A term used to denote intelligent document recognition. Usually relies
on full-text OCR of a document, the results of which are then used to analyze
the content of the document and extract relevant fields of information.
L
Levels of Index:
Documents may be filed by cabinet, file, and folder. This represents a
three-level index (see also batching).
M
Mainframe:
Often needed to provide validation tables, which may be downloaded. The
service provider must be able to provide data and images in a readable
format on acceptable media.
Microfilming: Refers to the ability to capture images on microfilm
concurrently with digital media. Can be useful for human-readable archival
data.
Missed Scan: See double feed.
O
OCR:
Optical Character Recognition. A method of using pattern recognition of
images of characters to create computer-readable data. Some OCR software
works better than others on certain types of data.
Off-Shore:
The ability to send images or paper for manually intensive key entry to
low-cost locations. Historically these were located in the Caribbean, as
it was easy to fly documents there and the time zone is the same as for
the East Coast. Now, though, with the advent of low-cost communications,
off-shore service bureaus are springing up in India, Sri Lanka, China,
the Philippines, Mauritius, Zimbabwe and other English-speaking locations.
OMR:
Optical Mark Recognition. Sometimes called mark sense. Conversion of check-off
marks to meaningful data. Simple and accurate way to capture survey type
information automatically from people.
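One simple way to turn a check-off mark into data is to measure how much of the box is dark. The sketch below assumes the box has already been located and cropped to a bitonal grid; the 20% fill threshold and the function name are illustrative assumptions.

```python
def is_marked(box, fill_threshold=0.2):
    """Decide whether a check-off box is marked.

    `box` is a list of rows of bitonal pixels (1 = black, 0 = white),
    cropped to the inside of the box.
    """
    total = sum(len(row) for row in box)
    dark = sum(sum(row) for row in box)
    return total > 0 and dark / total >= fill_threshold
```

The threshold keeps stray specks from counting as marks while still accepting a light tick or cross.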
Overhead Scanners: Similar to planetary microfilm cameras, these scan a
page placed on a platen (see also book scanning).
P
Paper Size:
Varies from business card size to 11x17 inches in business documents.
Paper Weight: Varies from 9lb onionskin to 120lb cardstock.
Post Office Boxes: Can be used to speed the input and collection of data
(see collection of mail).
R
Reflectance:
Refers to how much the ink and background paper reflect the light within
the scanner. Affects the quality of image.
Repair:
Refers to the manual keyboard correction of characters wrongly converted
by OCR or ICR.
S
Scanning Paper:
The conversion of a page to a digital representation. Normally a page
is sampled at 200x200 or 300x300 dots per inch (dpi).
Schema:
The defined layout for a specific business document using XML syntax.
Set-Up:
The process of creating a new job.
Skew:
The angling of the paper, which can cause failure of OCR. Some scanners
will angle small paper badly.
Substitutions:
Traditionally the most expensive errors to correct. Consists of those
characters that a recognition engine is convinced it got right but that
are in fact wrong. High levels of 'accuracy' reported by an OCR engine
can mean that there are many substitutions. The alternative is to set
tolerances very high. The engine will then often report low accuracy,
but many correctly interpreted characters may be labelled wrong (see
also false positives).
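The trade-off between substitutions and false positives can be illustrated with a confidence threshold. The sample readings and their confidence values below are invented for illustration; real engines report confidence in their own ways.

```python
def tally(results, threshold):
    """Count outcomes for (recognized, actual, confidence) triples."""
    counts = {"accepted_correct": 0, "substitutions": 0,
              "false_positives": 0, "rejected_wrong": 0}
    for recognized, actual, confidence in results:
        accepted = confidence >= threshold
        correct = recognized == actual
        if accepted and correct:
            counts["accepted_correct"] += 1
        elif accepted:
            counts["substitutions"] += 1    # wrong, but the engine is sure
        elif correct:
            counts["false_positives"] += 1  # right, but flagged for repair
        else:
            counts["rejected_wrong"] += 1   # wrong, and flagged for repair
    return counts


# Hypothetical readings: (what the engine read, what was on the page, confidence)
SAMPLE = [("A", "A", 0.99), ("8", "B", 0.90), ("C", "C", 0.70),
          ("0", "O", 0.40), ("E", "E", 0.85)]
```

On this sample, a threshold of 0.5 lets one substitution through with no false positives, while a threshold of 0.95 removes the substitution but flags two correct characters for repair.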
T
Transport:
The method by which the paper is moved past the digitizing scanner. Affects
speed of throughput and types of paper.
Tumble Printed: Refers to those double-sided papers that get turned over
from top to bottom. Requires the duplex scanner to rotate the image of
the reverse side 180 degrees.
V
Validation:
Performed against totals or against downloaded tables to ensure accuracy
of data (see mainframe and also edit checks).
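Both kinds of validation can be sketched together: a lookup against a downloaded table and a comparison with a control total. The record layout, field names, and account numbers are hypothetical.

```python
from decimal import Decimal


def validate_batch(records, control_total, valid_accounts):
    """Return a list of problems found in a batch of captured records."""
    problems = []
    # Table check: every captured account must exist in the downloaded table.
    for index, record in enumerate(records):
        if record["account"] not in valid_accounts:
            problems.append(f"record {index}: unknown account {record['account']}")
    # Totals check: captured amounts must add up to the batch control total.
    captured_total = sum((Decimal(r["amount"]) for r in records), Decimal("0"))
    if captured_total != Decimal(control_total):
        problems.append(
            f"batch total {captured_total} does not match control total {control_total}"
        )
    return problems
```

`Decimal` is used rather than floating point so that currency amounts compare exactly.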
Verification:
The only proven way to ensure 100% data accuracy as opposed to 99.x%.
Requires the rekeying of all data by a separate party.
Voting:
A method of improving recognition through the use of multiple recognition
engines voting on the result. Voting can be internal or external. Internal
voting tends to be preferable, as the engines have reference to the internal
confidence factors.
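A minimal sketch of external voting follows, in which only each engine's final characters are compared and no internal confidence factors are available. The function name, the agreement rule, and the '?' reject marker are assumptions.

```python
from collections import Counter


def vote(readings, min_agreement=2):
    """Combine per-character readings from several engines by majority vote.

    `readings` holds one string per engine, all of the same length.
    Positions without enough agreement are returned as '?' for manual repair.
    """
    result = []
    for candidates in zip(*readings):
        char, count = Counter(candidates).most_common(1)[0]
        result.append(char if count >= min_agreement else "?")
    return "".join(result)


print(vote(["INV0ICE", "INVOICE", "INVOICF"]))  # INVOICE
```

Each engine makes a different mistake here, so the majority recovers the correct word; where all engines disagree, the position is flagged rather than guessed.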
W
Water Damage:
Causes images, particularly if handwritten, to bleed. The image can be
recovered with sophisticated image processing.
X
XML:
eXtensible Markup Language. Provides content and structure for B2B-based
forms by allowing fields and structures to be tagged and layout to
be enforced.
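As a small illustration of tagging captured fields, the sketch below builds an XML document with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`form`, `field`, `name`) are invented for the example and do not follow any particular schema.

```python
import xml.etree.ElementTree as ET


def form_to_xml(form_name, fields):
    """Wrap captured field values in tagged XML elements."""
    root = ET.Element("form", name=form_name)
    for name, value in fields.items():
        field = ET.SubElement(root, "field", name=name)
        field.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = form_to_xml("claim", {"policy": "12345", "surname": "Smith"})
# The receiving system can read a field back by its tag rather than its position.
policy = ET.fromstring(xml_text).find("field[@name='policy']").text
```

Because each value carries its own tag, the receiver does not depend on the order or position of fields on the original form.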
XSL:
eXtensible Stylesheet Language. Defines the styles associated with XML files.
XSLT: eXtensible Stylesheet Language Transformations. Allows XML-formatted
documents to be automatically translated and reformatted.