--ocr

The ocr option generates a hidden text layer using the third-party OCR engine that ships with Document Express Enterprise.

Use the ocr option with the djvujoin or djvubundle command to perform OCR on a single-page or multi-page DjVu® document. For example:

djvujoin --ocr singlepage.djvu newsinglepage.djvu

djvubundle --ocr multipage.djvu newmultipage.djvu

 

Options Taken by the ocr Parameter (a Comma-Separated String)

char – Controls the lowest level of the text. By default, the text layer is grouped by words (as identified by the OCR engine). If this option is selected, the text layer is grouped by characters. This option is necessary for many languages, such as the CJKV languages (Chinese, Japanese, Korean, Vietnamese).

nosep – By default, separators are inserted between syntactic elements (as in English). If this option is selected, no separators are inserted. Separators affect text searches, because the separators (or their absence) in the search pattern must match those in the text layer. This option is necessary for CJKV languages.

lang – Language(s) to be recognized. The default is English. This option can be written in two forms. To specify a single language:

lang=Japanese

If multiple languages are to be recognized, write the string in the following form:

lang=(Japanese,English)

Each language is specified either by its name or by its number from the list of #defines below.

mixed – Enables the mixed Asian-Latin reading mode (see the IRIS documentation).

NOTE: The --ocr parameter cannot have embedded spaces, even if the string is enclosed in quotation marks. This appears to be a Windows OS restriction.
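The rules above can be sketched as a small helper that assembles the --ocr argument. This is only an illustration: the function name build_ocr_option is hypothetical, and joining char, nosep, mixed, and lang with commas in this order is an assumption based on the "Comma-Separated String" heading, not something this page confirms in detail.

```python
def build_ocr_option(langs=None, char=False, nosep=False, mixed=False):
    """Assemble a single --ocr argument (hypothetical helper, not part of the product)."""
    parts = []
    if char:
        parts.append("char")
    if nosep:
        parts.append("nosep")
    if mixed:
        parts.append("mixed")
    if langs:
        names = [str(lang) for lang in langs]
        if len(names) == 1:
            # Single-language form: lang=Japanese
            parts.append("lang=" + names[0])
        else:
            # Multiple-language form: lang=(Japanese,English)
            parts.append("lang=(" + ",".join(names) + ")")
    if not parts:
        # No sub-options: plain --ocr, as in the djvujoin example.
        return "--ocr"
    value = ",".join(parts)  # assumed: sub-options are comma-separated
    # The parameter cannot contain embedded spaces, even when quoted.
    if any(ch.isspace() for ch in value):
        raise ValueError("the --ocr parameter cannot contain spaces: %r" % value)
    return "--ocr=" + value


if __name__ == "__main__":
    print(build_ocr_option(langs=["Japanese"], char=True, nosep=True))
```

Rejecting whitespace up front mirrors the restriction in the NOTE, so a malformed option string is caught before the command is run.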

 

Foreign Character Sets

To have the IRIS OCR module recognize the character set of a language other than English, specify that language with the lang option. The following are the valid language names and their numbers:

#define AMERICAN 1

#define ENGLISH 1

#define GERMAN 2

#define FRENCH 3

#define SPANISH 4

#define ITALIAN 5

#define BRITISH 6

#define SWEDISH 7

#define DANISH 8

#define NORWEGIAN 9

#define  DUTCH 10

#define  PORTUGUESE 11

#define  BRAZILIAN 12

#define  GALICIAN 13

#define ICELANDIC 15

#define GREEK 17

#define CZECH 18

#define HUNGARIAN 19

#define POLISH 20

#define ROMANIAN 21

#define SLOVAK 22

#define CROATIAN 23

#define SERBIAN 24

#define SLOVENIAN 25

#define LUXEMB 28

#define FINNISH 29

#define TURKISH 30

#define LATIN 31

#define RUSSIAN 32

#define BYELORUSSIAN 33

#define UKRAINIAN 34

#define MACEDONIAN 35

#define BULGARIAN 36

#define JAPANESE 37

#define ESTONIAN 38

#define LITHUANIAN 39

#define LATVIAN 40

#define AFRIKAANS 41

#define ALBANIAN 42

#define CATALAN 43

#define IRISH_GAELIC 44

#define SCOTTISH_GAELIC 45

#define BASQUE 46

#define BRETON 47

#define CORSE 48

#define FRISIAN 49

#define NYNORSK 50

#define INDONESIAN 51

#define MALAY 52

#define SWAHILI 53

#define TAGALOG 54

#define KOREAN 55

#define SCHINESE 56

#define TCHINESE 57

#define QUECHA 59

#define AYMARA 60

#define FAROESE 61

#define FRIULIAN 62

#define GREENLANDIC 63

#define HAITIAN_CREOLE 65

#define RHAETO_ROMAN 66

#define SARDINIAN 67

#define KURDISH 68

#define CEBUANO 69

#define BEMBA 105

#define CHAMORRO 106

#define FIJAN 108

#define GANDA 109

#define HANI 110

#define IDO 111

#define INTERLINGUA 112

#define KICONGO 113

#define KINYARWANDA 114

#define MALAGASY 115

#define MAORI 117

#define MAYAN 118

#define MINANGKABAU 119

#define NAHUATL 120

#define NYANJA 121

#define RUNDI 123

#define SAMOAN 124

#define SHONA 125

#define SOMALI 126

#define SOTHO 127

#define SUNDANESE 128

#define TAHITIAN 129

#define TONGA 130

#define TSWANA 131

#define WOLOF 133

#define XHOSA 134

#define ZAPOTECO 135

#define JAVANESE 139

#define PIDGIN_NIGERIA 142

#define OCCITAN 143

#define MANX 144

#define TOK_PISIN 145

#define BISLAMA 146

#define HILIGAYNON 147

#define KAPAMPANGAN 149

#define BALINESE 150

#define BIKOL 151

#define ILOCANO 152

#define MADURESE 153

#define WARAY 154

#define SERBIAN_LATIN 155

For example:

djvubundle --ocr=lang=German mypages*.djvu mybundledfilewithocr.djvu

This instructs IRIS to recognize German characters. Several languages can be specified at one time:

djvubundle --ocr=lang=(German,French) mypages*.djvu mybundledfilewithocr.djvu
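Because a language may be given either by name or by its number from the #define list, a lookup like the following can translate names into numbers. It is a minimal sketch: LANGUAGE_CODES holds only a few entries copied from the list above, the names language_number and lang_value are hypothetical, and the case-insensitive matching is a convenience of this helper rather than documented behavior of the OCR module.

```python
# A small subset of the #define list above; extend as needed.
LANGUAGE_CODES = {
    "ENGLISH": 1,
    "GERMAN": 2,
    "FRENCH": 3,
    "JAPANESE": 37,
    "KOREAN": 55,
    "SCHINESE": 56,
    "TCHINESE": 57,
}


def language_number(name):
    """Return the number for a language name (hypothetical helper)."""
    key = name.strip().upper()
    if key not in LANGUAGE_CODES:
        raise ValueError("unknown language: %r" % name)
    return LANGUAGE_CODES[key]


def lang_value(names):
    """Build the lang=... string using numbers instead of names."""
    if not names:
        raise ValueError("at least one language is required")
    numbers = [str(language_number(n)) for n in names]
    if len(numbers) == 1:
        return "lang=" + numbers[0]
    return "lang=(" + ",".join(numbers) + ")"


if __name__ == "__main__":
    print(lang_value(["German", "French"]))
```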

 

Supporting commands:

djvubundle, djvujoin