Train Tesseract - Tập huấn Tesseract

Vietnamese Optical Character Recognition

Moderator: quân

Re: Train Tesseract - Tập huấn Tesseract

Postby notradamus » Sun Jun 24, 2012 8:38 pm

Hello again,

This is the closest I have come to creating my first box file with Tesseract 3.01 on Windows XP, working from the DOS command window.
I copied the deu-frak language data file and unzipped it into the tessdata folder, as you suggested.
I then used a text editor to write a .bat file that runs the makebox command.

@ECHO OFF
cd c:\Program Files\Tesseract-OCR\
tesseract deu-frak.kleist.exp0.tif deu-frak.kleist.exp0 -l deu-frak batch.nochop makebox

Incidentally, the TIFF file is kept in the same folder:
c:\Program Files\Tesseract-OCR\

When I run the batch file I get the following error:
Cannot open input file : deu-frak.kleist.exp0.tif

What am I doing wrong?

kind regards

Richard
Attachment: deu_frak_error.tif (81.85 KiB)
Error output obtained when trying to create the box file from the Windows XP DOS command prompt
notradamus
 
Posts: 29
Joined: Fri May 25, 2012 10:07 am

Re: Train Tesseract - Tập huấn Tesseract

Postby notradamus » Sun Jun 24, 2012 8:56 pm

I renamed the TIFF file to deu-frak.kleist.exp0.tif, typing the ".tif" suffix explicitly as part of the file name.

I then re-ran the batch file, deu_frak.bat:

@ECHO OFF
cd c:\Program Files\Tesseract-OCR\
tesseract deu-frak.kleist.exp0.tif deu-frak.kleist.exp0 -l deu-frak batch.nochop makebox

And what do you know, it actually worked.

I now have a corresponding box file called "deu-frak.kleist.exp0.box" in my c:\Program Files\Tesseract-OCR\ folder.
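
For anyone who hits the same "Cannot open input file" error, here is a minimal Python sketch of a check that the name passed to tesseract really exists in the working folder. The function name check_input_image and the suffix-less file in the demo are hypothetical, chosen only to illustrate a name mismatch; the thread does not say what the file was called before it was renamed.

```python
import os
import tempfile


def check_input_image(folder, expected_name):
    """Report whether expected_name exists in folder.

    Returns (found, near_misses), where near_misses lists other files whose
    names begin with the same stem and may be the intended image.
    """
    names = os.listdir(folder)
    # Windows file names are case-insensitive, so compare in lower case.
    wanted = expected_name.lower()
    found = any(name.lower() == wanted for name in names)
    stem = os.path.splitext(wanted)[0]
    near_misses = sorted(
        name for name in names
        if name.lower().startswith(stem) and name.lower() != wanted
    )
    return found, near_misses


def demo():
    # Hypothetical situation: the image was saved without its ".tif" suffix.
    with tempfile.TemporaryDirectory() as folder:
        old_path = os.path.join(folder, "deu-frak.kleist.exp0")
        new_path = os.path.join(folder, "deu-frak.kleist.exp0.tif")
        open(old_path, "wb").close()
        print(check_input_image(folder, "deu-frak.kleist.exp0.tif"))
        # Rename the file so that the suffix is part of its name.
        os.rename(old_path, new_path)
        print(check_input_image(folder, "deu-frak.kleist.exp0.tif"))


if __name__ == "__main__":
    demo()
```

If the first value returned is False, the second value shows similarly named files that may need renaming before the batch file is run again.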

Also, how do I upload all three files here for others to see?

Thanks for your patience.
Last edited by notradamus on Tue Jun 26, 2012 3:03 am, edited 1 time in total.

Re: Train Tesseract - Tập huấn Tesseract

Postby notradamus » Sun Jun 24, 2012 9:12 pm

I re-ran the makebox command on a copy of the original TIFF file resized to about one quarter of the original size. It still worked, but the accuracy was much poorer.

Out of curiosity, what is the purpose of the "-l" switch in the middle of the command below?
tesseract deu-frak.kleist.exp0.tif deu-frak.kleist.exp0 -l deu-frak batch.nochop makebox

My summary of the command is:
[file name of TIFF file, including suffix] [base name of box file] -l [name of the lang.traineddata file, without ".traineddata"] batch.nochop makebox

This answers my own question: "-l" tells Tesseract to use the language traineddata file that best matches the script in the TIFF file.
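
The summary above can be sketched in Python as a small helper that assembles the same command from its two variable parts. This is only an illustration: makebox_command is a hypothetical name, and the restriction to ".tif"/".tiff" names is an assumption made because this thread only uses TIFF images.

```python
import os


def makebox_command(image_name, lang):
    """Build the argument list for the makebox step described above.

    image_name must include its suffix; the box file base name is the
    image name with that suffix removed.
    """
    base_name, suffix = os.path.splitext(image_name)
    # Assumption: only TIFF input is accepted, as in this thread.
    if suffix.lower() not in (".tif", ".tiff"):
        raise ValueError("image name must end in .tif or .tiff: %r" % image_name)
    return ["tesseract", image_name, base_name, "-l", lang,
            "batch.nochop", "makebox"]


if __name__ == "__main__":
    command = makebox_command("deu-frak.kleist.exp0.tif", "deu-frak")
    # Prints the same command line as in the batch file.
    print(" ".join(command))
```

The resulting list could be handed to subprocess.run with cwd set to the folder that holds the image, assuming the tesseract executable can be found from there.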


Cheers

Richard
Last edited by notradamus on Mon Jun 25, 2012 6:20 am, edited 2 times in total.

Re: Train Tesseract - Tập huấn Tesseract

Postby quân » Sun Jun 24, 2012 9:58 pm

The -l option takes advantage of existing language data to produce a box file with fewer errors. Otherwise, the box file would contain more errors and you would have to spend more time and effort editing and correcting it. This is stated in the Tesseract Wiki.
quân
 
Posts: 236
Joined: Sat Nov 16, 2002 1:51 am
Location: Oxnard, CA - USA

Re: Train Tesseract - Tập huấn Tesseract

Postby notradamus » Mon Jun 25, 2012 7:11 am

Tesseract (software), from Wikipedia, the free encyclopedia:
http://en.wikipedia.org/wiki/Tesseract_%28software%29

I am now using jTessBoxEditor to edit box files (4 box files minimum, 32 box files maximum):
http://vietocr.sourceforge.net/training.html

jTessBoxEditor can be downloaded here:
http://sourceforge.net/projects/vietocr/files/jTessBoxEditor/jTessBoxEditor-0.7.zip/download

Re: Train Tesseract - Tập huấn Tesseract

Postby notradamus » Tue Jun 26, 2012 6:33 am

I have noticed that the data in my newly created unicharset file does not look the same as in other unicharset files. Is this difference going to prevent the creation of the deu-frak language traineddata file?

My deu-frak.unicharset is shown immediately below, followed by a dan.unicharset for comparison.
Why do the two unicharset files differ?

deu-frak.unicharset

78
NULL 0 NULL 0
H 5 0,255,0,255 NULL 4 # H [48 ]A
L 5 0,255,0,255 NULL 7 # L [4c ]A
t 3 0,255,0,255 NULL 39 # t [74 ]a
h 3 0,255,0,255 NULL 1 # h [68 ]a
e 3 0,255,0,255 NULL 25 # e [65 ]a
o 3 0,255,0,255 NULL 66 # o [6f ]a
l 3 0,255,0,255 NULL 2 # l [6c ]a
g 3 0,255,0,255 NULL 29 # g [67 ]a
i 3 0,255,0,255 NULL 13 # i [69 ]a
s 3 0,255,0,255 NULL 37 # s [73 ]a
ch 3 0,255,0,255 NULL 49 # ch [63 68 ]a
n 3 0,255,0,255 NULL 56 # n [6e ]a
I 5 0,255,0,255 NULL 9 # I [49 ]A
a 3 0,255,0,255 NULL 50 # a [61 ]a
. 10 0,255,0,255 NULL 15 # . [2e ]p
W 5 0,255,0,255 NULL 20 # W [57 ]A
u 3 0,255,0,255 NULL 42 # u [75 ]a
d 3 0,255,0,255 NULL 43 # d [64 ]a
r 3 0,255,0,255 NULL 64 # r [72 ]a
w 3 0,255,0,255 NULL 16 # w [77 ]a
ö 3 0,255,0,255 NULL -1 # ö [f6 ]a
f 3 0,255,0,255 NULL 72 # f [66 ]a
b 3 0,255,0,255 NULL 33 # b [62 ]a
ä 3 0,255,0,255 NULL -1 # ä [e4 ]a
E 5 0,255,0,255 NULL 5 # E [45 ]A
9 8 0,255,0,255 NULL 26 # 9 [39 ]0
" 10 0,255,0,255 NULL 27 # " [22 ]p
m 3 0,255,0,255 NULL 41 # m [6d ]a
G 5 0,255,0,255 NULL 8 # G [47 ]A
~ 10 0,255,0,255 NULL 30 # ~ [7e ]p
, 10 0,255,0,255 NULL 31 # , [2c ]p
v 3 0,255,0,255 NULL 61 # v [76 ]a
B 5 0,255,0,255 NULL 23 # B [42 ]A
ll 3 0,255,0,255 NULL 2 # ll [6c 6c ]a
y 3 0,255,0,255 NULL -1 # y [79 ]a
z 3 0,255,0,255 NULL 54 # z [7a ]a
S 5 0,255,0,255 NULL 10 # S [53 ]A
ß 3 0,255,0,255 NULL 38 # ß [df ]a
T 5 0,255,0,255 NULL 3 # T [54 ]A
k 3 0,255,0,255 NULL 69 # k [6b ]a
M 5 0,255,0,255 NULL 28 # M [4d ]A
U 5 0,255,0,255 NULL 17 # U [55 ]A
D 5 0,255,0,255 NULL 18 # D [44 ]A
ü 3 0,255,0,255 NULL -1 # ü [fc ]a
p 3 0,255,0,255 NULL 48 # p [70 ]a
) 10 0,255,0,255 NULL 46 # ) [29 ]p
3 8 0,255,0,255 NULL 47 # 3 [33 ]0
P 5 0,255,0,255 NULL 45 # P [50 ]A
C 5 0,255,0,255 NULL 67 # C [43 ]A
A 5 0,255,0,255 NULL 14 # A [41 ]A
J 5 0,255,0,255 NULL 74 # J [4a ]A
1 8 0,255,0,255 NULL 52 # 1 [31 ]0
0 8 0,255,0,255 NULL 53 # 0 [30 ]0
Z 5 0,255,0,255 NULL 36 # Z [5a ]A
Q 5 0,255,0,255 NULL -1 # Q [51 ]A
N 5 0,255,0,255 NULL 12 # N [4e ]A
ck 3 0,255,0,255 NULL 49 # ck [63 6b ]a
- 10 0,255,0,255 NULL 58 # - [2d ]p
( 10 0,255,0,255 NULL 59 # ( [28 ]p
tz 3 0,255,0,255 NULL 39 # tz [74 7a ]a
V 5 0,255,0,255 NULL 32 # V [56 ]A
si 3 0,255,0,255 NULL 37 # si [73 69 ]a
? 10 0,255,0,255 NULL 63 # ? [3f ]p
R 5 0,255,0,255 NULL 19 # R [52 ]A
4 8 0,255,0,255 NULL 65 # 4 [34 ]0
O 5 0,255,0,255 NULL 6 # O [4f ]A
c 3 0,255,0,255 NULL 49 # c [63 ]a
5 8 0,255,0,255 NULL 68 # 5 [35 ]0
K 5 0,255,0,255 NULL 40 # K [4b ]A
ss 3 0,255,0,255 NULL 37 # ss [73 73 ]a
6 8 0,255,0,255 NULL 71 # 6 [36 ]0
F 5 0,255,0,255 NULL 22 # F [46 ]A
ri 3 0,255,0,255 NULL 64 # ri [72 69 ]a
j 3 0,255,0,255 NULL 51 # j [6a ]a
å 3 0,255,0,255 NULL -1 # å [e5 ]a
2 8 0,255,0,255 NULL 76 # 2 [32 ]0
« 10 0,255,0,255 NULL 77 # « [ab ]p


dan.unicharset

132
NULL 0 Common 0
l 3 Latin 64
i 3 Latin 97
s 3 Latin 9
t 3 Latin 53
e 3 Latin 79
b 3 Latin 71
. 10 Common 7
a 3 Latin 86
S 5 Latin 3
å 3 Latin 121
d 3 Latin 70
r 3 Latin 23
n 3 Latin 110
y 3 Latin 49
< 0 Common 15
g 3 Latin 29
h 3 Latin 66
v 3 Latin 122
k 3 Latin 45
o 3 Latin 59
F 5 Latin 36
ð 3 Latin 22
R 5 Latin 12
$ 0 Common 24
* 10 Common 25
Æ 5 Latin 63
m 3 Latin 32
j 3 Latin 85
G 5 Latin 16
6 8 Common 30
5 8 Common 31
M 5 Latin 27
1 8 Common 33
- 10 Common 34
3 8 Common 35
f 3 Latin 21
ø 3 Latin 129
q 3 Latin 90
4 8 Common 39
[ 10 Common 40
: 10 Common 41
) 10 Common 42
] 10 Common 43
( 10 Common 44
K 5 Latin 19
u 3 Latin 81
2 8 Common 47
0 8 Common 48
Y 5 Latin 14
p 3 Latin 69
â 3 Latin 51
ä 3 Latin 76
T 5 Latin 4
, 10 Common 54
@ 10 Common 55
_ 10 Common 56
W 5 Latin 105
ö 3 Latin 123
O 5 Latin 20
è 3 Latin 60
> 0 Common 61
~ 0 Common 62
æ 3 Latin 26
L 5 Latin 1
° 0 Common 65
H 5 Latin 17
« 10 Common 67
8 8 Common 68
P 5 Latin 50
D 5 Latin 11
B 5 Latin 6
+ 0 Common 72
C 5 Latin 80
ó 3 Latin 74
{ 10 Common 75
Ä 5 Latin 52
£ 0 Common 77
' 10 Common 78
E 5 Latin 5
c 3 Latin 73
U 5 Latin 46
ë 3 Latin 82
& 10 Common 83
/ 10 Common 84
J 5 Latin 28
A 5 Latin 8
7 8 Common 87
= 0 Common 88
ñ 3 Latin 89
Q 5 Latin 38
X 5 Latin 118
fl 3 Latin 92
\ 10 Common 93
” 10 Common 94
® 0 Common 95
# 10 Common 96
I 5 Latin 2
à 5 Latin 98
? 10 Common 99
é 3 Latin 116
ü 3 Latin 101
» 10 Common 102
á 3 Latin 103
z 3 Latin 106
w 3 Latin 57
Z 5 Latin 104
ç 3 Latin 107
! 10 Common 108
í 3 Latin 109
N 5 Latin 13
; 10 Common 111
} 10 Common 112
` 0 Common 113
" 10 Common 114
“ 10 Common 115
É 5 Latin 100
fi 3 Latin 117
x 3 Latin 91
% 10 Common 119
9 8 Common 120
Å 5 Latin 10
V 5 Latin 18
Ö 5 Latin 58
§ 0 Common 124
© 0 Common 125
| 0 Common 126
^ 0 Common 127
ê 3 Latin 128
Ø 5 Latin 37
à 3 Latin 130
„ 10 Common 131
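
To compare the two listings above line by line, here is a minimal Python sketch that reads an entry in either layout and returns the fields both have in common. It reflects only a reading of the two listings, not the Tesseract documentation: the function name parse_unicharset_line is hypothetical, the properties column is assumed to be hexadecimal, and the meaning of the extra "0,255,0,255" column is not interpreted.

```python
def parse_unicharset_line(line):
    """Parse one unicharset entry in either of the two layouts shown above.

    Returns a dict with the character, its property bits, the script name
    and the id in the last numeric column. Anything after those fields
    (the "# ..." comment in the longer layout) is ignored.
    """
    tokens = line.split()
    unichar = tokens[0]
    # Assumption: the properties column is hexadecimal, so "10" means 0x10.
    properties = int(tokens[1], 16)
    if "," in tokens[2]:
        # Longer layout: an extra comma-separated column precedes the script.
        extra, script, other_id = tokens[2], tokens[3], int(tokens[4])
    else:
        # Shorter layout: the script follows the properties directly.
        extra, script, other_id = None, tokens[2], int(tokens[3])
    return {
        "unichar": unichar,
        "properties": properties,
        "extra": extra,
        "script": script,
        "other_id": other_id,
    }


if __name__ == "__main__":
    mine = parse_unicharset_line("H 5 0,255,0,255 NULL 4 # H [48 ]A")
    dan = parse_unicharset_line("H 5 Latin 17")
    # Show which of the shared fields differ between the two files.
    for key in ("unichar", "properties", "script", "other_id"):
        print(key, mine[key], dan[key], mine[key] == dan[key])
```

Read this way, the two entries for "H" agree on the character and its properties; they differ in the script column (NULL versus Latin), in the extra column, and in the trailing comment. Whether that affects the creation of the traineddata file is a question this sketch cannot answer.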

Re: Train Tesseract - Tập huấn Tesseract

Postby quân » Tue Jun 26, 2012 1:29 pm

I'm not really sure. Can you post this question on Tesseract Forum?
