OCR and italic

OliSub · Wed Feb 19, 2014 11:51 pm

Hello,

I would like to know if there is a way to keep the <i>italic</i> information when ripping/OCR-ing .idx to .srt. I waste a lot of time proofreading my subtitles with the DVD playing in parallel, just to put this lost information back into the file. It takes ages, even in fast-forward mode...

I'm on Linux, using the Avidemux mini-OCR for this task (then Gaupol and Aegisub for syncing and editing, but that's another story). I'm ready to switch to any other application if it means being able to keep the italics!

TIA
Olivier

Thu Feb 20, 2014 7:42 pm

Hi OliSub,

Please try SubExtractor (1032d). This program runs without problems under Wine on Linux!

NerdNo1 · Thu Feb 20, 2014 10:08 pm

Sadly I am not a Linux user, but I can tell you this:

I hardly ever use an automated way to OCR idx/sub files. I try to use the 'image compare' tool whenever possible.
A tool like Tesseract sometimes gives you odd errors, some of them not even detectable (and therefore uncorrectable) with error-correction tools or a spell-check.
Most such automated tools also can't handle italics. Italics are nice for people who don't hear well, such as my dad. Image compare handles them quite well (but not flawlessly).

If you can't find another program that can OCR using image compare, try to find one that can read and display the idx/sub files directly (not by reading through the DVD, which takes too long). Once you have the OCR'd text, use that program to sift quickly through the lines and find the italic ones; when you find one, edit it in the text (I have the same program open twice: one instance for going through the original .sub and one with the plain OCR'd text).
Italics are fairly easy to spot in the images, and if the program also shows the timings, the matching line is easy to find in your plain text file.

When I don't have the original DVD, I look for an .srt in a different language that does have the italics marked. I often use English .srt files to transfer the italics to my Dutch version. You should try to find tools that ease the process, like here. Before I found SubtitleEdit, I used a multitude of different programs simultaneously.

I just see that there is a linux version as well: http://www.nikse.dk/subtitleedit
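
The trick of borrowing italics from an .srt in another language can be partly automated by matching cues on their timings. Below is a minimal Python sketch of that idea, not something any of the tools mentioned here is known to do. It assumes both files are plain SubRip text whose timings roughly agree and that the reference file marks italics with <i> tags; the function names `parse_srt` and `italic_candidates` are made up for this example.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:02,000".
TIMING = re.compile(
    r"(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*(\d+):(\d+):(\d+)[,.](\d+)"
)


def parse_srt(text):
    """Return a list of (start_ms, end_ms, text) cues from SRT content."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            m = TIMING.search(line)
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
                start = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
                end = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
                # Everything after the timing line is the cue text.
                cues.append((start, end, "\n".join(lines[i + 1:])))
                break
    return cues


def italic_candidates(reference_srt, target_srt):
    """Pair each italic cue of the reference with the target cues it overlaps in time."""
    target = parse_srt(target_srt)
    pairs = []
    for start, end, text in parse_srt(reference_srt):
        if "<i>" not in text:
            continue
        overlapping = [t for (s, e, t) in target if s < end and start < e]
        pairs.append((text, overlapping))
    return pairs
```

Each pair holds an italic cue from the reference file and the cues of your own file that overlap it in time. The italics still have to be applied by hand, since word order differs between languages.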

If you decide to use SubtitleEdit: in 'image compare' mode I create a different folder for each subtitle type, give it a short name that describes what it looks like (color(s) or thickness, for instance) and store them. It pays off in the long run ;-) For some TV series I even created an empty folder with only the italic text, and periods for the rest; that way I could quickly track down where the italic parts were in the sub. Once I knew the locations, I used Tesseract for the OCR part (some fonts just can't be image compared...).

I'll check this thread again later or tomorrow.

Fri Feb 21, 2014 7:43 am

You can use SubRip for OCR. It works well under Wine on my Ubuntu system. http://www.videohelp.com/tools/Subrip
During the OCR you can mark each letter as italic or not, so you end up with an .srt that has the italics in the right places.

For further work and for my own fansubs I use SubtitleEditor, a nice tool under Linux: viewtopic.php?p=10539#p10539

By the way, a tip for SubRip: store the OCR'd letters in one big .sum file and load it before you start a new OCR. After a while the letter recognition needs almost no new input.

For further information look here: viewtopic.php?f=4&t=14123

NerdNo1 · Fri Feb 21, 2014 2:05 pm

I thought one big .sum file was the easiest way to go too, but for some fonts I have over 300 entries, and each time I do a different .sub I add new ones (the very difficult fonts). After 10-15 DVDs the db becomes too large and the OCR becomes slow (certainly on Windows, where SubtitleEdit creates .xml files); I'm pretty sure this will happen on Linux too. I therefore recommend creating a different .sum for each font type. You will find the same fonts on many different movies, especially if they're from the same translator/company.

Edit:
I have made a screenshot that shows how easily SE (SubtitleEdit) works. At the top right you see the original picture, at the bottom left the plain text. This particular font is difficult: the upper and lower lines are so close together that image compare sees both lines as one. Tesseract can OCR it, but this font produces odd errors (missing lines or line ends; it also sometimes 'sees' an 'l' instead of a 't') and Tesseract can't do italics correctly (don't use its italics, or after the OCR job is done select all lines, right-click and select 'Normal'). Since you can quickly step through the lines in this window, you can see the original image for each one. To set italics, either highlight the line and press CTRL+I, or select part of a line and press CTRL+I.

OliSub · Fri Feb 21, 2014 7:40 pm

Thank you all for your input, it is all very interesting!

I will test them one by one, but since it implies learning new programs, I'll do it gradually in my free time. At least I'm glad to see that there are solutions!

Meanwhile, I have found a method to do it quite nicely with the Avidemux mini-OCR. It's a silly way, but it works quite well.

Each time I'm asked for the meaning of an italic glyph, I enter its value surrounded by a special character. I chose the character § since I have never encountered it in a subtitle so far. So, for instance, the word Italic would be encoded as §I§§t§§a§§l§§i§§c§. It is then a simple matter of systematically applying some basic search-and-replace rules to retrieve the original text:
1 - §§ replaced with nothing; (§I§§t§§a§§l§§i§§c§ --> §Italic§)
2 - §([\\N|: ;,-]+)§ replaced with \1 (§T§§w§§o§\N§l§§i§§n§§e§§s§ --> §Two\Nlines§)
3 - §([^§]*)§ replaced with {\\i1}\1{\\i0} (§Italic§ --> {\i1}Italic{\i0})

It works quite well, and the formulas can be added to my sed file for automated replacements. The method may not be foolproof, but as such it is already a time-saver!
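
The three rules above can be sketched in Python as follows. This is only an illustration of the approach, not the script actually used. The function name `restore_italics` is made up, the hyphen sits last in the character class so that it is taken literally, and rule 3 matches [^§]* between the markers so that two italic runs on the same line are tagged separately.

```python
import re

MARK = "§"  # the marker typed around every italic glyph during OCR


def restore_italics(line):
    """Turn marker-wrapped glyphs into ASS-style italic spans."""
    # Rule 1: adjacent italic glyphs merge (§I§§t§ -> §It§).
    line = line.replace(MARK * 2, "")
    # Rule 2: separators caught between two italic runs join them
    # (§Two§\N§lines§ -> §Two\Nlines§).
    line = re.sub(r"§([\\N|: ;,-]+)§", r"\1", line)
    # Rule 3: wrap each remaining marked run in italic on/off tags.
    return re.sub(r"§([^§]*)§", r"{\\i1}\1{\\i0}", line)


print(restore_italics("§I§§t§§a§§l§§i§§c§"))
```

One known weak spot, inherited from rule 2: a lone italic "N" (or any italic run made only of separator characters) loses its markers, so the result still needs proofreading.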

As far as the OCR glyphs/letters file is concerned, I also store everything in one big file. Well, not so big yet: I'm a beginner, and I haven't noticed any slowdown so far. The glyph file in mini-OCR is binary, not XML, which might make it faster. NerdNo1 suggests splitting this file by font type, but are those fonts easily told apart? They all look almost the same to me...

PS: Les Saveurs du palais : nice movie ;-) !

Sat Feb 22, 2014 8:09 am

My "All-in-One.sum" is now 470 KB. It takes around 20 seconds to load into the program, and OCR-ing one new idx/sub takes me around 1-2 minutes.
If it is a difficult OCR with many letters still unknown to the program, it takes around 5-10 minutes at most.
Apart from that, creating different .sum files for different font types might be a good suggestion, but how do you tell the fonts apart in advance so that you load the right .sum into the program?
If it's not the right .sum, you need to interrupt the process and load another one, and that costs time too.
How did you solve this?

NerdNo1 · Sat Feb 22, 2014 1:19 pm

Well, first: I don't use your program, so my assumptions could be wrong. However, it stands to reason that a larger db takes longer to load. 470 KB might be a small file, but it means 470 x 1024 = 481,280 characters. Each line in this db needs to describe the image plus the symbol it represents, so let's say 60 characters per line plus the symbol it represents (the image file). But maybe the images are stored within the .sum, I don't know. Each symbol (image) the OCR software finds is checked against that list to find a match: the shorter the list, the quicker the result. There is also the question of whether the db and images are loaded into memory or read from the hard disk. I have to say that I only have an old laptop, not a recent i7 or something...
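
The size estimate above can be written out as a small calculation. This is a back-of-the-envelope Python sketch: the internal layout of a .sum file is not known here, the 60 bytes per entry is only the guess made above, and a plain linear search through the entries is assumed. Both function names are hypothetical.

```python
def estimate_entries(file_size_kb, bytes_per_entry):
    """Rough number of glyph entries in a database of the given size."""
    return (file_size_kb * 1024) // bytes_per_entry


def worst_case_comparisons(glyphs_in_subtitle, entries):
    """Upper bound on comparisons if every glyph is checked against every entry."""
    return glyphs_in_subtitle * entries


# 470 KB at an assumed 60 bytes per entry.
entries = estimate_entries(470, 60)
print(entries)
```

Under these assumptions the work grows in proportion to the number of stored entries, which is the argument for keeping one smaller db per font.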

A db listing would have a copy of each symbol in its compare list. If you have gone through, say, 25 DVDs, you'd probably have 7-8 different fonts (I have done something like 2000 and I guess I have about 100). I give each font a short description, like 'BlWh cc BlWh round 1'. This means the font is black-lined white, the custom colors are also black-lined white, and the letters/symbols are round. Variations are, for instance, BlGrGr, BlWh thick line, cc Bl or cc WhBl (you will find such variations), or simply the name of the translating company (usually at the beginning or end of a .sub) or of the (TV) series.

Granted: It takes me a little more time to find the right font entry but if you need the best OCR with italic, it pays off (I recognize most fonts now).

I have also found oddities using image compare: when the letters are too close together, a word ending in 'kt' becomes 'k', for instance, and 'fl' becomes 'fi'. Finding and deleting or editing such entries in a large db is tedious work, and you don't want to delete a complete db because of a few undetected/undetectable errors (another reason why I started using separate dbs).

kerremelk · Sun Feb 23, 2014 6:27 pm

I just read that someone here uses sed. Years ago I used this tool to 'glean' or 'filter' information from driver.INF files.
That's how I know that a scan/replace 'filter' is not at all easy to write.

Can you post these as 'code'?
This may help me and others.
Kind regards, kerremelk.

OliSub · Fri Feb 28, 2014 6:17 pm

Nothing sophisticated, but here it is:

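Code:

#!/bin/sed -rf
# Italics
s/§§//g
s/§ §/ /g
s/§([,.;:-]+)§/\1/g
# [^§]* rather than .* so that two italic runs on one line stay separate
s/§([^§]*)§/{\\i1}\1{\\i0}/g
# Quotation marks
s/''(.*)''/« \1 »/g
s/''/« /g
s/"(.*)"/« \1 »/g
s/"/« /g
# Characters
s/'/’/g
s/- /– /g
# The next rule replaces a stray character with œ. The post gives it as
# Unicode 005C, but the character itself is missing from the posted rule.
# It is commented out because an empty pattern makes sed reuse the previous
# regex; re-enter the character between the first two slashes to enable it.
#s//œ/g
s/lci/Ici/g
s/lls /Ils /g
# Punctuation: a space before ; : ? ! and then none between digits (timings)
s/([;:?!]+)/ \1/g
s/([0-9]) +:([0-9])/\1:\2/g
# Whitespace
s/ +/ /g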

It is oriented towards French typography, nothing fancy, but as such it is already a time-saver. I call this file rules.sed and use it like this:

Code:

sed -rf rules.sed subtitles.srt.ocr > subtitles.srt

When applied to the .srt.ocr file produced by the OCR, it makes 300+ tiny modifications. Then I check the spelling, do some search and replace that cannot be automated, and finally one last systematic proofreading. The whole process takes between 2 and 3 hours.
I'm sure better results can be achieved with a better command of the sed syntax. Since sed works through the file line by line, one must take special care not to write rules that would alter the subtitle number lines and/or the timing lines!
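
One way to avoid touching the number and timing lines is to apply the rules to the text lines only. The following Python sketch illustrates that idea under some assumptions: the file is well-formed SubRip (a number line, a timing line, then text, with blank lines between cues), and the names `apply_to_text_lines` and `french_punctuation` are invented for this example. It is not a port of the whole rules.sed.

```python
import re

INDEX_LINE = re.compile(r"^\d+$")
TIMING_LINE = re.compile(r"^\d+:\d+:\d+[,.]\d+\s*-->\s*\d+:\d+:\d+[,.]\d+")


def apply_to_text_lines(srt_text, rule):
    """Apply rule() to subtitle text lines only, leaving numbers and timings untouched."""
    out = []
    expect_index = True  # True at the start of the file and after a blank line
    for line in srt_text.split("\n"):
        stripped = line.strip()
        if not stripped:
            expect_index = True
            out.append(line)
        elif expect_index and INDEX_LINE.match(stripped):
            expect_index = False
            out.append(line)
        elif TIMING_LINE.match(stripped):
            expect_index = False
            out.append(line)
        else:
            expect_index = False
            out.append(rule(line))
    return "\n".join(out)


def french_punctuation(line):
    """Example rule: exactly one space before ; : ? ! (French typography)."""
    return re.sub(r" *([;:?!]+)", r" \1", line)
```

With this split, a rule such as `french_punctuation` no longer needs a second rule to repair the colons in the timings, and a subtitle whose text is just a number is not mistaken for a cue number because it does not follow a blank line.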

hongdida · Tue May 13, 2014 10:28 am

One vote for SubtitleEditor, it's really nice. How did I not know about such a nice tool before? Haha.

kerremelk · Sun Feb 01, 2015 5:26 am

Hi, I forgot I had asked for this sed script.
(And thank you for posting it in a code frame so I can copy it.)

Hmm, this looks like the best reason yet for me to run a virtual machine again and work out a set of truly automated job sequences that I cannot build with macros in the programs I currently use (as in, using more than one tool to get things neat and tidy).

THX.

mikelilin · Thu Nov 26, 2015 9:28 am

I agree. No matter how large the .sum is, the OCR tool can recognize multiple languages.

Fri Nov 27, 2015 8:38 am

In the meantime (over a year and a half ago) I switched to DvdSubExtractor, with excellent results.
The handling is a little different from the other programs mentioned, but the Extractor offers several clear advantages:

1. The margin of error is almost 0 percent

2. Additional features such as intelligent splitting of connected characters during the OCR process, an Undo function for reverting incorrect entries, and intelligent, faultless removal of HI items shortly before saving the file (if this feature is activated)

3. The apparently unlimited size of the memory file (OcrMap.bin), and the fact that it keeps learning: once a character is stored, the number of OCR inputs needed keeps shrinking

4. The ability to display the recognized characters in a table, during or after the read-out process, so that incorrectly recognized characters can be removed; the program then runs another pass, stops at the previously mis-stored character, and lets you correct it

5. So far it produces no I vs. l errors (!), in which the lower-case letter l is detected instead of the capital letter I

DvdSubExtractor is so strict that a single-pixel difference during the OCR is enough for it to stop and wait for input. It therefore has a very long training period, but in the end the results are worth it.
Once DvdSubExtractor is properly trained, you get (in most cases) error-free DVD and Blu-ray subs within a few minutes.
