.LM files for language auto detection POR vs POB

Gringo · Mon Feb 26, 2007 12:39 am

Dear os & SD-developer,

As a big fan of SD, and following a previous post on "Disabling auto-detect language on subtitle upload", I have generated Brazilian and Portuguese n-gram fingerprints using the recommended ngram.py script, for Subdownloader's Portuguese/Brazilian language auto-detection.

I just tested them on 15 randomly downloaded POR/POB subtitles in ActivePython, using ngram.py and the newly created .LM files, and I'm quite happy that it worked 100% of the time.

Since Portuguese and Brazilian are so close, the entire subtitle file was used as input (around 25k characters). Furthermore, I did a series of string manipulations (such as stripping out "!", "?", numbers, dialog hyphens, <i>, {Y:i}) before submitting the text to the ngram script.
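For anyone who wants to reproduce that kind of clean-up, here is a minimal Python sketch. It is not the exact code used for these .LM files; the function name clean_subtitle_text and the regular expressions are assumptions, covering only the items listed above.

```python
import re

def clean_subtitle_text(raw):
    """Strip subtitle markup and punctuation before n-gram profiling.

    Hypothetical helper: the patterns below are assumptions, not the
    original clean-up code.
    """
    text = re.sub(r"<[^>]+>", " ", raw)       # HTML-style tags such as <i> and </i>
    text = re.sub(r"\{[^}]*\}", " ", text)    # brace tags such as {Y:i}
    # dialog hyphens at the start of a line (hyphens inside words are kept)
    text = re.sub(r"^\s*-\s*", "", text, flags=re.MULTILINE)
    text = re.sub(r"[!?0-9]", " ", text)      # "!", "?" and digits
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

if __name__ == "__main__":
    sample = "<i>- Ola!</i>\n{Y:i}Voce tem 2 minutos?"
    print(clean_subtitle_text(sample))
```

Timestamp lines in formats such as SRT are not treated specially here: their digits are removed, but the leftover ":" and "-->" characters would need an extra rule.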

As I don't know how much of the subtitle file is read for auto-detection in Subdownloader, or what kind of manipulations are applied, I suspect these .LM files won't work in SD as they are. Unfortunately, a few lines of text won't be enough to differentiate between POR and POB. Just see how close the two .LM files are to each other: portugues and brasileiro.
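To illustrate why two close fingerprints need a lot of input text, here is a rough Python sketch of the general rank-order n-gram technique (character n-gram profiles compared with an "out-of-place" distance). It is an assumption that ngram.py works along these lines; the function names, the "_" word padding, and the default of 400 n-grams are illustrative choices, not taken from the script or from SD.

```python
from collections import Counter

def ngram_profile(text, max_n=5, top=400):
    """Return the `top` most frequent character n-grams (1..max_n), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Descending count, then alphabetical so ties are ordered deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get the maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc_profile))

def detect(text, profiles):
    """Return the name of the language profile closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))
```

When two language profiles share most of their top-ranked n-grams, the two distances for a short text come out nearly equal, so the decision rests on a handful of lower-ranked n-grams that only show up reliably in longer input.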

Let me know what you think and if I can be of any help in this issue.

Best,
Gringo

Mon Feb 26, 2007 12:13 pm

Very, very good. We will implement this in the new Subdownloader. Also, there is still an option for uploading in the same language.

THANKS

capiscuas · Thu Jul 31, 2008 9:16 am

Hi Gringo, can you move your request into the new Subdownloader project?

We are currently using the Brazilian.lm that comes by default, and no pretreatment is applied to the subtitle before auto-detecting the language (such as removing "?" and "!"; I'm not sure whether that would help).

Thanks for using it.

https://bugs.launchpad.net/subdownloader
