Gringo
Posts: 43
Joined: Sun Jan 07, 2007 2:22 am

.LM files for language auto detection POR vs POB

Mon Feb 26, 2007 12:39 am

Dear os & SD-developer,

As a big fan of SD, and following up on a previous post about disabling language auto-detection on subtitle upload, I have generated Brazilian and Portuguese n-gram fingerprints with the recommended ngram.py script, for Subdownloader's Portuguese/Brazilian language auto-detection.

I just tested them on 15 randomly downloaded POR/POB subtitles in ActivePython, using ngram.py and the newly created .LM files, and I'm quite happy that all 15 were detected correctly.

Since Portuguese and Brazilian are so close, the entire subtitle file was used as input (around 25k characters). Furthermore, I did a series of string manipulations (such as stripping out "!", "?", numbers, dialog hyphens, <i>, {Y:i}) before submitting the text to the ngram script.
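For readers who want to try the same kind of cleanup, here is a minimal sketch of the string manipulations listed above. It is not the script that was actually used: the function name clean_subtitle_text and the exact regular expressions are assumptions for illustration, and it covers only the items named in the post.

```python
import re

def clean_subtitle_text(raw: str) -> str:
    """Hypothetical helper: strip markup and noise before n-gram profiling."""
    text = re.sub(r"<[^>]+>", " ", raw)       # HTML-like tags such as <i> and </i>
    text = re.sub(r"\{[^}]*\}", " ", text)    # brace tags such as {Y:i}
    # Dialog hyphens at the start of a line (assumption: only leading hyphens are dropped)
    text = re.sub(r"^[ \t]*-[ \t]*", "", text, flags=re.MULTILINE)
    text = re.sub(r"[0-9!?]", " ", text)      # digits, "!" and "?"
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    return text.strip()

if __name__ == "__main__":
    sample = "<i>- Olá!</i>\n{Y:i}- Tudo bem? 123"
    print(clean_subtitle_text(sample))
```

Hyphens inside words (for example "disse-me") are left alone on purpose, since they carry information that may help tell the two variants apart.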

As I don't know how much of the subtitle file Subdownloader reads for auto-detection, or what kind of manipulations it applies, I guess these .LM files won't work in SD as they are. But to differentiate between POR and POB, unfortunately, a few lines won't be enough. Just see how close the two .LM files are to each other: portugues and brasileiro.
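To make "how close" concrete, here is a rough sketch of rank-ordered n-gram profiles compared with an out-of-place distance, the general approach used by TextCat-style detectors. Whether ngram.py or Subdownloader computes things exactly this way is an assumption; the function names, the profile size of 400 and the two sample sentences are illustrative only.

```python
from collections import Counter

def ngram_profile(text, max_n=5, top=400):
    """Return the `top` most frequent character n-grams (length 1..max_n), best first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"            # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(r - rank[g]) if g in rank else penalty
               for r, g in enumerate(doc_profile))

def detect(text, profiles):
    """Return the key of the language profile closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

if __name__ == "__main__":
    # Toy training texts (assumption: real profiles would come from the .LM files)
    profiles = {
        "por": ngram_profile("o autocarro chegou à paragem"),
        "pob": ngram_profile("o ônibus chegou no ponto"),
    }
    print(detect("o autocarro chegou à paragem", profiles))
```

With short inputs the document profile holds few n-grams and most of them are shared by both variants, so the two distances end up nearly equal; that is one way to see why a few lines are unlikely to separate POR from POB.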

Let me know what you think and if I can be of any help in this issue.

Best,
Gringo

oss
Site Admin
Posts: 5890
Joined: Sat Feb 25, 2006 11:26 pm

Mon Feb 26, 2007 12:13 pm

Very, very good. We will implement this in the new Subdownloader. Also, there is still an option for uploading in the same language.

THANKS

capiscuas
Posts: 82
Joined: Mon Apr 17, 2006 5:33 am

Thu Jul 31, 2008 9:16 am

Hi Gringo, can you move your request into the new Subdownloader project?

We are currently using the Brazilian.lm that comes by default. No preprocessing is done on the subtitle before language auto-detection (such as removing "?" and "!"; I'm not sure whether that would help).

Thanks for using it.

https://bugs.launchpad.net/subdownloader
