Dear os & SD-developer,
As a big fan of SD, and following up on a previous post on "Disabling auto-detect language on subtitle upload", I have generated Brazilian and Portuguese n-gram fingerprints with the recommended ngram.py script, for Subdownloader's Portuguese/Brazilian language auto-detection.
I just tested them on 15 randomly downloaded POR/POB subtitles in ActivePython, using ngram.py and the newly created .LM files, and I'm quite happy to say that it worked 100% of the time.
Since Portuguese and Brazilian are so close, the entire subtitle file was used as input (around 25k characters). Furthermore, I did a series of string manipulations (such as stripping out "!", "?", numbers, dialog hyphens, <i>, {Y:i}) before submitting the text to the ngram script.
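For anyone who wants to reproduce the clean-up, here is a minimal sketch of the kind of string manipulation described above. It is not the exact code I ran: the function name clean_subtitle_text is made up for illustration, and it assumes the timing lines have already been removed from the subtitle text.

```python
import re

def clean_subtitle_text(raw):
    """Strip markup and punctuation from subtitle text before n-gram analysis.

    Hypothetical helper for illustration; assumes timing lines are already gone.
    """
    text = re.sub(r"<[^>]+>", " ", raw)      # HTML-style tags such as <i> and </i>
    text = re.sub(r"\{[^}]*\}", " ", text)   # brace codes such as {Y:i}
    # Dialog hyphens at the start of a line; hyphens inside words are kept.
    text = re.sub(r"^[ \t]*-[ \t]*", "", text, flags=re.MULTILINE)
    text = re.sub(r"[!?]", " ", text)        # exclamation and question marks
    text = re.sub(r"\d+", " ", text)         # numbers
    text = re.sub(r"\s+", " ", text)         # collapse whitespace left behind
    return text.strip()

sample = "<i>- Oi!</i>\n{Y:i}- Tudo bem? 123\nDisse-me que sim."
cleaned = clean_subtitle_text(sample)
```

Note that hyphens inside words (as in "Disse-me") are deliberately left alone, since only the dialog hyphens at the start of a line are noise.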
As I don't know how much of the subtitle file Subdownloader reads for auto-detection, or what kind of manipulations it applies, I guess these .LM files won't work in SD. Unfortunately, a few lines won't be enough to differentiate between POR and POB. Just see how close the two .LM files are to each other...
portugues and brasileiro.
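To put a number on "how close", here is a rough sketch of the usual rank-based n-gram comparison (the "out-of-place" measure from text categorization). I have not checked how ngram.py or Subdownloader implement this internally, so the function names, the "_" word padding, and the max_n and top defaults are illustrative assumptions only.

```python
from collections import Counter

def ngram_profile(text, max_n=5, top=400):
    """Return the `top` most frequent character n-grams, most frequent first.

    max_n and top are illustrative defaults, not values taken from ngram.py.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"            # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so the result is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]

def out_of_place(profile_a, profile_b):
    """Sum of rank differences; n-grams missing from profile_b get the maximum penalty."""
    rank_b = {gram: rank for rank, gram in enumerate(profile_b)}
    max_penalty = len(profile_b)
    return sum(
        abs(rank - rank_b[gram]) if gram in rank_b else max_penalty
        for rank, gram in enumerate(profile_a)
    )
```

A sample would be assigned to whichever language profile gives the smaller distance. With two profiles as similar as POR and POB, the two distances for a short sample can end up nearly equal, which is why I fed in the whole file.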
Let me know what you think, and whether I can be of any help with this issue.
Best,
Gringo