cameny
Posts: 3
Joined: Thu Dec 07, 2006 9:44 pm

Sat Dec 09, 2006 8:46 pm

I restarted the computer in the meantime, and it looks like v1.2.3 now works OK without any intervention.

The problem was probably caused by installing v1.2.3 over v1.2.2 while it was still running.
Or is it possible that it was a problem with the web/database connection on your side?



About my proposal: anything is possible, and people put different things in filenames, but VobSub treats the text between the last two dots of the filename as the standard language identifier (provided all the text before it matches the filename of the film).

so for example:
movie The.Departed.TC.XViD-PUKKA.avi

... can have the following subtitle names for Croatian, for example:
The.Departed.TC.XViD-PUKKA.Croatian.srt (or .sub or .txt or whatever)
The.Departed.TC.XViD-PUKKA.HR.srt
The.Departed.TC.XViD-PUKKA.Hrvatski.srt
The.Departed.TC.XViD-PUKKA.CRO.srt
etc.
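
A minimal Python sketch of the filename rule described above, assuming the tag is whatever sits between the last two dots and that the rest of the name must equal the movie's name without its extension. The function name `language_tag` is made up for illustration; this is not VobSub's or SubDownloader's actual code.

```python
from pathlib import Path


def language_tag(subtitle_name, movie_name):
    """Return the text between the last two dots of the subtitle filename,
    or None when the part before it does not match the movie's filename."""
    sub_stem = Path(subtitle_name).stem    # drops the ".srt" / ".sub" extension
    movie_stem = Path(movie_name).stem     # drops the ".avi" extension
    base, dot, tag = sub_stem.rpartition(".")
    if dot and tag and base == movie_stem:
        return tag
    return None


movie = "The.Departed.TC.XViD-PUKKA.avi"
for name in ("The.Departed.TC.XViD-PUKKA.Croatian.srt",
             "The.Departed.TC.XViD-PUKKA.HR.srt",
             "The.Departed.TC.XViD-PUKKA.srt"):
    print(name, "->", language_tag(name, movie))
```

The last name has no tag at all, so the check against the movie name keeps "XViD-PUKKA" from being mistaken for a language.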

I think you have to make a table of languages and put in it the most common international codes as well as the local name of each language.

So the entry for French, for example, would be:
French = French, Français, FR, FRE
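
Such a table could be sketched in Python as a dictionary that is inverted once for case-insensitive lookup. The two entries below are taken only from the examples in this post; the names `LANGUAGE_ALIASES` and `resolve_language` are hypothetical.

```python
# Language name -> aliases that may appear in a filename.
LANGUAGE_ALIASES = {
    "French": ["French", "Français", "FR", "FRE"],
    "Croatian": ["Croatian", "Hrvatski", "HR", "CRO"],
}

# Inverted index: lower-cased alias -> language name.
ALIAS_TO_LANGUAGE = {
    alias.casefold(): language
    for language, aliases in LANGUAGE_ALIASES.items()
    for alias in aliases
}


def resolve_language(tag):
    """Map a filename tag to a language name, or None if it is unknown."""
    return ALIAS_TO_LANGUAGE.get(tag.casefold())


print(resolve_language("fre"))       # French
print(resolve_language("Hrvatski"))  # Croatian
print(resolve_language("xx"))        # None
```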


I won't insist on it, but I think it is important, because it is very easy to make a wrong entry in your database during upload: when I try to upload a sub in English, SubDownloader suggests that it is Croatian (which is my default and in this case is wrong).


BTW, some other programs, such as MV2 Player (and my home standalone DVD/DivX player), recognize the language of the subtitle **directly from the subtitle text**. I don't know how it works, but it works. :D


Have you thought about enlarging your subtitle database by ripping some of the other big subtitle sites on the web?

oss
Site Admin
Posts: 5887
Joined: Sat Feb 25, 2006 11:26 pm
Contact: Website

Sun Dec 10, 2006 7:49 pm

Hello cameny,

Thanks for the great suggestion. I've talked with Ivan (who is coding SubDownloader), and I found a Python module for language autodetection, so it seems it will be possible. It will work in two steps:
1. try to extract the language from the subtitle filename
2. if that is not possible, run our own language autodetection
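
The two steps can be sketched as a small fallback function. This is only an illustration, not SubDownloader's code: `detect_language` and both toy detectors are hypothetical, and a real content detector would be far more capable than the one shown.

```python
from typing import Callable, Optional

Detector = Callable[[str], Optional[str]]


def detect_language(filename: str, text: str,
                    by_filename: Detector, by_content: Detector) -> Optional[str]:
    # Step 1: trust the filename tag when one is recognised.
    language = by_filename(filename)
    if language is not None:
        return language
    # Step 2: otherwise fall back to autodetection on the subtitle text.
    return by_content(text)


# Toy detectors, for illustration only.
def toy_by_filename(name):
    tag = name.rsplit(".", 2)[-2].lower() if name.count(".") >= 2 else ""
    return {"hr": "hrv", "en": "eng"}.get(tag)


def toy_by_content(text):
    return "hrv" if "č" in text or "ć" in text else None


print(detect_language("Movie.HR.srt", "", toy_by_filename, toy_by_content))
print(detect_language("Movie.srt", "Što će reći", toy_by_filename, toy_by_content))
```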

OK, now we need a list of possible language names in subtitle filenames (it will be possible to edit them in the CONFIG file, I guess).

Code:

alb albanian, alb, shqip, sqi, sq
ara arabic, ara, ar
arm armenian, arm, hye, hy
ass assyrian, ass
bos bosnian, bosanski, bos, bs
bul bulgarian, bg, bul
cat catalan, cat, ca, catala
chi chinese, chi, zho, zh
hrv croatian, hrvatski, hr, cro, hrv, scr
cze czech, cs, cz, český, cesky, cze, ces
dan danish, dan, da, dansk
dut dutch, nl, dut, nld, flemish, nederlands, vlaams
eng english, en, eng
est estonian, est, et, eesti
fin finnish, fin, fi, suomi, suomen
fre french, français, fr, fre, francais, fra, francaise
ger german, ger, deu, de, deutsch
ell greek, gre, ell, el
heb hebrew, heb, he
hun hungarian, hun, hu, magyar
ind indonesian, ind, id
ita italian, it, ita, italiano
jpn japanese, jpn, ja
kaz kazakh, kaz, kk
lav latvian, lav, lv, latviese
lit lithuanian, lit, lt
nor norwegian, nor, no, norsk
per persian, per, far, fa
pol polish, pol, pl, polski
por portuguese, portugal, por, pt, portugues
pob brazilian, pb, pob
rum romanian, rum, ron, ro, romana
rus russian, rus, ru
scc serbian, scc, srp, sr
slo slovak, sk, svk, slovenský, slovensky, slo, slovencina, slovenčina
slv slovenian, slv, sl, slovenscina, slovenščina
spa spanish, sp, esp, spa, espanol, espanyol, es, castilian, castellano
swe swedish, swe, sv, svenska
tha thai, tha, th
tur turkish, tur, tr, turkce, turk
ukr ukrainian, ukr, uk
The first item in each entry is the ISO 639 language code according to http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes

If you find any errors or want to add something, just write here.

ixquic
Posts: 73
Joined: Wed Aug 23, 2006 4:39 pm

Sun Dec 10, 2006 9:08 pm

Yes, would be great to have that. Maybe even detect all subtitle files for that video in the local folder, and present them in a table-like form so you can verify the data and submit them en bloc?

2 minor corrections:
- VobSub doesn't require a dot between the movie name and the language (or whatever), so the filename could be something like The.Departed.TC.XViD-PUKKA_tlh.srt (btw tlh = Klingon :wink: )
- pob is not an ISO 639 code; the standard form is pt-br (Brazilian Portuguese is not a language of its own). I'm aware some people use pob, but please include the standard as well.
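
For illustration, a hedged Python sketch of a looser matcher that accepts either "." or "_" before the tag and also region-qualified tags such as pt-br. The pattern and the name `trailing_tag` are assumptions made for this example, not VobSub's actual rule.

```python
import re

# Separator, then a 2-3 letter code, optionally followed by "-xx" (e.g. pt-br).
TAG_PATTERN = re.compile(r"[._]([A-Za-z]{2,3}(?:-[A-Za-z]{2})?)$")


def trailing_tag(subtitle_name):
    """Return the lower-cased tag at the end of the name, or None."""
    stem, _, _ext = subtitle_name.rpartition(".")   # strip the extension
    match = TAG_PATTERN.search(stem)
    return match.group(1).lower() if match else None


print(trailing_tag("The.Departed.TC.XViD-PUKKA_tlh.srt"))
print(trailing_tag("The.Departed.TC.XViD-PUKKA.pt-br.srt"))
print(trailing_tag("The.Departed.TC.XViD-PUKKA.srt"))
```

A pattern this loose will also pick up release tags: "Movie.TC.srt" yields "tc". That is one reason to look the result up in a table of known codes, or to compare the rest of the name against the movie file, before trusting it.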

And why not use one of these ISO 639 code lists?

oss
Site Admin
Posts: 5887
Joined: Sat Feb 25, 2006 11:26 pm
Contact: Website

Sun Dec 10, 2006 11:05 pm

ixquic wrote: Yes, would be great to have that. Maybe even detect all subtitle files for that video in the local folder, and present them in a table-like form so you can verify the data and submit them en bloc?
Yes, exactly, this is the plan: a mass uploader is on the todo list, and it should work like you wrote.

ixquic wrote: 2 minor corrections:
- Vobsub doesn't require a dot between movie name and language (or whatever), so the filename could be something like The.Departed.TC.XViD-PUKKA_tlh.srt (btw tlh = klingon :wink: )
I don't think this is good; we will support only the dotted convention, .XX.(sub|srt|...).

ixquic wrote: - pob is not an ISO639 code, it's pt-br (Brazilian portuguese is not a language of its own). I'm aware some people use pob but please include the standard as well.
That's right, but I needed to use some code, so I created this one :)

ixquic wrote: and why not use one of these ISO-639 codes lists
Which exactly do you mean? There are many codes there. If you mean the 2-character codes, that is not possible by the design of the app :(

pixxel
Posts: 1
Joined: Tue Dec 12, 2006 11:29 am

Tue Dec 12, 2006 11:40 am

Hi ppl, here's another suggestion: because not all subtitles are named as neatly as cameny described, would it be possible to detect the language by looking at uncommon letters? (E.g. Serbian, Croatian, Bosnian and others have distinctive diacritical letters such as č, ć, š, ž etc.) These letters can be used to detect the language. There are many collations of both Latin and Cyrillic characters, so this would be a boring job to code, but it should provide excellent detection once it is done.
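
A rough Python sketch of that idea, assuming a hand-made table of accented letters per language. The four letter sets below are illustrative and incomplete, and `guess_by_diacritics` is a made-up name. Languages that share an alphabet (Croatian and Bosnian, say) cannot be told apart this way, and plain-ASCII text gives no answer at all.

```python
from collections import Counter

# Illustrative letter sets only; a real table would cover many more languages.
DISTINCTIVE_LETTERS = {
    "hrv": set("čćđšž"),
    "cze": set("áčďéěíňóřšťúůýž"),
    "pol": set("ąćęłńóśźż"),
    "ger": set("äöüß"),
}


def guess_by_diacritics(text):
    """Pick the language whose accented letters occur most often, or None."""
    counts = Counter(text.lower())
    scores = {
        lang: sum(counts[ch] for ch in letters)
        for lang, letters in DISTINCTIVE_LETTERS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_by_diacritics("Što se dogodilo? Đavo će doći."))
print(guess_by_diacritics("Zażółć gęślą jaźń"))
print(guess_by_diacritics("Hello there"))
```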

For more info check
http://en.wikipedia.org/wiki/Diacritic
http://en.wikipedia.org/wiki/Alphabets_ ... _the_Latin
http://en.wikipedia.org/wiki/Alphabets_ ... _sequences
http://en.wikipedia.org/wiki/Cyrillic

Just my two cents, because I saw this kind of autodetection once in some DivX player (can't remember which one, though; it was 3-4 years ago).

oss
Site Admin
Posts: 5887
Joined: Sat Feb 25, 2006 11:26 pm
Contact: Website

Tue Dec 12, 2006 1:15 pm

pixxel wrote: Hi ppl, here's another suggestion - because not all subtitles ..
Just my two cents, because i saw this auto detection once in some divx player (can't remember, though, which one, it was 3-4 years ago)
You didn't read our conversation properly. I wrote:
1. try to extract the language from the subtitle filename
2. if that is not possible, run our own language autodetection
and that autodetection works similarly to what you wrote, but a lot better :)

cameny
Posts: 3
Joined: Thu Dec 07, 2006 9:44 pm

Fri Dec 15, 2006 6:21 pm

Hey, it's nice that you opened a whole topic just about this. :D

Thanks for the effort; this looks like it will be a nice feature.

TZOTZIOY
Posts: 25
Joined: Mon Dec 18, 2006 10:26 am
Location: Athens (the original one)
Contact: ICQ Website

codec and language guesser

Tue Dec 19, 2006 9:01 pm

Guys, I do have a Python module that currently guesses the encoding of a text file, but I am in the process of converting it to return a tuple of (codec, language_code). All the data needed to train (and verify :) the module is included in the site. It works with a matrix of character (str) pairs and the matching codecs/languages.
Should you need it, let me know.
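
One way such a pair matrix could work, sketched in Python under loud assumptions: this is not the module described above, the two training samples are tiny stand-ins, and `byte_pairs`, `train` and `guess` are invented names. Each (codec, language) gets a profile of adjacent byte pairs from a sample encoded in that codec, and an unknown file is scored by how many of its byte pairs occur in each profile.

```python
from collections import Counter


def byte_pairs(data: bytes) -> Counter:
    """Count adjacent byte pairs in raw, undecoded data."""
    return Counter(zip(data, data[1:]))


def train(samples):
    """samples: {(codec, language_code): sample text} -> byte-pair profiles."""
    return {key: byte_pairs(text.encode(key[0])) for key, text in samples.items()}


def guess(data: bytes, profiles):
    """Return the best-matching (codec, language_code), or None on no overlap."""
    pairs = byte_pairs(data)
    scores = {
        key: sum(n for pair, n in pairs.items() if pair in profile)
        for key, profile in profiles.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Tiny made-up training samples; real profiles need far more text.
SAMPLES = {
    ("cp1250", "hrv"): "što će reći žena čovjeku",
    ("iso8859_7", "ell"): "καλημέρα σας τι κάνετε",
}
PROFILES = train(SAMPLES)

print(guess("žena će reći".encode("cp1250"), PROFILES))
print(guess("τι κάνετε".encode("iso8859_7"), PROFILES))
```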
--
Just an earthbound misfit, I.

oss
Site Admin
Posts: 5887
Joined: Sat Feb 25, 2006 11:26 pm
Contact: Website

Sat Dec 23, 2006 6:09 pm

Hello,

We don't need to know the encoding of the subtitle file, but we do need the language. For this we will use the common TextCat. Read more at: http://www.let.rug.nl/~vannoord/TextCat ... itors.html :)
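
TextCat is based on the N-gram text categorization method of Cavnar and Trenkle: rank the most frequent character n-grams of a text and compare that ranking with per-language profiles using an "out-of-place" distance. Below is a much simplified Python sketch of the idea, not TextCat's own code; the function names, the two one-sentence training samples and the `max_n`/`top` defaults are assumptions for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Map each of the `top` most frequent n-grams (n = 1..max_n) to its rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"               # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    are charged a fixed maximum penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def classify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# One-sentence samples, far too small for real use.
SAMPLES = {
    "eng": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "ger": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

print(classify("the dog sleeps in the house", PROFILES))
```

With profiles trained on a few kilobytes of text per language, this kind of ranking tends to work on short inputs too, which is what makes it attractive for subtitle files.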
