kappy
Posts: 6
Joined: Tue Oct 04, 2011 3:03 pm

Char Encoding

Tue Oct 04, 2011 3:33 pm

Hello,

What is the native char encoding of the subtitles text?


More specifically, in the XML-RPC method DetectLanguage the text should be gzipped and base64 encoded, but should the text itself be in a specific encoding, like UTF-8, ANSI, or something else?

I'm trying to use the method above, but I always get language chi or jpn.


Is there any test text I can use to debug this issue, such as source text, gzipped data and base64 result, which I could compare against?

oss
Site Admin
Posts: 5879
Joined: Sat Feb 25, 2006 11:26 pm
Contact: Website

Re: Char Encoding

Fri Oct 14, 2011 10:42 am

hi, the native encoding of subtitles depends on the language.

in the background it works like this:

1. a user uploads subtitles, we don't check the encoding
2. another user downloads those uploaded subtitles, we don't know the encoding

for language recognition we use mo+po files, which contain the multiple encodings each language uses.

All together, that is why we have problems previewing subtitles correctly: finding out the encoding is harder than one might think.

to detect the language use (php): $string = base64_encode(gzcompress($text));

it should work for any encoding, basically - maybe except utf8 :)
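
For anyone not using PHP, here is a minimal Python sketch of the same payload. It relies on the fact that PHP's gzcompress produces a zlib stream (not a gzip file), which is what Python's zlib.compress produces too. The function names and the sample sentence are made up for illustration, and how the server interprets the decoded bytes is not documented in this thread.

```python
import base64
import zlib


def encode_for_detect_language(raw: bytes) -> str:
    """Mirror PHP's base64_encode(gzcompress($text)) for a byte string."""
    # zlib.compress emits a zlib stream, the same container gzcompress uses.
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def decode_payload(payload: str) -> bytes:
    """Reverse the encoding, handy for checking a payload built elsewhere."""
    return zlib.decompress(base64.b64decode(payload))


# Hypothetical sample text, encoded as a single-byte Windows codepage.
sample = "Hello, this is a short English test sentence.".encode("cp1252")
payload = encode_for_detect_language(sample)
```

Decoding a payload produced by another client with decode_payload and comparing it with the source bytes is one way to debug a wrong detection result.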

kappy
Posts: 6
Joined: Tue Oct 04, 2011 3:03 pm

Re: Char Encoding

Wed Oct 19, 2011 10:29 am

oss wrote: it should work for any encoding, basically - maybe except utf8 :)
Every string in C# is internally encoded as UTF-16 by default.
To make this work, I probably need to gzip the byte array directly.
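
A small Python sketch of that point, under the assumption that the server looks at raw bytes: the same text gives different payloads depending on which encoding it is turned into before compression, so the byte encoding has to be chosen explicitly. The helper name and the Russian sample are only illustrative.

```python
import base64
import zlib


def payload_for(text: str, encoding: str) -> str:
    """Encode text to bytes in the given encoding, then zlib + base64 it."""
    raw = text.encode(encoding)
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


text = "Привет, мир"  # illustrative sample only

# UTF-16 is what an in-memory C# string would give if dumped as-is.
utf16_payload = payload_for(text, "utf-16-le")
# A legacy single-byte codepage, as found in many uploaded subtitles.
cp1251_payload = payload_for(text, "cp1251")
```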

hector
Posts: 370
Joined: Wed Jan 01, 2014 12:27 pm
Location: Spain

Re: Char Encoding

Wed Jan 01, 2014 12:37 pm

Hi

In my opinion, you (the OS developers) should check the character encoding on upload and allow only UTF-8.

As a user I find it extremely annoying having to convert the encoding each time I get a subtitle file.
After, of course, spending some time trying to guess what encoding has been used.

Or, at least, add a field to store the file encoding.

rednoah
Posts: 84
Joined: Tue Mar 11, 2008 10:02 pm

Re: Char Encoding

Wed Jan 01, 2014 6:11 pm

IMHO anything that requires the user to select anything won't work. Most people can't even correctly set the language they're uploading, so how do you expect them to choose between ANSI, UTF-8, UTF-16, Big5, etc?

Sadly charset guessing is not an exact science. If there was a charset hint in the OS response it would be nice, but it would still be a guess, and I have no issue with doing that guess work client-side.

eduo
Posts: 716
Joined: Sat Feb 10, 2007 1:40 am
Location: Information Technology
Contact: ICQ Website Yahoo Messenger

Re: Char Encoding

Sun Jan 12, 2014 12:25 am

Couple of things:

A single encoding can't just simply be "chosen" because an enormous number (sadly) of players out there still don't support Unicode in any form. UTF-8 would be great. UTF-8 still isn't even the most common encoding in subtitles, although it's getting closer.

Charsets *are* an exact science, but they're one-way: a charset can't be figured out with 100% certainty once it has been applied. Mozilla's charset detection engine is the best at this, but it still fails sometimes.
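
A tiny Python illustration of why detection can only ever be a guess: the same byte is perfectly valid in several single-byte charsets and simply means a different letter in each. The three charsets are picked only as examples.

```python
# One byte, three equally valid readings.
raw = bytes([0xE9])

candidates = ["latin-1", "cp1251", "iso8859-7"]
decoded = {name: raw.decode(name) for name in candidates}

for name, char in decoded.items():
    # Nothing in the byte itself says which reading was intended.
    print(f"{name}: {char!r}")
```

Only context such as the language or letter statistics can tip the balance, which is what detection engines rely on.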

If I was OSS I'd store every sub received and store its encoding (if it can be figured out) and then provide this encoding back in the response when searching. This way at least the app can convert the encoding to whatever the user wants.
http://eduo.info/
[url=http://eduo.info/soleol/]OpenSubtitles from your desktop: SolEol for Mac/Windows/Linux[/url]
[url=http://forums.plexapp.com/index.php?showtopic=325&st=0&p=2480&#entry2480]My current episode processing work flow[/url].

oss
Site Admin
Posts: 5879
Joined: Sat Feb 25, 2006 11:26 pm
Contact: Website

Re: Char Encoding

Wed Jan 29, 2014 1:47 pm

eduo, I am working on that. It is more or less working now (using your EncoDet); just add subencoding as a parameter (iconv encodings, so a lot...):

http://www.opensubtitles.org/en/subtitl ... zations-pb

-> http://dl.opensubtitles.org/en/download/file/1954104687
-> http://dl.opensubtitles.org/en/download ... 1954104687

So please test it, and then I can do more with it... I also put this into the XML-RPC method, it is just not documented yet :) And more goodies are on the way. If you find some time, maybe you could write a WebVTT converter from some formats? Or are there some good classes for converting subtitle formats, so I could implement something more systematic...
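
For clients that prefer to convert locally once the source charset is known, a minimal Python sketch follows. It is not the server-side code; the function name is hypothetical, and Python codec names do not always match iconv names one to one.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode subtitle bytes from a known source charset to UTF-8."""
    # decode() raises UnicodeDecodeError if the charset hint is wrong
    # for these bytes, which is a useful signal rather than silent garbage.
    return raw.decode(source_encoding).encode("utf-8")


# Illustrative input: Russian text stored as cp1251.
original = "Привет".encode("cp1251")
converted = to_utf8(original, "cp1251")
```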

eduo
Posts: 716
Joined: Sat Feb 10, 2007 1:40 am
Location: Information Technology
Contact: ICQ Website Yahoo Messenger

Re: Char Encoding

Wed Jan 29, 2014 11:45 pm

This is very nice.

What format are you using to store internally? Webvtt? It's a super-set of SRT so it's pretty easy to go from WebVTT to SRT (actually, SRT validates as WebVTT, in a very simple way).
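
To show how small the step between the two formats is, here is a rough Python sketch of an SRT to WebVTT conversion. As far as I know the two differ in at least two small ways: WebVTT needs a WEBVTT header line and uses a dot rather than a comma before the milliseconds. The function name is made up, and styling tags, positioning and malformed files are ignored.

```python
import re

# Matches an SRT timestamp such as 00:01:02,345
TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_webvtt(srt_text: str) -> str:
    """Convert simple SRT text to WebVTT (header plus timestamp separator)."""
    lines = []
    for line in srt_text.lstrip("\ufeff").splitlines():
        # Only touch timing lines so commas in dialogue are left alone.
        if "-->" in line:
            line = TIMESTAMP.sub(r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"


srt = "1\n00:00:01,000 --> 00:00:02,500\nHello, world\n"
vtt = srt_to_webvtt(srt)
```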

The best library there is for format conversion is sublib, as you probably know. Sadly it is written in C#, which doesn't fit as easily in a web page, and some time ago it was absorbed into Gnome Subtitles.

oss
Site Admin
Posts: 5879
Joined: Sat Feb 25, 2006 11:26 pm
Contact: Website

Re: Char Encoding

Thu Jan 30, 2014 6:33 am

thanks. internally we always store the original subtitles, so all conversion is done on the fly. It would of course be good to have some PHP conversions; running C# on a FreeBSD server would be a very bad idea (using Mono and so on). If there is no PHP lib, maybe it is time to make a new one... based on sublib...

hector
Posts: 370
Joined: Wed Jan 01, 2014 12:27 pm
Location: Spain

Re: Char Encoding

Wed Dec 07, 2016 4:00 pm

Hi.
I wrote a simple bash script to unpack and fix encoding for OS subtitles. It is most useful for people (like me) who use subtitles in several languages.

Basically it takes a list of downloaded subtitles. Then unpacks the zip files fixing encoding and names, so you don't end up with a directory full of weird filenames like these:

The.Shining.1980.REMASTERED.UE.iNTERNAL.DVDRiP.X264.CD3-KiSS.srt
The.Shining.1980.BDrip.1080p.x264.DTS.Audio-CHD.Disk2.srt

Its main purpose was not to fix encoding but I added this functionality.

SmallBrother
Site Admin
Posts: 3724
Joined: Sun Mar 04, 2012 12:59 pm
Location: Somewhere on this globe

Re: Char Encoding

Wed Dec 07, 2016 6:05 pm

hector wrote:
The.Shining.1980.REMASTERED.UE.iNTERNAL.DVDRiP.X264.CD3-KiSS.srt
The.Shining.1980.BDrip.1080p.x264.DTS.Audio-CHD.Disk2.srt

Off-topic, but I couldn't resist...
Is it coincidence that you took "The Shining" as example?
:)

On-topic:
When logged-in as admin, I have somewhere written on the subtitle page what character encoding is detected:

Image

I estimate it is correct in, let's say, 95% of the cases.
Maybe it's an idea to have this on the subtitle page for every user.

The UTF-8 yes or no question, I still don't have an answer. I think technically it's best and most handy if you know what's going on. But reality is that many people don't know about encoding anyway, they use standard settings, some software doesn't even support UTF-8, etc. and then the UTF-8 file is experienced as 'broken' or 'corrupt'. See what I mean? Now what is "the best"?

hector
Posts: 370
Joined: Wed Jan 01, 2014 12:27 pm
Location: Spain

Re: Char Encoding

Wed Dec 07, 2016 6:41 pm

Off topic: yes, it is. Long weird names are everywhere in OS.
I don't understand why people keep doing things that don't add anything to the existing stuff. This is one of those masterpieces that don't deserve lousy remakes like the one made in 1997. Well, perhaps it is not lousy, but I recently got pissed off when I discovered the film I was about to watch was not the one I expected.

There is no point in doing something that is already done if you are adding nothing. And I don't know if this is what I'm doing here.

I've been searching for some program/library to do automatic encoding detection. What are you using?
I just know about python chardet. But I think it does the job without language information. Knowing the language renders the job much easier.

I am using a utility called "file". It is a general purpose tool to get some information about the format of an unknown file. It more or less succeeds with text files and even gives some information about encoding, sometimes erroneously.
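
To illustrate how knowing the language narrows things down, here is a crude Python sketch that tries a short, language-specific list of candidate charsets and returns the first one that decodes without error. The table and function name are assumptions made for this example, not what OS or chardet do, and "decodes without error" is a weak test since single-byte codepages accept almost any input.

```python
# Hypothetical candidate lists, UTF-8 first because it rejects most
# non-UTF-8 input, then the usual legacy codepages for that language.
CANDIDATES = {
    "rus": ["utf-8", "cp1251", "iso8859-5"],
    "ell": ["utf-8", "cp1253", "iso8859-7"],
    "pol": ["utf-8", "cp1250", "iso8859-2"],
}
DEFAULT = ["utf-8", "cp1252", "latin-1"]


def guess_encoding(raw: bytes, lang: str):
    """Return the first candidate charset that decodes raw, else None."""
    for encoding in CANDIDATES.get(lang, DEFAULT):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None
```

A real detector would also score the decoded text, for example by letter frequency, instead of stopping at the first codec that does not fail.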

SmallBrother
Site Admin
Posts: 3724
Joined: Sun Mar 04, 2012 12:59 pm
Location: Somewhere on this globe

Re: Char Encoding

Wed Dec 07, 2016 7:25 pm

hector wrote: Off topic: yes, it is.
Good to know that you are not about to enter my door with an axe, saying "Here comes Hector".

hector wrote: Long weird names are everywhere in OS.
I know you don't like them and I know why. But these long weird names serve a purpose. They reflect the video file name and sometimes that's mandatory for software to use that subtitle with the video file. Also, many uploaders don't state a full and correct release name. Choosing gets harder with release names like "The Shining", "The Shining 1980", "Shining (avi)", "NL subs", "Subtitle Edit", etc. Then luckily the subtitle file name is long and weird :)

hector wrote: I've been searching for some program/library to do automatic encoding detection. What are you using?
Oss would be the one to answer that, but I saw EncoDet by eduo being mentioned earlier.

hector
Posts: 370
Joined: Wed Jan 01, 2014 12:27 pm
Location: Spain

Re: Char Encoding

Wed Dec 07, 2016 8:18 pm

Well, I guess it depends on how you use them. The old machine-readable/human-readable question. Anyway, I've never understood the meaning of "release". A movie is a movie. It could have some different versions or cuts (then it should be: "directors_cut") but that's all. I don't understand all the fuss about "DVD", "Blu-ray", "Remastered", "1080p", "XVid", "x264"... or anything concerning audio.

Besides, there is another question here. I remember reading it somewhere in here. What if you download a subtitle, unpack it and then use it weeks or months later? How do you rate or comment on it? I mean, how can you obtain the URL from the file name? I think it is better to use some kind of ID. It could be the IMDB id but it can also be the OS id. That's what I've done. Perhaps automated tools with hash matching and all that jazz can do encoding conversion. But if you download manually you have to do it yourself. Well, this is it:

Code:

#!/bin/bash
set -e

targetdir=OS_orig_subs
test -d "$targetdir" || mkdir "$targetdir"
cd "$targetdir"

UNZIP=/usr/bin/unzip
KONWERT=/usr/bin/konwert

[ -x $UNZIP ] || {
    >&2 echo "You need unzip to extract the files. Please, install it"
    exit 1
}

[ -x $KONWERT ] || {
    >&2 echo "warning: konwert not found. I won't fix encodings"
    KONWERT=""
}

function warning {
    >&2 echo "warning: $1"
}

# these are iso 639-2 (not 639-3) codes
# NOTE: ISO 639-2 has "spa" and "por"; "esp" and "pot" are kept as posted
safe_iso1_languages="eng|esp|fre|ger|ita|fin|swe|pot|pob"

# $1 = subtitle file, $2 = language code
# prints "is_utf:is_fixed" on stdout
function fix_if_not_utf {
    local likely_encoding1=$(file "$1" | tr -s ' ' | cut -d':' -f2-)
    # remove leading whitespace
    likely_encoding1=$(echo "$likely_encoding1" | sed 's/^ *//')
    local likely_encoding2=""
    local file_is_utf="yes"
    local fixed="no"
    local tmp=$(mktemp)

    if ! echo "$likely_encoding1" | grep -q ^UTF-8 &&
       ! echo "$likely_encoding1" | grep -q ^ASCII; then
        file_is_utf="no"
        if [ -n "$KONWERT" ]; then
            if echo "$likely_encoding1" | grep -q ISO-8859; then
                # great, it is ISO-8859. Now we need to refine encoding
                # based on language information

                # for these languages file erroneously returns ISO-8859
                # when windows codepages are used, so try them first
                # ISO 639 code for Serbian is "srp" but OS uses "scc" ????
                case $2 in
                    bul|mac) likely_encoding2=cp1251 ;;
                    ara) likely_encoding2=cp1256 ;;
                esac
                if echo "$safe_iso1_languages" | grep -q "$2"; then
                    likely_encoding2=iso1
                else
                    case $2 in
                        alb|bos|cro|cze|pol|scc|slo) likely_encoding2=iso2 ;;
                        rus) likely_encoding2=iso5 ;;
                        ell) likely_encoding2=iso7 ;;
                        tur) likely_encoding2=iso9 ;;
                        heb) likely_encoding2=iso8 ;;
                    esac
                fi
            else
                # not ISO. Try Windows codepages
                if echo "$safe_iso1_languages" | grep -q "$2"; then
                    likely_encoding2=cp1252
                else
                    case $2 in
                        alb|bos|cro|cze|pol|scc|slo) likely_encoding2=cp1250 ;;
                        rus) likely_encoding2=cp1251 ;;
                        ell) likely_encoding2=cp1253 ;;
                        tur) likely_encoding2=cp1254 ;;
                        heb) likely_encoding2=cp1255 ;;
                    esac
                fi
            fi
            if [ -n "$likely_encoding2" ]; then
                # only replace the original if the conversion succeeded
                if $KONWERT $likely_encoding2-UTF8 "$1" > "$tmp"; then
                    mv "$tmp" "$1"
                    fixed="yes"
                fi
            fi
        fi
    fi
    rm -f "$tmp"
    # return 2 bool values: is_utf / is_fixed
    echo "$file_is_utf:$fixed"
}

for file in "$@"; do
    osid=$(echo "$file" | sed -r -e 's/.*\(([[:digit:]]+)\).zip/\1/')
    numfiles=$(echo "$file" | sed 's/.*\.\([[:digit:]]\)cd\..*/\1/')
    lang=$(echo "$file" | sed -r 's/.*\.([[:alpha:]]{3})\..*/\1/')
    if [ "$lang" != "$(isoquery -i639 "$lang" | cut -f1)" ]; then
        warning "language not recognised"
    fi
    file_is_utf_and_fixed="yes:no"
    seqnum=1
    file="../$file"
    if [ -f "$file" ]; then
        echo "extracting: $osid - $numfiles files - lang=$lang"
        # we need "s/ *$//" because unzip outputs some spaces after
        # the filenames
        extracted_files=$($UNZIP -o "$file" |
            sed -nr '/inflating/ {s/ *inflating: *//; s/ *$//; p}')
        # "|| true": a missing match must not abort the script under set -e
        nfo=$(echo "$extracted_files" | grep -E 'nfo$' || true)
        subs=$(echo "$extracted_files" | grep -E '\.(ssa|srt|sub)$' || true)
        numfiles2=$(echo "$subs" | wc -l)
        if [ "$numfiles" -ne "$numfiles2" ]; then
            warning "Number of files in archive and number in archive name differ"
        fi
        # some archives with only 1 sub (1 cd) erroneously contain
        # "cd1" in the name so we must treat this special case
        if [ "$numfiles2" -gt 1 ]; then
            IFS=$'\n'
            for zfile in $subs; do
                file_is_utf_and_fixed=$(fix_if_not_utf "$zfile" "$lang")
                # extract cd number from filename (it must contain "cd" or
                # "disk"); if we can't, assign a sequential number
                cdnum=$(echo "$zfile" | sed -r 's/.*(cd|disk)([[:digit:]]).*/\2/i')
                if [[ ! $cdnum =~ ^[[:digit:]] ]]; then
                    cdnum="seq$seqnum"
                    seqnum=$((seqnum+1))
                fi
                fileext=${zfile##*.}
                mv "$zfile" "${osid}_$cdnum.$fileext"
            done
            IFS=$' \t\n'
        else
            fileext=${subs##*.}
            file_is_utf_and_fixed=$(fix_if_not_utf "$subs" "$lang")
            mv "$subs" "$osid.$fileext"
        fi
        if [[ $file_is_utf_and_fixed == no* ]]; then
            warning "$osid is not UTF-8"
            if [[ $file_is_utf_and_fixed == *:no ]]; then
                warning "not fixed!"
            else
                warning "fixed"
            fi
        fi
        if [ -n "$nfo" ]; then
            mv "$nfo" "$osid.nfo"
        fi
    fi
done
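
The three sed expressions that pull the OS id, the number of files and the language out of the archive name can be sketched in Python as below. The example archive name is an assumption reconstructed from those regular expressions, not a documented naming scheme, and the names ArchiveInfo and parse_archive_name are made up for this sketch.

```python
import re
from typing import NamedTuple, Optional


class ArchiveInfo(NamedTuple):
    osid: Optional[str]
    numfiles: Optional[int]
    lang: Optional[str]


def parse_archive_name(name: str) -> ArchiveInfo:
    """Extract OS id, file count and language code from an archive name."""
    # Digits in parentheses right before ".zip" are taken as the OS id.
    osid = re.search(r"\((\d+)\)\.zip$", name)
    # A segment like ".2cd." gives the number of subtitle files.
    numfiles = re.search(r"\.(\d)cd\.", name)
    # The last three-letter segment between dots is taken as the language.
    lang = re.match(r".*\.([A-Za-z]{3})\.", name)
    return ArchiveInfo(
        osid.group(1) if osid else None,
        int(numfiles.group(1)) if numfiles else None,
        lang.group(1) if lang else None,
    )


# Hypothetical archive name, shaped to fit the patterns above.
info = parse_archive_name("the.shining.(1980).eng.2cd.(123456).zip")
```

Returning None for a missing field, instead of the whole file name as sed does when nothing matches, makes a malformed name easier to spot.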

hector
Posts: 370
Joined: Wed Jan 01, 2014 12:27 pm
Location: Spain

Re: Char Encoding

Wed Dec 07, 2016 8:44 pm

SmallBrother wrote: Good to know that you are not about to enter my door with an axe, saying "Here comes Hector"
Well, sometimes I've felt like that. Most remarkably when you changed the forum design. And sometimes with all the javas*t and ads in the web site. But don't worry, the Netherlands is too far from here. And I don't want to make such a long journey with a big axe :D See, ma, I can smile!!!
