jcdr
Posts: 540
Joined: Sun Apr 08, 2012 9:49 am

Suggestion for all files to be UTF-8

Wed Nov 28, 2012 7:06 pm

In the topic "Problems with the addition of new lines in the srt files", we had a heated but interesting debate on encodings, and on whether subtitle files should preferably be UTF-8 or ANSI.

I often stumble upon a subtitle file whose encoding is very difficult to identify. For instance, a file for the movie 'Elling' contains "Let's go to the caf¨¦ around the corner.", and I had to try many encodings before finding that it is Chinese GB18030! GB18030 is the PRC national standard, as powerful as UTF-8, and its use is compulsory in the PRC for all documents, for reasons of compatibility with former Chinese standards.
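
For anyone who wants to reproduce this kind of garbling, here is a minimal Python sketch. It assumes the player or viewer decoded the GB18030 bytes as Windows-1252, which is my guess at what happened and not something I can confirm for that particular file.

```python
# Encode an English sentence with one accented letter as GB18030,
# then read the same bytes back the way a Western-configured player might.
original = "Let's go to the café around the corner."

raw = original.encode("gb18030")     # "é" becomes the two bytes A8 A6
garbled = raw.decode("cp1252")       # each byte is shown as its own character
restored = raw.decode("gb18030")     # decoding with the right codec recovers the text

print(garbled)
print(restored)
```

Since the rest of the sentence is plain ASCII, only the accented letter gives the wrong encoding away, which is why such files are hard to spot.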

Other examples, for other languages:
http://www.opensubtitles.org/en/subtitles/1626/ is an English file encoded in Japanese CP932,
http://www.opensubtitles.org/fr/subtitles/135003/ are English files encoded in Korean CP949,
http://www.opensubtitles.org/en/subtitles/4728198/ are English files encoded in Chinese GB18030,
http://www.opensubtitles.org/en/subtitles/3152794/ is an Indonesian file encoded in Chinese GB18030,
etc. etc.

The rule is that if the file originates from a DVD or Blu-ray copy made in Asia (and there are many!), it is likely to use an Asian encoding, even if the text is in a Western language.

The second point is that some encodings become obsolete quickly. Ten years ago, DOS or OEM encodings were still widely used, but who uses them nowadays?
No doubt nobody will use ISO-8859, Windows (ANSI) or Mac OS encodings in ten years' time, but the subtitle files will probably still be there on opensubtitles.org, with users having to guess which f***g vintage encoding was used.

The third point is that users do not follow the rules: for instance, a Croatian sub should be encoded in CP1250 rather than CP1252, and we should not be reading "kreme brűlée" instead of "kreme brûlée" in http://www.opensubtitles.org/fr/subtitles/22451/.
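
The CP1250/CP1252 mix-up is easy to reproduce. A small sketch, assuming the file was written as CP1252 and then displayed as CP1250 (the helper name `misread` is just mine, for illustration):

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one code page and decode the bytes with another."""
    return text.encode(written_as).decode(read_as)

# Byte FB is "û" in CP1252 but "ű" in CP1250; "é" (E9) is the same in both.
print(misread("kreme brûlée", "cp1252", "cp1250"))
```

Only the characters that differ between the two code pages change, so most of the text still looks right and the error is easy to miss.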

The fourth point is that various users have players compatible with only some encodings, and as the encodings are not advertised, users have to download files until they find one in a suitable encoding (cf. srtpal, who uses TotalMedia Theatre 5, which is compatible with UTF-16).

The fifth point is that having all files in UTF-8 would allow:
- correct display in the preview window, without � characters;
- much better scripts for spotting similar files;
- automatic language detection;
- detection of OCR or spelling errors;
- shared file corrections by users;
- etc.
In brief, it would allow much better quality for both the files and the website.

The idea is that users could select their preferred encoding(s) in their profile, and download the files in the encoding they wish.
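
As a rough illustration of serving a file in the user's preferred encoding, here is a Python sketch. It assumes the site stores everything as UTF-8; the function name `transcode` and the choice to substitute "?" for unrepresentable characters are my own assumptions, not anything the site does today.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert subtitle bytes from one encoding to another."""
    # Raises UnicodeDecodeError if the bytes are not valid in the source encoding.
    text = data.decode(source)
    # Characters missing from the target code page are replaced by "?".
    return text.encode(target, errors="replace")

stored = "čaj".encode("utf-8")                 # what the site would keep
print(transcode(stored, "utf-8", "cp1250"))    # Central European code page has "č"
print(transcode(stored, "utf-8", "cp1252"))    # Western code page does not
```

The second call shows the price of downloading in a legacy encoding: whatever the target code page lacks is lost, whereas the stored UTF-8 copy stays intact.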

Users would not have to upload their files in UTF-8. Rather, the site would detect the encoding (which is possible in 99.5% of cases) and ask the user to choose among candidates for the remaining 0.5%.
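
The easy part of such detection can be sketched in a few lines of Python. This is only a first stage under my own assumptions: the function name `guess_encoding` is hypothetical, and anything that is neither ASCII nor valid UTF-8 is simply reported as unknown, to be handled by further heuristics or by asking the uploader.

```python
from typing import Optional

def guess_encoding(data: bytes) -> Optional[str]:
    """Return 'ascii' or 'utf-8' when the bytes prove it, else None."""
    try:
        data.decode("ascii")
        return "ascii"          # no high-bit bytes: every ASCII-compatible encoding agrees
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")    # strict by default
        return "utf-8"          # legacy 8-bit text rarely forms valid UTF-8 by accident
    except UnicodeDecodeError:
        return None             # legacy encoding: needs heuristics or a human

print(guess_encoding(b"Hello"))
print(guess_encoding("café".encode("utf-8")))
print(guess_encoding("café".encode("cp1252")))
```

The last case is the one where the uploader, or an admin, would be asked to pick from a short list.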

What are your thoughts for/against this?

jcdr
Posts: 540
Joined: Sun Apr 08, 2012 9:49 am

Re: Suggestion for all files to be UTF-8

Mon Dec 03, 2012 4:05 pm

Another good example of an encoding a user could not find, commented yesterday:
http://www.opensubtitles.org/fr/subtitles/3668372/les-chinois-a-paris-fr#discussion

oss
Site Admin
Posts: 5916
Joined: Sat Feb 25, 2006 11:26 pm
Contact: Website

Re: Suggestion for all files to be UTF-8

Thu Dec 13, 2012 10:42 am

We can proceed on this issue, but I need a reliable algorithm to find the right encoding of the source subtitle. For now I am using enca (1.13) http://gitorious.org/enca

I would start by showing the encoding of files on the website. Then, if it is 100% reliable, I can proceed with further development.

Enca supports these languages/encodings:

Code:

# enca --list languages
belarussian: CP1251 IBM866 ISO-8859-5 KOI8-UNI maccyr IBM855 KOI8-U
bulgarian: CP1251 ISO-8859-5 IBM855 maccyr ECMA-113
czech: ISO-8859-2 CP1250 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
estonian: ISO-8859-4 CP1257 IBM775 ISO-8859-13 macce baltic
croatian: CP1250 ISO-8859-2 IBM852 macce CORK
hungarian: ISO-8859-2 CP1250 IBM852 macce CORK
lithuanian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
latvian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
polish: ISO-8859-2 CP1250 IBM852 macce ISO-8859-13 ISO-8859-16 baltic CORK
russian: KOI8-R CP1251 ISO-8859-5 IBM866 maccyr
slovak: CP1250 ISO-8859-2 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
slovene: ISO-8859-2 CP1250 IBM852 macce CORK
ukrainian: CP1251 IBM855 ISO-8859-5 CP1125 KOI8-U maccyr
chinese: GBK BIG5 HZ
none:

It cannot detect the encoding if I don't specify the language, and if the language is not in enca's list it returns an unknown encoding. So I need some other way to find out the encoding, and so far I have not found a reliable solution. I have a routine that detects UTF encodings from the BOM in the file header:

Code:

UTF32_BIG_ENDIAN_BOM
UTF32_LITTLE_ENDIAN_BOM
UTF16_BIG_ENDIAN_BOM
UTF16_LITTLE_ENDIAN_BOM
UTF8_BOM

Any help is welcome.
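
For reference, BOM-based detection of this kind could look like the Python sketch below. It is not the site's actual routine (which I have not seen); the function name `detect_bom` and the returned codec names are assumptions for illustration.

```python
import codecs
from typing import Optional

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding announced by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(detect_bom(codecs.BOM_UTF8 + b"1\n00:00:01,000 --> 00:00:02,000\n"))
print(detect_bom(b"no BOM here"))
```

A missing BOM proves nothing, since many UTF-8 files are written without one, so this can only be a first check before any content-based guessing.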

subshare
Posts: 9
Joined: Fri Jan 07, 2011 4:57 am

Re: Suggestion for all files to be UTF-8

Wed Feb 06, 2013 8:01 am

I suggest adding at least a new field to the upload form where uploaders can specify the encoding they used (if they know it). Of course, a deluxe version would be that the encoding is automatically detected and then displayed in the subtitle info, or that all subtitles are automatically converted to UTF-8 on upload, although I'm not sure whether this might be prone to errors.

jcdr
Posts: 540
Joined: Sun Apr 08, 2012 9:49 am

Re: Suggestion for all files to be UTF-8

Mon Jun 17, 2013 11:39 pm

subshare wrote:
"I suggest to add at least a new field in the uploading form where uploaders can specify the encoding they used (if they know it). Of course, a deluxe version would be that the encoding is automatically detected and then displayed at the subtitles info -- or: all subtitles are automatically converted to UTF-8 on upload, although I'm not sure if this might be prone to errors."

The thing is, the majority of uploads are automatic, from API clients. And I guess most users don't even know which encoding was used.

I have given some thought to this issue. Detecting the encoding would not be a problem for the majority of languages, but the fewer high-bit characters (code > 127) a file contains, the more difficult it is to guess the encoding used.

There are three main sets of 8-bit encodings still in common use: Windows (ANSI), Mac OS and ISO-8859.
I do hope that DOS OEM and IBM EBCDIC are dead, at least for subtitle encoding.

English, for instance, can be 8-bit encoded with CP1250, CP1252, ISO-8859-1, ISO-8859-3, ISO-8859-4, ISO-8859-9, ISO-8859-13, ISO-8859-15, MacRoman or MacCentralEurope, and some of these encodings differ by as little as a single character, e.g. '¤' vs. '€'. So for some subs, the best you can obtain is very similar probabilities for two or three encodings, without ever reaching 100% certainty.
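
To make the ambiguity concrete, here is a small Python sketch comparing two of these encodings byte by byte. The helper names `differing_bytes` and `distinguishable` are mine; the codecs are the ones shipped with Python.

```python
from typing import List

def differing_bytes(enc_a: str, enc_b: str) -> List[int]:
    """High-bit byte values that the two single-byte encodings decode differently."""
    result = []
    for value in range(0x80, 0x100):
        raw = bytes([value])
        if raw.decode(enc_a, errors="replace") != raw.decode(enc_b, errors="replace"):
            result.append(value)
    return result

def distinguishable(data: bytes, enc_a: str, enc_b: str) -> bool:
    """True only if the data contains a byte on which the two encodings disagree."""
    telltale = set(differing_bytes(enc_a, enc_b))
    return any(byte in telltale for byte in data)

print([hex(b) for b in differing_bytes("iso-8859-1", "iso-8859-15")])
print(bytes([0xA4]).decode("iso-8859-1"), bytes([0xA4]).decode("iso-8859-15"))
print(distinguishable(b"caf\xe9", "iso-8859-1", "iso-8859-15"))
```

A file that never uses one of the disagreeing bytes decodes to exactly the same text under both encodings, so no detector can tell them apart, but for the same reason either answer converts it to UTF-8 correctly.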

But this should represent a very, very small part of all subs. So I think oss's idea of first detecting and displaying the encoding is a good one, leaving the very few files whose encoding cannot be ascertained to be checked by admins.

oss
Site Admin
Posts: 5916
Joined: Sat Feb 25, 2006 11:26 pm
Contact: Website

Re: Suggestion for all files to be UTF-8

Wed Jun 19, 2013 9:51 am

It would be more than useful to have a PHP class that can detect the encoding. I didn't find anything useful.

peterjohn740
Posts: 1
Joined: Tue Jul 23, 2013 4:09 pm

Re: Suggestion for all files to be UTF-8

Tue Jul 23, 2013 6:34 pm

oss wrote:
"it would be more than useful to have some PHP CLASS, which can detect encoding. I didnt find anything useful."

Hi oss, you said it would be more than useful to have a PHP class, and in the next sentence you said you didn't find anything useful. What do you mean by these two sentences? It is not clear to me. Please clarify. Thanks.

oss
Site Admin
Posts: 5916
Joined: Sat Feb 25, 2006 11:26 pm
Contact: Website

Re: Suggestion for all files to be UTF-8

Tue Jul 23, 2013 7:23 pm

It means it would be useful to have a PHP class, but I didn't find any that does the job properly. There is one more attempt, from jcdr, to try. I hope to do it within 2 weeks.

hector
Posts: 370
Joined: Wed Jan 01, 2014 12:27 pm
Location: Spain

Re: Suggestion for all files to be UTF-8

Wed Jan 01, 2014 12:50 pm

Why not simply force the clients to send only UTF-8?

And forget at last the encoding nightmare.

SmallBrother
Site Admin
Posts: 3748
Joined: Sun Mar 04, 2012 12:59 pm
Location: Somewhere on this globe

Re: Suggestion for all files to be UTF-8

Thu Jan 02, 2014 10:03 am

hector wrote:
"Why not simply force the clients to send only UTF-8?

And forget at last the encoding nightmare."

It would be a sweet dream if everything were UTF-8, both for us and for the downloader/user.

But the problem is how to get there. Many people don't know what "encoding" is. Some subtitle software does not support UTF-8 (Subtitle Workshop, for example). Etc. I think allowing only UTF-8 uploads would dramatically decrease the number of uploads.