In the topic "Problems with the addition of new lines in the srt files", we had a heated but interesting debate on encodings, and on whether sub files should preferably be UTF-8 or ANSI.
I often stumble upon a sub file whose encoding is very difficult to identify. For instance, an 'Elling' movie file has "Let's go to the caf¨¦ around the corner." in it, and I had to try many encodings before finding that it is Chinese GB18030! GB18030 is the PRC national standard, as powerful as UTF-8, and its use is compulsory in the PRC for all documents, for reasons of compatibility with former Chinese standards.
Other examples, for other languages:
http://www.opensubtitles.org/en/subtitles/1626/ is an English file encoded in Japanese CP932,
http://www.opensubtitles.org/fr/subtitles/135003/ are English files encoded in Korean CP949,
http://www.opensubtitles.org/en/subtitles/4728198/ are English files encoded in Chinese GB18030,
http://www.opensubtitles.org/en/subtitles/3152794/ is an Indonesian file encoded in Chinese GB18030,
etc. etc.
The rule is that if the file originates from a DVD or Blu-ray copy made in Asia (and there are many!), it is likely to be encoded with an Asian encoding, even if the text is in a Western language.
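The 'Elling' line can be reproduced in a few lines of Python, which also shows the trial-and-error I went through. This is only a sketch: the `recover` helper and its candidate list are my own illustration, and it assumes the garbled text was produced by reading the bytes as Windows-1252.

```python
# "café" saved as GB18030, then wrongly read as Windows-1252.
original = "Let's go to the café around the corner."
garbled = original.encode("gb18030").decode("cp1252")
print(garbled)  # Let's go to the caf¨¦ around the corner.

# Hypothetical candidate list: the Asian encodings mentioned in this post.
CANDIDATES = ["cp932", "cp949", "gb18030"]

def recover(garbled_text, wrong="cp1252", candidates=CANDIDATES):
    """Undo the wrong decoding, then try each candidate encoding in turn."""
    raw = garbled_text.encode(wrong)  # back to the bytes stored in the file
    results = {}
    for enc in candidates:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
    return results

for enc, text in recover(garbled).items():
    print(enc, "->", text)
```

Several candidates may decode without error, so a human (or a language model of the text) still has to pick the one that reads correctly; here only GB18030 gives back "café".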
The second point is that some encodings become obsolete fast. Ten years ago, DOS or OEM encodings were still widely used, but who uses them nowadays?
No doubt nobody will use ISO-8859, Windows (ANSI) or Mac OS encodings in ten years' time, but the subtitle files will probably still be on opensubtitles.org, with users having to guess which f***g vintage encoding was used.
The third point is that users do not follow rules: for instance, a Croatian sub should be encoded in CP1250 rather than CP1252, and we should not have to read "kreme brűlée" instead of "kreme brûlée" in http://www.opensubtitles.org/fr/subtitles/22451/.
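The "brűlée" effect is easy to reproduce. The sketch below assumes the file was saved as CP1252 and is then displayed as CP1250 because it is tagged as Croatian; the variable names are just for illustration.

```python
text = "kreme brûlée"
raw = text.encode("cp1252")   # how the uploader's editor saved the file
shown = raw.decode("cp1250")  # how it is read when CP1250 is assumed
print(shown)  # kreme brűlée
```

Byte 0xFB is "û" in CP1252 but "ű" in CP1250, while "é" (0xE9) is the same in both, which is why only one letter comes out wrong.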
The fourth point is that users have players compatible only with certain encodings, and as the encodings are not advertised, users have to download files until they find one in a suitable encoding (cf. srtpal, who uses TotalMedia Theatre 5, which is compatible with UTF-16).
The fifth point is that having all files in UTF-8 would allow:
- correct display in the preview window, without � characters;
- much better script coding for spotting similar files;
- automatic language detection;
- detection of OCR or spelling errors;
- file corrections shared between users;
- etc.
In brief, it would allow a much better quality of the files and website.
The idea is that users could select their preferred encoding(s) in their profile, and download the files in the encoding they wish.
Indeed, users would not have to upload their files in UTF-8. Rather, the site would detect the encoding (which is possible in 99.5% of cases) and offer the user a choice for the remaining 0.5%.
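To make the proposal concrete, here is a rough sketch of the upload and download steps. Everything in it is an assumption for illustration: the function names, the `FALLBACKS` list and the "ask the uploader" convention are mine, not how the site works. It only uses strict decoding to rule encodings out; reaching the 99.5% figure would need a proper statistical detector on top of this.

```python
from typing import List, Optional, Tuple

# Assumed candidate list; a real site would tune it per declared language.
FALLBACKS = ["gb18030", "cp932", "cp949", "cp1250", "cp1252"]

def detect(raw: bytes) -> List[str]:
    """Return the encodings that can decode `raw` without error."""
    if raw.startswith(b"\xef\xbb\xbf"):
        return ["utf-8-sig"]            # UTF-8 with a byte order mark
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        return ["utf-16"]               # UTF-16 with a byte order mark
    try:
        raw.decode("utf-8")
        return ["utf-8"]                # valid UTF-8 (includes plain ASCII)
    except UnicodeDecodeError:
        pass
    plausible = []
    for enc in FALLBACKS:
        try:
            raw.decode(enc)
            plausible.append(enc)
        except UnicodeDecodeError:
            pass
    return plausible

def to_utf8(raw: bytes, chosen: Optional[str] = None) -> Tuple[Optional[bytes], List[str]]:
    """Upload step: store as UTF-8, or return the choices to show the uploader."""
    candidates = [chosen] if chosen else detect(raw)
    if len(candidates) != 1:
        return None, candidates         # ambiguous: ask the uploader
    return raw.decode(candidates[0]).encode("utf-8"), candidates

def for_download(stored_utf8: bytes, preferred: str) -> bytes:
    """Download step: transcode to the encoding set in the user's profile."""
    # Characters missing from the target encoding become "?".
    return stored_utf8.decode("utf-8").encode(preferred, errors="replace")
```

For example, `to_utf8("café".encode("gb18030"))` returns `None` with several candidates, because those bytes are also valid in other encodings; calling it again with `chosen="gb18030"` stores the file as UTF-8, and `for_download(..., "cp1252")` then serves it to a user whose player only handles Windows-1252.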
What are your thoughts for/against this?