Problems with the addition of new lines in the srt files

Postby NomadaPT » Fri Aug 03, 2012 1:06 am

Recently OpenSubtitles started adding two new lines to downloaded subtitles, one at the beginning (Subtitles downloaded from http://www.OpenSubtitles.org) and another at the end (Download Movie Subtitles Searcher from http://www.OpenSubtitles.org), but I see some problems resulting from these (I suppose automatic) additions:

First, if a subtitle has a line with an early time stamp, in the first minute for instance, the added first line replaces the first line of the subtitle.

Second, if the srt file is encoded in Unicode (usually UTF-8), the edit changes the encoding to ANSI and all the characters outside the ASCII table are garbled. For instance, a Portuguese sentence like "Quarenta dias farão um furor tão grande no céu" becomes "Quarenta dias farÃ£o um furor tÃ£o grande no cÃ©u". The same happens with subtitles I've uploaded in Hebrew encoded in UTF-8; in this case you get plain gibberish ("אף אחד לא מלמד אותךמה ×–×” אומר להיות אימא טובה." instead of "אם זה האימא שהילד שלי היה רוצה שאני אהיה"). Of course, this last problem can be solved by reopening the srt file and changing the encoding back... but how many people know how to do that?
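A minimal sketch of what this kind of garbling looks like, assuming the file's UTF-8 bytes were re-read as Windows-1252 (the usual Western European "ANSI" code page). The helper names `garble` and `repair` are hypothetical, and nothing here reflects how the site actually inserts its lines.

```python
def garble(text: str) -> str:
    """Simulate UTF-8 bytes being misread as Windows-1252 ("ANSI")."""
    return text.encode("utf-8").decode("cp1252")


def repair(garbled: str) -> str:
    """Undo the misreading, as long as no bytes were lost along the way."""
    return garbled.encode("cp1252").decode("utf-8")


original = "Quarenta dias farão um furor tão grande no céu"
broken = garble(original)
print(broken)                      # each accented letter becomes two characters
print(repair(broken) == original)  # the round trip restores the sentence
```

The round trip only works while the garbled text still holds every original byte. Python's `cp1252` codec also rejects a handful of undefined byte values, so some scripts (Hebrew among them) cannot be pushed through this exact sketch.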
NomadaPT
 
Posts: 20
Joined: Mon Dec 22, 2008 3:12 am

Re: Problems with the addition of new lines in the srt files

Postby scooby007 » Sat Aug 04, 2012 11:31 pm

Interesting... The lines don't appear to logged-in users, but I can see how they'd be a problem for logged-out users downloading subtitles. I'll forward this topic to the "admin homepage" section where oss can have a look at it and he'll respond to you here. Thanks for the report.
scooby007
Site Admin
 
Posts: 248
Joined: Thu Mar 05, 2009 10:49 pm
Location: Scandalous

Re: Problems with the addition of new lines in the srt files

Postby Omerta » Sat Aug 04, 2012 11:38 pm

UTF-8 sucks in a hundred ways, by the way. ANSI has all the characters that UTF has, but their number in the character table is different.
You CAN save your UTF in ANSI if you want; the only thing you have to do is check the "replace UTF characters" button in your text editor.
Omerta
Moderator
 
Posts: 400
Joined: Mon Jul 09, 2007 10:10 am
Location: UK

Re: Problems with the addition of new lines in the srt files

Postby NomadaPT » Sun Aug 05, 2012 4:28 am

Omerta wrote:UTF-8 sucks in a hundred ways, by the way. ANSI has all the characters that UTF has, but their number in the character table is different.
You CAN save your UTF in ANSI if you want; the only thing you have to do is check the "replace UTF characters" button in your text editor.


Omerta, I totally disagree. To be honest, I don't see the point of using ANSI and constantly changing the code page of the character set when Unicode gives you one single, invariable code chart that covers almost all of the world's scripts. It doesn't matter what ANSI code page your O.S. uses, because you always get the same result. As an analogy, it's like opening a pdf file: it doesn't matter what system you have, because the formatting is always the same, whether in China or in Canada.

Scooby007, thank you very much for your attention to the subject.
NomadaPT
 
Posts: 20
Joined: Mon Dec 22, 2008 3:12 am

Re: Problems with the addition of new lines in the srt files

Postby Omerta » Sun Aug 05, 2012 8:26 am

NomadaPT wrote:
Omerta wrote:UTF-8 sucks in a hundred ways, by the way. ANSI has all the characters that UTF has, but their number in the character table is different.
You CAN save your UTF in ANSI if you want; the only thing you have to do is check the "replace UTF characters" button in your text editor.


Omerta, I totally disagree. To be honest, I don't see the point of using ANSI and constantly changing the code page of the character set when Unicode gives you one single, invariable code chart that covers almost all of the world's scripts. It doesn't matter what ANSI code page your O.S. uses, because you always get the same result. As an analogy, it's like opening a pdf file: it doesn't matter what system you have, because the formatting is always the same, whether in China or in Canada.

Scooby007, thank you very much for your attention to the subject.


No offense, Nomada, but that doesn't matter. What matters is how many standalone players and software programs support your character encoding.
And UTF doesn't have this support. You can make subtitles in UTF, but a big percentage of users won't be able to use them. Alright, I'm off.
Omerta
Moderator
 
Posts: 400
Joined: Mon Jul 09, 2007 10:10 am
Location: UK

Re: Problems with the addition of new lines in the srt files

Postby srtpal » Sun Aug 05, 2012 7:50 pm

Omerta wrote:ANSI has all the characters that UTF has, but their number in the character table is different.

I, too, disagree quite strongly. ANSI does not have all the characters of UTF-8. ANSI is limited to a single code page at a time (slightly less than 256 characters), so if the subtitles need more than one page (and mine usually do because I use ♫ to indicate to the hard-of-hearing people that the text is being sung), Unicode, whether 16-bit or 8-bit, is the only way to go.
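The single-code-page limit is easy to check with a short sketch. The function name `fits_code_page` is hypothetical; the codec names are Python's standard ones for the Windows code pages.

```python
def fits_code_page(text: str, code_page: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


line = "♫ La la la ♫"
for encoding in ("cp1252", "cp1250", "utf-8"):
    print(encoding, fits_code_page(line, encoding))
```

The music note is rejected by both code pages and accepted by UTF-8. A Czech name containing "č" fits cp1250 but not cp1252, which is the "more than one page" situation described above.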

Unfortunately, OS rejects 16-bit Unicode, so for me the only way to go is UTF-8. This is unfortunate because I use two players, VLC and TotalMedia Theatre 5. While VLC supports both 16-bit Unicode and UTF-8, TM Theatre 5 supports 16-bit Unicode but not UTF-8, which it treats as ANSI, displaying all the non-ASCII characters in the UTF-8 file as ANSI graphics. Additionally, I have not found a way to set the code page in either VLC or TM Theatre 5, and I am certainly not going to change my system code page every time I need to watch a movie with subtitles on. So I prepare all my subtitles in UTF-8 for upload to OS, then convert them to 16-bit Unicode for my personal use.
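For anyone who would rather script that last conversion than do it in an editor, here is a rough sketch. It assumes "16-bit Unicode" means little-endian UTF-16 with a byte order mark, which is what Windows editors usually write under that name; the function name is hypothetical.

```python
import codecs


def utf8_to_utf16(data: bytes) -> bytes:
    """Convert UTF-8 subtitle bytes to UTF-16 LE with a byte order mark."""
    text = data.decode("utf-8-sig")  # "utf-8-sig" also accepts a leading UTF-8 BOM
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


sample = "1\n00:00:01,000 --> 00:00:04,000\n♫ La la la ♫\n".encode("utf-8")
converted = utf8_to_utf16(sample)
print(len(sample), len(converted))
```

Reading the file in binary mode, passing the bytes through `utf8_to_utf16`, and writing the result in binary mode avoids any dependence on the system code page.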

There is a good reason why Unicode was invented and why the Internet standards have accepted UTF-8 as the one encoding all Internet protocols must support in this century (they used ISO/IEC 8859-1 as the default in the last century).

I edit all my subtitles in Notepad++ as a text file, as Notepad++ allows me to convert between any code page and UTF-8, then clean everything up by running it through my own srtpal before uploading it here.

By the way, it would be nice if OS displayed the code page for all subtitles, so people could easily convert them to whatever their system needs. Better yet, if it standardized on UTF-8 and rejected subtitles in any other code page. People would then know that everything they download is UTF-8 and, if they so desired, could then convert it to whatever else they want and need.
srtpal
 
Posts: 59
Joined: Sun Jun 21, 2009 5:28 pm

Re: Problems with the addition of new lines in the srt files

Postby Omerta » Sun Aug 05, 2012 8:00 pm

Still, average people use letters to read, not calligraphy.

On the other hand, the idea is not bad: let the users set or change the default text character encoding somehow.
Omerta
Moderator
 
Posts: 400
Joined: Mon Jul 09, 2007 10:10 am
Location: UK

Re: Problems with the addition of new lines in the srt files

Postby srtpal » Sun Aug 05, 2012 8:29 pm

Omerta wrote:Still, average people use letters to read, not calligraphy.

Believe it or not, average people of the world do not use the plain Roman alphabet as is used by English speakers. Considering mere numbers, the average person uses the Chinese script (which cannot be fit into a single code page) or Devanagari (the script of Indic languages) or some derivative of Devanagari. Some 25-30% of people in Europe use the Cyrillic alphabet and most of the rest of Europe uses a modified Roman alphabet (i.e., modified by the use of diacritics, but different languages use different diacritics). And in Greece they use an entirely different alphabet but have to use the Roman alphabet for certain names and such.

A long time ago, at university in Slovakia, I was watching an Arab student who was taking his notes in Arabic, writing from right to left, but wrote any names and any Latin words (this was an anatomy class) left to right in the Roman alphabet. So, pardon me if I cannot think of someone from the UK, or even the rest of Europe and the US, Canada and Australia, as an average person.

So, when I post English subtitles for a Czech movie (as seems to be the majority of my subs for some reason), I need to use the unmodified Roman alphabet for the English text and a modified Roman alphabet for the names of the characters. Now I cannot even start imagining the difficulty one would be faced with if he was making Chinese subtitles for a Slovak movie, or Hebrew subtitles for a German movie, etc.

I’d say the average person on this planet needs more than can be fit on a single code page.
srtpal
 
Posts: 59
Joined: Sun Jun 21, 2009 5:28 pm

Re: Problems with the addition of new lines in the srt files

Postby Omerta » Sun Aug 05, 2012 8:42 pm

Alright,
I didn't say anything to prove you wrong, did I?
I'm just saying that UTF is simply unusable in the common video players.

If UTF were the supported character table and ANSI the unsupported one,
I would say, hey, we use UTF, not ANSI, get it out of my face!

So I really thank you for your academic enlightenment, but I am not that dumb.

And again, I'm all for making character encodings interchangeable.
Omerta
Moderator
 
Posts: 400
Joined: Mon Jul 09, 2007 10:10 am
Location: UK

Re: Problems with the addition of new lines in the srt files

Postby eduo » Sun Aug 05, 2012 9:42 pm

I hadn't noticed that the Open Subtitles "ads" were changing the encoding of the subtitles, but it may explain some issues I'd found in some subs and some reports I used to get from SolEol users.

I assume something has been done, as the "ads" are now coming in English in Spanish subs. May be unrelated, but it seems like too much of a coincidence.

As for encodings themselves: Everything but UTF is plain evil. It's true that for a long time non-UTF (which is way more than ANSI covers) was so common that a lot of software still doesn't recognize UTF but, in reality, it's in everyone's best interest to see that the shift to UTF is done.

At the very least the subtitles should *always* be stored as UTF in OpenSubtitles and, if anything, an option upon download could convert them (converting UTF to other charsets is always easier than trying to figure out what charset non-UTF has and then converting to UTF). Without pushing for UTF this craziness we've endured for four decades will never stop.

That is, really, the real problem with non-UTF: There's nothing in the file that says what the encoding is. So software can't figure out easily what to use (the best character encoding "guesser" there is has been released by Mozilla and even that fails a LOT)(*).
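A rough sketch of the usual workaround, assuming the only thing known about a file is its bytes: try strict UTF-8 first, because text in a legacy code page almost never happens to be valid UTF-8, and only then fall back to a guessed code page. The function name and the `cp1252` default are assumptions for illustration, not anything OpenSubtitles or Mozilla does.

```python
from typing import Tuple


def decode_subtitle(data: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used) for raw subtitle bytes."""
    try:
        # Strict decoding fails loudly on byte sequences that are not UTF-8.
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # The fallback is only a guess; nothing in the file confirms it.
        return data.decode(fallback, errors="replace"), fallback


print(decode_subtitle("céu".encode("utf-8")))
print(decode_subtitle("céu".encode("cp1252")))
```

The second branch is exactly where the guessing problem lives: the sketch can tell "not UTF-8" reliably, but it cannot tell cp1250 from cp1252 without outside knowledge such as the subtitle's language.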

Developers whose players don't support UTF should be berated and mocked publicly. UTF is almost 30 years old already, for Christ's sake ( http://unicode.org/history/ ). Having to convert to ANSI should be taken as the retrograde, embarrassing, unnecessary step that it is.

This is an excellent resource explaining both the history of character set encodings and codepages, as well as UTF. Great summary.
http://www.joelonsoftware.com/articles/Unicode.html

(*)Here you can read a fantastic paper about Mozilla's character encoding guessing which, ironically, has character set errors: http://www-archive.mozilla.org/projects ... ction.html
eduo
Moderator
 
Posts: 668
Joined: Sat Feb 10, 2007 1:40 am
Location: Information Technology

Re: Problems with the addition of new lines in the srt files

Postby eduo » Sun Aug 05, 2012 9:49 pm

scooby007 wrote:Interesting... The lines don't appear to logged-in users, but I can see how they'd be a problem for logged-out users downloading subtitles. I'll forward this topic to the "admin homepage" section where oss can have a look at it and he'll respond to you here. Thanks for the report.


I just noticed your mention that it doesn't appear for logged-in users. For API users the lines show for all, regardless of authentication. I imagine that it's a way to compensate for not using the web but thought the mention was worth it.

For a while one of the two links provided in the API (a ZIP and a GZip) didn't include "ads" (I believe it was the GZip) but I can't recall when this was changed.
eduo
Moderator
 
Posts: 668
Joined: Sat Feb 10, 2007 1:40 am
Location: Information Technology

Re: Problems with the addition of new lines in the srt files

Postby NomadaPT » Mon Aug 06, 2012 1:30 am

Being a Western European, I know it is hard for some people to understand that the Latin alphabet, despite being spread worldwide, is not used worldwide. The ANSI code page 1252 used by the Western European languages doesn't even cover the majority of the European languages that themselves use the Latin alphabet: drawing a line east of the German and Italian languages, with ANSI you need to switch the code page three more times in Europe (ANSI 1250 for Central and East European Latin, 1254 for Turkish and 1257 for Lithuanian and Latvian).
Eduo is right, everything but Unicode is the past. Languages like English don't use any kind of special characters or diacritics, and the remaining Western European languages use a short range ALL covered by a single code page (1252), but that is the exception, not the rule.
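To make the code page switching concrete, this small sketch decodes one and the same byte under the four code pages mentioned above. The byte value 0xF0 is an arbitrary choice for illustration and the function name is hypothetical.

```python
from typing import Dict, Iterable


def decode_in_code_pages(raw: bytes, code_pages: Iterable[str]) -> Dict[str, str]:
    """Decode the same bytes under several code pages and collect the results."""
    return {cp: raw.decode(cp) for cp in code_pages}


results = decode_in_code_pages(bytes([0xF0]), ("cp1250", "cp1252", "cp1254", "cp1257"))
for code_page, character in results.items():
    print(code_page, character)
```

Each code page turns the byte into a different letter, so a file saved under one of them shows the wrong characters on a system set to another. In UTF-8 each of those letters has its own byte sequence, so the question never comes up.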

Thank you for your opinions, guys. I had started to feel like a moron for using Unicode.
NomadaPT
 
Posts: 20
Joined: Mon Dec 22, 2008 3:12 am

Re: Problems with the addition of new lines in the srt files

Postby srtpal » Mon Aug 06, 2012 2:01 am

Omerta wrote:I didn't say anything to prove you wrong, did I?

Yes, you did. Your remark about average people, as if only the 6% of humans who understand English were people worth considering, was smug to say the least.

Omerta wrote:So I really thank you for your academic enlightenment

Except there is nothing academic about the 94% of the world that the American standard (ANSI = American National Standards Institute) does not help. Unicode, and UTF-8 in particular, has been the standard way of encoding text since the beginning of this century. If your player does not support it, inform its manufacturer you are not willing to pay for something as outdated as their software (or hardware).
srtpal
 
Posts: 59
Joined: Sun Jun 21, 2009 5:28 pm

Re: Problems with the addition of new lines in the srt files

Postby oss » Mon Aug 06, 2012 5:57 am

Thanks for the posts, nice. Anyway, for testing I need the URLs of subtitles where this mess appears, or where ads are inserted in the "wrong" places (replacing subtitle contents, for example). Of course, the ads are inserted automatically.
oss
Site Admin
 
Posts: 2208
Joined: Sat Feb 25, 2006 11:26 pm

Re: Problems with the addition of new lines in the srt files

Postby NomadaPT » Mon Aug 06, 2012 10:39 am

Thank you oss.

This subtitle in particular brought the problem to my attention: http://www.opensubtitles.org/en/subtitl ... lt-love-pt

The first lines in the original file are:
Code:
1
00:00:41,766 --> 00:00:46,440
ואהבת
E DEVERÁS AMAR

2
00:01:48,480 --> 00:01:52,520
"Salmos de Recuperação Espiritual"

3
00:02:08,320 --> 00:02:12,040
Um salmo de David.
"Bem-aventurado é aquele que atende ao pobre,


and in a downloaded subtitle altered by the ads:

Code:
1
00:00:01,000 --> 00:00:04,074
Subtitles downloaded from www.OpenSubtitles.org

2
00:01:48,480 --> 00:01:52,520
"Salmos de Recuperação Espiritual"

3
00:02:08,320 --> 00:02:12,040
Um salmo de David.
"Bem-aventurado é aquele que atende ao pobre,


Apparently the encoding issue is already solved, thank you again for that.
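For the replaced first line, here is a sketch of how the notice could be added as an extra cue instead, with the remaining cues renumbered. The ad timing and text are copied from the altered file above; the function names, the decision to skip the ad when it would overlap the first real cue, and the assumption of "\n" line endings are mine and say nothing about how the site really does it.

```python
import re

AD_TIMING = "00:00:01,000 --> 00:00:04,074"
AD_TEXT = "Subtitles downloaded from www.OpenSubtitles.org"


def to_ms(stamp: str) -> int:
    """Convert an SRT time stamp such as 00:00:41,766 to milliseconds."""
    hours, minutes, rest = stamp.split(":")
    seconds, millis = rest.split(",")
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def prepend_ad(srt: str) -> str:
    """Add the ad as a new first cue without dropping any existing cue."""
    blocks = [b for b in re.split(r"\n\s*\n", srt.strip()) if b.strip()]
    # Drop each old index line; keep the timing line and the text lines.
    cues = [block.split("\n")[1:] for block in blocks]
    ad_end = to_ms(AD_TIMING.split(" --> ")[1])
    # Insert the ad only if the first real cue starts after the ad has ended.
    if not cues or to_ms(cues[0][0].split(" --> ")[0]) >= ad_end:
        cues.insert(0, [AD_TIMING, AD_TEXT])
    # Renumber every cue from 1.
    return "\n\n".join("\n".join([str(i)] + cue) for i, cue in enumerate(cues, 1)) + "\n"


sample = (
    "1\n00:00:41,766 --> 00:00:46,440\nE DEVERÁS AMAR\n\n"
    "2\n00:01:48,480 --> 00:01:52,520\n\"Salmos de Recuperação Espiritual\"\n"
)
print(prepend_ad(sample))
```

With the two-cue sample, the result has three cues: the ad as number 1 and the original lines shifted to 2 and 3, so nothing is lost.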
NomadaPT
 
Posts: 20
Joined: Mon Dec 22, 2008 3:12 am
