guidelines for subs

Sun Jul 26, 2015 12:15 pm

The question is when they should be grouped and when not. What is the minimum reasonable duration for a line? Could the break be chosen algorithmically? There are clear cases but computers are not very good with fuzzy logic. You must set some fixed limit. What could this be? One must also take into account the time between the utterances.

Wow, dangerous stuff. It will be very difficult to write an algorithm as smart as the human brain.

First of all some mathematical generalities.
The minimum time for one line? Some say 0,7s is the minimum, but personally I think that's way too short, I try to avoid anything shorter than 1,2s and preferably even 1,4s. Then the time between two time units. Some say if it is less than 1 second, the two lines should (could) be merged. If longer, they should stay separate. I more or less agree about this.

But here comes the trouble.

Nice synchronisation involves good spotting, i.e. the moment at which the subtitle starts (in-cue) and the moment at which it finishes (out-cue). The out-cue is definitely not before the person speaking finishes speaking. Generally, it is at least 'a few hundred' milliseconds after. But it is also more 'calm' to do an out-cue together with a camera change. For that reason you could choose to do the out-cue later, again by up to 'a few hundred' ms. Altogether, let's say there could easily be 600ms of silence after the first spoken text.

Let's say the next subtitle line starts 950ms later, and is thus a candidate for merging. But here too there is an in-cue, typically approx. 150ms before speech starts. And if there is a camera change 100ms earlier, I would choose that moment for the in-cue.

So now we have a situation where the two lines COULD be merged, according to the numbers I gave first, and your software would merge. But altogether there is an actual silence of 600 + 950 + 150 + 100 = 1800ms, i.e. 1,8 seconds. As a human, I would definitely NOT merge. Even more so if the second line contains a surprise or a clue, or there is any other kind of 'tension'. Only by seeing the video, hearing the sound and understanding the text and context is it possible to make the best choice.
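To make the arithmetic concrete, here is a minimal Python sketch. It only illustrates the numbers from this example: the 1-second merge threshold is the rule of thumb mentioned earlier, and the function and parameter names are made up for the illustration, not taken from any existing tool.

```python
# All values are in milliseconds. The numbers come from the example in this
# post; the names and the fixed threshold are illustrative assumptions.
MERGE_GAP_MS = 1000  # "if it is less than 1 second, the two lines could be merged"


def naive_merge_candidate(cue_gap_ms: int) -> bool:
    """What a purely timing-based tool would decide from the cue gap alone."""
    return cue_gap_ms < MERGE_GAP_MS


def estimated_silence_ms(cue_gap_ms: int, out_cue_pad_ms: int,
                         in_cue_lead_ms: int, camera_shift_ms: int = 0) -> int:
    """Estimate the real pause in speech by adding back the spotting offsets."""
    return out_cue_pad_ms + cue_gap_ms + in_cue_lead_ms + camera_shift_ms


gap = 950
print(naive_merge_candidate(gap))                # True
print(estimated_silence_ms(gap, 600, 150, 100))  # 1800
```

The cue gap alone says "merge", while the estimated pause in speech is almost twice the threshold. The spotting offsets are exactly the information that is not stored in the subtitle file.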

Look at this (not very beautiful) original and think what bad stuff could happen when merging:

#242: But whatever it was...
#243: was his end.
#244: He never came back.

Another situation is where the second line is spoken by another person. The software cannot know, because the dialog marking (the "-") is not present in single lines. Only a human seeing the video file can know. A merge of those two lines would not have the dialog marking, and would thus be bad practice.

In addition to dialogs, maybe the previous time unit was a merge of two lines of text:
Person A: Bla bla bla.
Person B: Neh neh neh.
But the next two lines, the spoken text could be the other way around, first spoken by person B, next by person A. If you now merge these lines, you will get the ABBA effect:

Person A: Bla bla bla.
Person B: Neh neh neh.

Person B: Some more text.
Person A: And some more.

This effect is less 'calm' for the reader and should be avoided.

Also, preferably dialogs in one single time unit have a question and an answer, or (invisible for software which doesn't understand the text) a statement and a reaction. Maybe your software will merge a reaction to a previous statement together with a starting question, or even something completely different after a change of scene. This could be the (less beautiful) result:

Person A: I hope we are not too late, that would be a disaster.

Person B: Don't worry, we still have one hour.
Person C, after scene change: I love you.

Person D: I love you too.

I think these examples show pretty well how automated merging could be a disaster. I would really recommend NOT doing it. In fact I have also seen humans do 'automated' merging, just because it is possible, but without really understanding what they are doing. It cripples the subtitles and is a big frustration for the subtitler who spent many hours on nice spotting. Worse is that it cannot be easily undone afterwards.

Maybe the only safe situation for merging automatically would be if the time between two time units is shorter than 100ms AND the second line is a continuation of an unfinished phrase AND both lines are short enough in time AND not too long in number of characters:

The anti-bomb squad will be here...

within half an hour.

These can be merged (but should not exist in good subtitles). The three dots should be erased:

The anti-bomb squad will be here
within half an hour.
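As a rough sketch, those four conditions could be written like this in Python. Only the 100ms gap comes from this post. The `Cue` class, the duration and length limits, and the "ends with three dots, continues in lower case" test are assumptions made for the illustration, and that last test is only a crude stand-in for actually understanding the text.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start_ms: int
    end_ms: int
    text: str


def merge_text(a: Cue, b: Cue) -> str:
    """Join two cues as two lines, erasing the trailing three dots of the first."""
    first = a.text.rstrip()
    if first.endswith("..."):
        first = first[:-3].rstrip()
    return first + "\n" + b.text.strip()


def is_safe_merge(a: Cue, b: Cue, max_gap_ms: int = 100,
                  max_cue_ms: int = 3000, max_chars: int = 80) -> bool:
    """Very conservative check; max_cue_ms and max_chars are placeholder limits."""
    gap_ok = 0 <= b.start_ms - a.end_ms < max_gap_ms
    # Crude continuation heuristic: unfinished phrase, next line not capitalised.
    continuation = a.text.rstrip().endswith("...") and b.text.lstrip()[:1].islower()
    short_enough = all(c.end_ms - c.start_ms <= max_cue_ms for c in (a, b))
    length_ok = len(merge_text(a, b).replace("\n", " ")) <= max_chars
    return gap_ok and continuation and short_enough and length_ok


a = Cue(10_000, 11_500, "The anti-bomb squad will be here...")
b = Cue(11_560, 12_800, "within half an hour.")
if is_safe_merge(a, b):
    print(merge_text(a, b))
```

Note that lines #242 and #243 from the earlier example would pass the continuation test too, so without the tight gap limit this check would happily merge them.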

I use italics for 'out of view', spoken text on tv, incidental foreign or alien words, or not at all. It involves a risk, but so be it, I think the risk is small enough to pay for the added value.
So, you are admitting formatting and styles are useful

Yes, I confess

But note my nuance. An italic here and there is something else than color, font, position, and god-knows-what.

Size doesn't matter.

If a video file is a couple of GB, who cares about subs being 100 or 200 kB.
But yeah, I like small. I don't like these video files of 6GB, where good software with smart settings could achieve (practically) the same quality with less than 1GB. Only if size serves a good purpose and it is in proportion, it may be larger. Same for speed, btw. Otherwise it's silly or even obnoxious, and only a reason to replace our computer every couple of years. Etc.

hector · Sun Jul 26, 2015 3:09 pm

Wow, I think I'll need some time to digest this.

But I think I get the idea. My question is: why are we treating this case specially? I mean, when you join two different characters in one unit. As you say, merging those two lines should have the dialog marking. So, in some way it is a different case than only one character in one unit. This is because when you write the SRT you are adding some personal work. You are adding some information. You are breaking the audio track into pieces (you decide where) and breaking those pieces into lines (and you also decide where). And when you decide to join two utterances by different characters you are discarding information. You leave out the out-cue of first utterance and the in-cue of the second one.

I think we could speak of "raw" timings and "cooked" timings. In the same way we could speak of "raw" lines and "cooked" lines. We could define "raw timing" as the point at which the sound starts and "cooked timing" the point that you, as author or synchroniser set based on all those aspects you are talking about. Perhaps the former one could be easily detected by a computer. The latter one is more complex.

But being picky, even those "raw timings" are problematic because you could sometimes set them for every word, especially when the speech is slow. So, when you set the in-cue and out-cue you are somehow already adding some information of your own. There is nothing keeping you from setting them for every word except common sense. Yeah, but computers don't have common sense.

My idea is to let the computer "cook" as much as possible. Yes, I know it is hard to admit, but computers are almost always better than humans. I see it is more complex than I thought but I think it can be done. The same for line breaking. Perhaps the main problem is semantic content. Here we enter the field of artificial intelligence. Moving sands to me. I don't think a computer can detect whether an utterance is a reaction to the last one or not. At least, mine cannot. :-)

Scene change or camera change is something I don't pay much attention to though I know it could make a difference

I can only say that synchronising and utterance merging isn't as simple as it looked at first glance.

hector · Mon Jul 27, 2015 2:04 pm

Let's see something more tangible. Sometimes my head goes a little wild :-)

Suppose you have one utterance with duration 1.6 seconds and then a reaction to this. For example:
00:00:01,000 --> 00:00:02,600
I'm gonna get all those awful dresses
and make a big bonfire...

00:00:02,700 --> 00:00:03,100
Don't!
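A quick sketch of how the two durations above could be checked, in Python. The regular expression covers only the usual `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, the names are invented for the example, and the 700ms limit is simply the lower bound mentioned earlier in this thread.

```python
import re
from typing import Tuple

# Matches a standard SRT timing line; anything else is rejected.
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

MIN_DURATION_MS = 700  # assumption: the lower bound quoted in this thread


def parse_timing(line: str) -> Tuple[int, int]:
    """Return (start_ms, end_ms) for one SRT timing line."""
    m = TIMING.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"not a valid SRT timing line: {line!r}")
    h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
    start = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
    end = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
    return start, end


def duration_ms(line: str) -> int:
    start, end = parse_timing(line)
    return end - start


first = duration_ms("00:00:01,000 --> 00:00:02,600")
second = duration_ms("00:00:02,700 --> 00:00:03,100")
print(first, second)             # 1600 400
print(second < MIN_DURATION_MS)  # True
```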

Evidently the second is too short (taking your 0.7 seconds as guidance). It's a good candidate to be merged. But the first one is long and already takes two lines. What then? Sometimes I've seen this:
00:00:01,000 --> 00:00:03,100
I'm gonna get all those awful dresses
and make a big bonfire... -Don't!

I guess this is better than:
00:00:01,000 --> 00:00:03,100
-I'm gonna get all those awful dresses
and make a big bonfire...
-Don't!

or even:
00:00:01,000 --> 00:00:03,100
-I'm gonna get all those awful dresses and make a big bonfire...
-Don't!

Now, let's take the same example but this time the second is not a reaction. Let's suppose that somebody shouts behind the main scene. Then you can't call it a "dialogue". They are two unrelated utterances:
00:00:01,000 --> 00:00:03,100
I'm gonna get all those awful dresses
and make a big bonfire... -Wait!

What then?

Mon Jul 27, 2015 3:08 pm

Yes, I know it is hard to admit, but computers are almost always better than humans.

-cough- -cough-
Yes, it is hard to admit, but mostly because I think it is not true. Computers may be faster. They may be able to combine a huge and complex series of considerations without getting nervous. But any software must be written by a human, who knows what needs to be done and then tells the computer how it must be done. All software has bugs - which is btw not the fault of the computer, on the contrary. The very maximum a computer could achieve is therefore what a human could achieve. Not more, and most probably less.

I have seen humans acting like computers, changing subtitles I mean

It shows that it takes more than following some rules. For making good subtitles you MUST have the video file with it.

And that's why I think the cooking should be left to a chef.

Scene change or camera change is something I don't pay much attention to though I know it could make a difference

Not sure what you mean. You mean you didn't know about it, or you think it's not important?
Anyway, a very general but very good guideline for making subs is that they should be kind of unnoticed, invisible. If they are distracting for whatever reason, something is not good.

Suppose you have one utterance with duration 1.6 seconds and then a reaction to this. For example:
00:00:01,000 --> 00:00:02,600
I'm gonna get all those awful dresses
and make a big bonfire...

00:00:02,700 --> 00:00:03,100
Don't!

First of all, the first time unit is 1,6s for 60 characters (excluding the three dots). That is 38 characters per second (CPS) and WAY too much. A very comfortable speed is 20 CPS and personally I try to avoid going over 24 CPS. Depends a bit on the words used. Many short words are more difficult than a few long words. But 38 CPS is impossible to read. So here is already something wrong and this should be fixed.
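For anyone who wants to check such numbers, here is a small Python sketch of the calculation. The counting convention (a line break counts as one space, the trailing three dots are excluded) is the one used above; the function names and the use of 24 CPS as a target are just for the illustration.

```python
def readable_length(text: str) -> int:
    """Character count with the line break as one space and no trailing dots."""
    stripped = text.strip()
    if stripped.endswith("..."):
        stripped = stripped[:-3].rstrip()
    return len(stripped.replace("\n", " "))


def chars_per_second(text: str, duration_ms: int) -> float:
    return readable_length(text) * 1000 / duration_ms


def min_duration_ms(text: str, target_cps: float) -> float:
    """Shortest display time that keeps the text at or below target_cps."""
    return readable_length(text) * 1000 / target_cps


text = "I'm gonna get all those awful dresses\nand make a big bonfire..."
print(readable_length(text))         # 60
print(chars_per_second(text, 1600))  # 37.5
print(min_duration_ms(text, 24))     # 2500.0
```

So at 24 CPS this text would need about 2,5 seconds on screen instead of 1,6.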

Anyway, the few solutions you give, I think they are all not very elegant. If there is really no other way, I think the least bad is this:
I'm gonna get all those awful dresses
and make a big bonfire... -Don't!

But maybe there are other (better) solutions:
1. If there is time left after "Don't!", just add some time and do not merge.
2. If that is not available, compress the first line, for example to
"I'll get all those dresses and make a bonfire"
That is 45 chars, could be on one line, and then merging with "Don't" is possible.
3. Or maybe the "Don't" can be merged with the following line.
4. Maybe just don't merge and leave it as it is.
5. Do not subtitle "Don't" at all. Maybe the meaning can be understood from the sound, the face and body language.

In difficult situations like this, the subtitler will have to be creative and choose what is the best...
Good luck with that algorithm

Now, let's take the same example but this time the second is not a reaction. Let's suppose that somebody shouts behind the main scene. Then you can't call it a "dialogue".
What then?

Besides what I wrote before, I think 'subtitle-dialogs' don't NEED to be real dialogs. Also if they are not, the two lines can be merged and then it just shows that two different people are saying something. But in this case it is just another (although a bit small) reason to not merge and find another solution. But if this happens a lot, the subtitler can choose to merge, to avoid too many subsequent 'flashes', a stroboscope effect.

hector · Wed Jul 29, 2015 1:07 pm

First of all, the first time unit is 1,6s for 60 characters (excluding the three dots). That is 38 characters per second (CPS) and WAY too much. A very comfortable speed is 20 CPS and personally I try to avoid going over 24 CPS. Depends a bit on the words used. Many short words are more difficult than a few long words. But 38 CPS is impossible to read. So here is already something wrong and this should be fixed.

Yes, I'm beginning to be aware of CPS, another thing I wasn't paying much attention to.
But this is something I want to discuss. Because it's not my fault. I mean, sometimes people speak slowly and sometimes hectically, and I try to write literally what is being said. Nothing more, nothing less. Well, let's say I'm relaxing this rule a bit, because at first I wrote everything. Things like:

I think that - that - that
you are ve - very beautiful

Well... huh... I... I don't know

Now I'm beginning to think it's too much. Because it is hard to read and all those "huh", "well", "you know" and "..." add no information. Perhaps you can see that somebody is hesitating or nervous but this you can get from context and intonation.

Why do it this way? Because I often use subtitles to learn the language and I think this is better achieved with this style.

And this leads me to an idea that I had: instead of the useless (my personal opinion) HD flag, could we have a "literal" or "close-to-speech" flag? This would be useful only for transcriptions (not translations). So you could choose.
viewtopic.php?f=1&t=15083

hector · Thu Sep 03, 2015 1:29 pm

I was wondering about something...

What if you are translating something and find an expression that needs an explanation? It is not so uncommon. Sometimes you need to know something about the culture to understand a phrase. It happens frequently with jokes. How do you translate a joke based, for example, on similar words? Those similar words are not similar at all in the other language.

In a book it's easy. You write T.N. (Translator's Note), but you can't put a translator's note which might take two or three lines in an SRT, can you?

Thu Sep 03, 2015 1:54 pm

My opinion is that such a translator's note should not be in a subtitle. It would take a lot more text (and time) which is probably not available. And if it is, explaining would distract too much from the movie. Bottom-line is that subtitles should be easy and calm. It's inevitable that sometimes things get lost.

Translating jokes or plays on words is often pretty difficult, if not impossible. Then I would choose to just translate the meaning, the essence of what is being said. It's a pity that the joke has evaporated, but hey, life is tough. Sometimes a dialog doesn't make any sense anymore. Then you have to be very creative, and careful not to become silly.

For expressions you can often find an equivalent. I would try to find that equivalent, rather than just translating the actual meaning.

Just my opinion...

hector · Mon Dec 07, 2015 1:07 am

Well, I think I'm dealing with matters of a higher level now than whitespace and characters.

I was using some subtitle and I found things like (I quote):
you *can't* know
it will *not* attend
she was hangin' up
Um-hmm. Just hope she
Try eatin' somethin'
Ha, ha, ha, ha. That was a good one.
Hey, man. How ya doin'?
and I coulda stopped

apart from some 90-character lines but that's another story.

I want to focus on how you transcribe. We shouldn't forget we are putting spoken language into text. The "*not*" thing is a way to say this word is emphasised, but I think it just makes reading difficult.

I think the problem here is that sometimes we mix two different "styles": the regular subtitles and those for the hearing-impaired. If you are doing "regular" subtitles you don't need to mark some words like "not" or "can't" because you get it from the audio. In the same way you don't have to write Um-hmm (meaning "yes") because you hear it and you don't have to write "ha, ha, ha" because you can hear he is laughing. So my conclusion is that this is too much. You get this information from intonation and sound and it makes reading difficult.

About things like "how ya doin'" or "coulda stopped", I think it is good to have what I called "literal" subtitles. That is, close to the real audio. But this is too much. What do you think? I woulda ;-) written "How're you doing?" and "could have". This is language specific but the same thing happens in Spanish and French with other words and constructs.

If some day we get the "literal" flag, we should talk about where the line between literal and non-literal lies, but the line separating "regular" from "hearing-impaired" is clear, or should be.

Tue Dec 08, 2015 5:21 pm

I would like that "literal" flag. The 'subs' wouldn't be subs anymore, probably violating one subtitle guideline after the other, but could be a valuable base for translations.

I think I agree with you. Everything (well, in that last post).

- Literal (nor HI) should not be phonetic, that's something else (and yes, less readable). "Yeah" and "We're going home" and maybe "We're gonna have a party", but that would be the max for me. No ya, doin', kinda.
In Dutch some phonetic spellings would be okay, but IMHO only when it's some kind of fixed expression. "Die goeie ouwe tijd" (literally: That good old time) is spelled wrongly, but spelling it correctly (Goede oude tijd) would sound/look a bit silly. I think.
Maybe it also depends on how often and how extreme. I think "How ya doin', ma ol' nigga?" is not English anymore.

- Don't mix HI and literal. Ironically, subtitling "ha ha ha" when someone is laughing is a bit funny. For HI, maybe better use [John laughing]. For regular subtitles I would just ignore "Um-hmm". For HI subs maybe replace it by "yes".

- Don't emphasise with * or similar characters. Don't emphasise in literal text, or maybe use italics (I guess preferably in HI subs).
For me, italics is the 'maximum' markup I will use in an SRT.

hector · Thu Dec 10, 2015 12:48 pm

Got it. Thanks :-)

Dimsokol · Wed Jul 13, 2016 12:35 pm

My idea is to let the computer "cook" as much as possible. Yes, I know it is hard to admit, but computers are almost always better than humans. I see it is more complex than I thought but I think it can be done. The same for line breaking. Perhaps the main problem is semantic content. Here we enter the field of artificial intelligence. Moving sands to me. I don't think a computer can detect whether an utterance is a reaction to the last one or not. At least, mine cannot. :-)

I think you are wrong. Computers can! Neural networks have made a great leap in natural language processing in recent years. For example, on oodles the AI detects the emotion of comments. YouTube can generate subtitles automatically, based on neural networks.

Aaron_0105 · Wed Aug 09, 2017 11:57 am

Another guideline: when people make subtitles for movies, they should also include the lyrics to the songs that are playing, and not just put up the title of the song and who it was sung by. The last 2 films, say, that I've looked for have had the "hearing impaired" icon, but the songs aren't there. Most of the time the lyrics are just as important as the dialogue.
