Wow, dangerous stuff. It will be very difficult to write an algorithm as smart as the human brain. The question is when lines should be grouped and when not. What is the minimum reasonable duration for a line? Could the break be chosen algorithmically? There are clear cases, but computers are not very good with fuzzy logic. You must set some fixed limit. What could this be? One must also take into account the time between the utterances.
First of all some mathematical generalities.
The minimum time for one line? Some say 0.7 s is the minimum, but personally I think that's way too short. I try to avoid anything shorter than 1.2 s, and preferably even 1.4 s. Then the time between two time units. Some say that if it is less than 1 second, the two lines should (could) be merged. If longer, they should stay separate. I more or less agree with this.
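To make this concrete, here is a minimal Python sketch of the naive rule as software might apply it. The names (`Cue`, `is_too_short`, `naive_should_merge`) are made up for illustration and do not come from any real subtitle tool; the thresholds are simply the numbers mentioned above.

```python
from dataclasses import dataclass

MIN_DURATION_MS = 1200  # my personal lower limit for one line (some say 700)
MAX_GAP_MS = 1000       # "less than 1 second" rule of thumb for merging


@dataclass
class Cue:
    """One time unit: in-cue, out-cue (both in ms) and its text."""
    start_ms: int
    end_ms: int
    text: str


def is_too_short(cue: Cue) -> bool:
    """True if the line is displayed for less than the minimum duration."""
    return cue.end_ms - cue.start_ms < MIN_DURATION_MS


def naive_should_merge(first: Cue, second: Cue) -> bool:
    """Merge purely on the gap between out-cue and next in-cue."""
    gap_ms = second.start_ms - first.end_ms
    return 0 <= gap_ms < MAX_GAP_MS


a = Cue(0, 2000, "But whatever it was...")
b = Cue(2950, 4500, "was his end.")
print(naive_should_merge(a, b))  # gap is 950 ms, so the naive rule says merge
```

Note that this rule only sees the cue times, not the speech, the picture or the meaning.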
But here comes the trouble.
Nice synchronisation involves good spotting, i.e. choosing the moment at which the subtitle starts (in-cue) and the moment at which it finishes (out-cue). The out-cue never comes before the speaker has finished speaking. Generally it comes at least 'a few hundred' milliseconds after. It is also more 'calm' to place an out-cue on a camera change. For that reason you could choose to place the out-cue later still, by up to another 'a few hundred' ms. Altogether, let's say there could easily be 600 ms between the end of the first spoken text and its out-cue.
Let's say the next subtitle line starts 950 ms after that out-cue, and is thus a candidate for merging. But here too there is an in-cue, typically approx. 150 ms before speech starts. And if there is a camera change 100 ms earlier, I would choose that moment for the in-cue.
So now we have a situation where the two lines COULD be merged according to the numbers I gave first, and your software would merge them. But altogether there is an actual silence of 600 + 950 + 150 + 100 = 1800 ms, i.e. 1.8 seconds. As a human, I would definitely NOT merge. Even more so if the second line contains a surprise or a clue, or if there is any other kind of 'tension'. Only by seeing the video, hearing the sound and understanding the text and context is it possible to make the best choice.
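The arithmetic can be written out as a small Python sketch. The function and parameter names are my own, and the four values are only the example figures from this post, not measurements.

```python
MAX_GAP_MS = 1000  # the naive "merge if less than 1 second" limit


def actual_silence_ms(out_cue_lag_ms: int, cue_gap_ms: int,
                      in_cue_lead_ms: int, camera_shift_ms: int = 0) -> int:
    """Silence between the end of the first speech and the start of the next.

    out_cue_lag_ms:  how long the out-cue comes after the speaker stops
    cue_gap_ms:      gap between out-cue and next in-cue (all software sees)
    in_cue_lead_ms:  how long the in-cue comes before speech starts
    camera_shift_ms: extra lead when the in-cue is moved to a camera change
    """
    return out_cue_lag_ms + cue_gap_ms + in_cue_lead_ms + camera_shift_ms


cue_gap = 950
silence = actual_silence_ms(600, cue_gap, 150, 100)
print(cue_gap < MAX_GAP_MS)  # True: the timing-only rule would merge
print(silence)               # 1800 ms of real silence
```

The point is that the 850 ms on top of the cue gap are invisible in the subtitle file itself.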
Look at this (not very beautiful) original and think what bad stuff could happen when merging:
#242: But whatever it was...
#243: was his end.
#244: He never came back.
Another situation is where the second line is spoken by another person. The software cannot know this, because the dialog marking (the "-") is not present in single lines. Only a human watching the video can know. The merged result would lack the dialog marking, so merging here would be bad practice.
In addition to dialogs, maybe the previous time unit was a merge of two lines of text:
Person A: Bla bla bla.
Person B: Neh neh neh.
But in the next two lines the speakers could be the other way around: first person B, then person A. If you now merge these lines, you will get the ABBA effect:
Person A: Bla bla bla.
Person B: Neh neh neh.
Person B: Some more text.
Person A: And some more.
This effect is less 'calm' for the reader and should be avoided.
Also, preferably dialogs in one single time unit have a question and an answer, or (invisible for software which doesn't understand the text) a statement and a reaction. Maybe your software will merge a reaction to a previous statement together with a starting question, or even something completely different after a change of scene. This could be the (less beautiful) result:
Person A: I hope we are not too late, that would be a disaster.
Person B: Don't worry, we still have one hour.
Person C, after scene change: I love you.
Person D: I love you too.
I think these examples show pretty well how automated merging could be a disaster. I would really recommend NOT doing it. In fact I have also seen humans do 'automated' merging, just because it is possible, without really understanding what they are doing. It cripples the subtitles and is a big frustration for the subtitler who spent many hours on nice spotting. Worse, it cannot easily be undone afterwards.
Maybe the only safe situation for merging automatically would be if the time between two time units is shorter than 100 ms AND the second line is a continuation of an unfinished phrase AND both lines are short enough in time AND not too long in number of characters:
The anti-bomb squad will be here...
within half an hour.
These can be merged (but should not exist in good subtitles). The three dots should be erased:
The anti-bomb squad will be here
within half an hour.
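As a rough illustration only, the conservative rule above could look like this in Python. Everything here is hypothetical: the names are mine, the 100 ms comes from this post, but the duration and character limits are placeholder values I picked because no numbers are given, and "unfinished phrase" is guessed from three trailing dots followed by a lowercase start (the single "…" character is not handled).

```python
from dataclasses import dataclass

MAX_GAP_MS = 100                # from the rule above
MAX_MERGED_DURATION_MS = 6000   # placeholder assumption, tune to taste
MAX_LINE_CHARS = 40             # placeholder assumption, tune to taste


@dataclass
class Cue:
    start_ms: int
    end_ms: int
    text: str


def looks_unfinished(first_text: str, second_text: str) -> bool:
    """Crude guess: first line ends in '...', second starts lowercase."""
    second = second_text.lstrip()
    return first_text.rstrip().endswith("...") and second[:1].islower()


def is_safe_to_merge(first: Cue, second: Cue) -> bool:
    gap_ms = second.start_ms - first.end_ms
    return (
        0 <= gap_ms < MAX_GAP_MS
        # only single lines: two-line units may already be dialogs
        and "\n" not in first.text and "\n" not in second.text
        and looks_unfinished(first.text, second.text)
        and second.end_ms - first.start_ms <= MAX_MERGED_DURATION_MS
        and len(first.text) <= MAX_LINE_CHARS
        and len(second.text) <= MAX_LINE_CHARS
    )


def merge(first: Cue, second: Cue) -> Cue:
    """Join the two lines and erase the three dots of the first one."""
    top = first.text.rstrip()[:-3].rstrip()
    return Cue(first.start_ms, second.end_ms, top + "\n" + second.text.strip())


a = Cue(1000, 2500, "The anti-bomb squad will be here...")
b = Cue(2550, 4000, "within half an hour.")
if is_safe_to_merge(a, b):
    print(merge(a, b).text)
```

Even this cannot see a change of speaker or a camera change, so a human check is still advisable.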
"So, you are admitting formatting and styles are useful"
Yes, I confess. I use italics for 'out of view', spoken text on tv, incidental foreign or alien words, or not at all. It involves a risk, but so be it, I think the risk is small enough to pay for the added value.
But note my nuance. An italic here and there is something else than color, font, position, and god-knows-what.
"If a video file is a couple of GB, who cares about subs being 100 or 200 kB."
Size doesn't matter.
But yeah, I like small. I don't like these video files of 6GB, where good software with smart settings could achieve (practically) the same quality with less than 1GB. Only if size serves a good purpose and it is in proportion, it may be larger. Same for speed, btw. Otherwise it's silly or even obnoxious, and only a reason to replace our computer every couple of years. Etc.