andor1999
Posts: 25
Joined: Sat Sep 06, 2014 7:12 am

Capitalization Mystery! Please help.

Sun Sep 10, 2023 5:30 pm

Can anyone help solve this mystery?
I've been seeing a lot of subs lately that only have the most common proper nouns capitalized.
Uncommon last names, places, and abbreviations are all in lower case.
Most of these are RARBG releases.
Some are also full of OCR errors, indicating they are OCRs of Blurays.
It's unlikely the Bluray subs were like that.
How is this possible?
This has puzzled me for a long time.
Any ideas? Anyone?

Thanks.

scooby007
Site Admin
Posts: 839
Joined: Thu Mar 05, 2009 10:49 pm
Location: Scandalous

Re: Capitalization Mystery! Please help.

Sun Sep 10, 2023 8:53 pm

It's been a very long time since I ripped anything (my knowledge may be outdated), but the only logical reason I can think of is that the software people are using can't correctly identify specific characters in the image file, so transcription errors occur. If you assign the wrong character to an image when ripping subtitles, then every time you use that software the same mistake will be made (image recognition matrix). The best way forward is to delete the optical character recognition (OCR) settings that were wrongly assigned in the first place and start from scratch.

These days subtitles are available to rip by the ton, so it could also mean users are creating so many subs from PGS/sub/idx files that they favor quantity over quality and don't really check them.
The race to be the first to upload could be a factor, too.
Using a crappy/outdated dictionary folder (in the program) for the auto correction which doesn't pick up on these inaccuracies?
Maybe just pure laziness?
Maybe there are some other reasons I can't think of.

Depends what they are using to rip, too.
Check out: https://forum.videohelp.com/threads/392 ... port-issue
That thread indicates it may well be the program users are using.

"Qflta stupid" when it should be:

- What?
- It was a mistake in a stupid magazine.


A big difference, eh?

Hope that helps at least a tid-bit.

mrtinkles
Subtitles Admin
Posts: 142
Joined: Thu Apr 09, 2020 6:46 pm
Location: 37.8270° N, 122.4230° W

Re: Capitalization Mystery! Please help.

Sun Sep 10, 2023 10:57 pm

Many of the apps, such as SubtitleEdit and others, have various options and settings for case. I have seen this occur in the past when software-based apps are used for corrections and generation and no manual review and update is performed afterwards.

Running a quick edit now on a CAPTION conversion with SubtitleEdit, with settings for sentence case checked, produces errors such as:

"Simon and garfunkel" "We're going to a French restaurant in kenosha today." and "Bruce springsteen" .

This also occurs when using other popular Gates- or Tux-based (Windows or Linux) editors. Non-standard names and certain proper nouns will not register, at least until the various discrepancies are added to the app's custom dictionary.
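To make the mechanism concrete, here is a toy sketch of dictionary-driven recasing. It is not Subtitle Edit's actual code; the function name `redo_casing` and the tiny `KNOWN_NAMES` list are made up for illustration. The point is only that any word missing from the name list stays in lower case.

```python
import re

# Hypothetical name list; a real editor ships a much larger one.
KNOWN_NAMES = {"simon", "french", "bruce"}


def redo_casing(text, known_names=KNOWN_NAMES):
    """Lowercase everything, then re-capitalize known names and sentence starts."""
    lowered = text.lower()

    def fix_word(match):
        word = match.group(0)
        # Only words found in the name list get their capital back.
        return word.capitalize() if word in known_names else word

    recased = re.sub(r"[a-z']+", fix_word, lowered)

    # Capitalize the first letter of the text and of each new sentence.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        recased,
    )
```

With this list, `redo_casing("BRUCE SPRINGSTEEN")` gives `Bruce springsteen`, the same pattern as the errors quoted above. Adding `springsteen` to the list fixes that one name, which is why the custom dictionary matters so much.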

As for OCR issues, have migrated away from SubtitleEdit for OCR, as some better options are available, at least in my humble opinion:

viewtopic.php?t=2881&p=9206#p9206

0pener
Posts: 41
Joined: Sat Jul 08, 2023 9:37 am

Re: Capitalization Mystery! Please help.

Mon Sep 11, 2023 7:18 pm

mrtinkles wrote:
"As for OCR issues, have migrated away from SubtitleEdit for OCR, as some better options are available, at least in my humble opinion:
viewtopic.php?t=2881&p=9206#p9206"
Thanks for linking it but I must be blind, stupid or both - to which "better options" did you move?

For me, SE v3613 is still the weapon of choice (tried v4 but nah, didn't work so well with Linux). And sometimes, very rarely, I even dust off my old SubRip-1.57.1 :lol: Yes, rly, some hardcore cases did work with that when everything else failed.

Problem with SubtitleEdit is that one would have to find the proper settings first (picture, OCR method, dictionary)... and then it still might not work with certain languages.

PS/edit: As for the topic, yeah! When you mentioned RARBG... not surprised here. It was not the worst site and was usually up to date, but their stuff didn't really have good quality (all remuxed and some outright toxic). Personally, I am glad they are gone.

mrtinkles
Subtitles Admin
Posts: 142
Joined: Thu Apr 09, 2020 6:46 pm
Location: 37.8270° N, 122.4230° W

Re: Capitalization Mystery! Please help.

Mon Sep 11, 2023 11:02 pm

Sorry, was referring to SubExtractor that is listed on the software page. URL Link:
https://www.videohelp.com/software/SubExtractor

Have had very good success with it; it has an excellent OCR engine, and its settings allow adjustments to eliminate those spacing issues that can occur. It will still occasionally generate common confusions such as 1/l/i, but I have found it far superior to SE in character recognition. As with any OCR-generated subtitle, I recommend always doing a 100% spell check and review afterwards.

Note: Will work under Wine emulation with about a 95% success rate. KDE Subtitle Composer is a native, cross-platform app that I have had success with for OCR conversions under Debian.
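As a rough aid for that review pass, the sketch below flags words that mix letters with the digits 0 or 1, which is where 1/l/I and 0/O mix-ups tend to show up. The function name `flag_suspect_tokens` is made up, and this is only a heuristic: it will also flag legitimate tokens such as "10pm", so treat the output as a list of places to look, not a list of errors.

```python
import re

# A word is suspect if it contains at least one letter and at least one 0 or 1.
SUSPECT_WORD = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*[01])\w+\b")


def flag_suspect_tokens(line):
    """Return the words in a subtitle line that mix letters with 0 or 1."""
    return SUSPECT_WORD.findall(line)
```

Pure numbers such as "911" are left alone because they contain no letters.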

https://subtitlecomposer.kde.org/

andor1999
Posts: 25
Joined: Sat Sep 06, 2014 7:12 am

Re: Capitalization Mystery! Please help.

Tue Sep 12, 2023 3:50 am

Thanks for all the replies, but none of the possible explanations really fit.
I mentioned OCR errors, but that was just kind of proof they aren't Closed Caption (ALL CAPS) conversions to normal case.
If they were, that would explain it. Subtitle Edit just uses a lookup for common proper nouns.
That's exactly what these subs are like. It's like they were converted to all lower case and then run through SE's case conversion.
But these are Bluray OCRs and most are perfect except for the capitalization problems.
I posted this question hoping there was some easy answer I overlooked, but I'm still drawing a blank.
Thanks.

mrtinkles
Subtitles Admin
Posts: 142
Joined: Thu Apr 09, 2020 6:46 pm
Location: 37.8270° N, 122.4230° W

Re: Capitalization Mystery! Please help.

Tue Sep 12, 2023 5:36 am

If the default settings in SE have sentence case checked, this behavior can and does affect cleanup of both CAPTIONS and normal sentence-case subtitles. Refer to the “Batch convert” options. If the “Redo casing” setting is not unchecked prior to running a cleanup, it will introduce casing errors into previously good subtitles, based on the available dictionary entries. Testing produces errors such as: Eldridge Cleaver -> eldridge cleaver, St. Eligius -> St. eligius, Boston PD -> Boston pd, etc.

This is just a guess as to the actual root cause; you can try it for yourself and should be able to reproduce similar results on test subs by tweaking these settings.

andor1999
Posts: 25
Joined: Sat Sep 06, 2014 7:12 am

Re: Capitalization Mystery! Please help.

Tue Sep 12, 2023 2:36 pm

I see. I was aware of those settings, but didn't think about them being used in the Batch convert. I rarely use that.
It makes sense he would be using the batch function. A good possibility. I will definitely experiment with those settings.
Thanks.

andor1999
Posts: 25
Joined: Sat Sep 06, 2014 7:12 am

Re: Capitalization Mystery! Please help.

Tue Sep 12, 2023 4:04 pm

I tried all kinds of different settings and couldn't duplicate the results those weird subs have.
The Redo case option in batch convert is dependent on the current setting of 'Change casing' under Tools.
Only if the sub was all lower case to begin with could I duplicate the problem.
So this was a dead end.
Of course, this is all assuming RARBG was using Subtitle Edit. Could be they were doing OCR in a different way.
I bet they created hundreds of bad subs. What a shame.
I guess I'm going to have to give up on this one.
Thanks for all the suggestions.

urkexia
Posts: 21
Joined: Mon Jul 04, 2022 12:18 am

Re: Capitalization Mystery! Please help.

Thu Sep 14, 2023 7:00 am

RARBG closed three months ago. When did these errors start?

andor1999
Posts: 25
Joined: Sat Sep 06, 2014 7:12 am

Re: Capitalization Mystery! Please help.

Thu Sep 14, 2023 2:40 pm

I've been seeing them for at least a year, maybe more.
I don't think it's just RARBG, but that's where I've seen it the most.
But, maybe they are the originator. Subs get passed around.
It's been bugging me. I just can't think of any logical way it could happen in the OCR process.
I was also hoping someone else here would have noticed it.

andor1999
Posts: 25
Joined: Sat Sep 06, 2014 7:12 am

Re: Capitalization Mystery! Please help.

Thu Sep 14, 2023 8:51 pm

Found a RARBG sub that shows this lower case proper noun issue.
It includes bad OCR of the music notes so it's definitely a Bluray OCR.
Check out the last names of the band members and studio abbreviations.
Here's a link to it on my Google drive if that's permitted.
https://drive.google.com/file/d/1TimklJ ... sp=sharing

mrtinkles
Subtitles Admin
Posts: 142
Joined: Thu Apr 09, 2020 6:46 pm
Location: 37.8270° N, 122.4230° W

Re: Capitalization Mystery! Please help.

Thu Sep 14, 2023 10:51 pm

OK, from a quick review this appears almost certainly to be a SubtitleEdit result, generated from PGS or idx/sub (Bluray or DVD). A common occurrence I’ve seen is the j replacement for music notes, which is the giveaway. This conversion appears to have been run with “Fix OCR errors” and not “Prompt for unknown words”, then left to run on its own. Avoiding these aberrations requires human interaction, such as real-time monitoring and updating, and then post review, editing and correction. Have seen quite a few end results such as these from the Tesseract engines that are used. Noted some other issues during a quick perusal, and some additional glitches like Jitlss (?), io (?), wnat (what), massachusetfts (Massachusetts).

Note: Go to “Options” – “Settings” – “Word List” and you will see that there are various lists (libraries) that can be updated to simulate learning, by adding new words to the app's default dictionaries. These should be constantly updated as new subtitles are being generated. This once again requires human interaction for selection and proper decision making.

Remember also, OCR needs to recognize characters across a countless number of fonts. For example, with the OCR engine I prefer, rips from certain sources seem to run quite smoothly and have standardization of the fonts used, while some other sources seem to be problematic and require some tweaking, finagling, and lots of aggravation.
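For the stray j specifically, a small cleanup like the one below can help during post review. It is only a sketch under one assumption: that a lone "j" as the first or last token of a line is a misread music note. The function name `restore_music_notes` and the sample lyric in the usage note are invented for illustration.

```python
import re


def restore_music_notes(line):
    """Replace a standalone 'j' at the start or end of a line with a music note."""
    # Leading "j" followed by whitespace, e.g. "j la la la".
    line = re.sub(r"^j(?=\s)", "♪", line)
    # Trailing "j" preceded by whitespace, e.g. "la la la j".
    line = re.sub(r"(?<=\s)j$", "♪", line)
    return line
```

For example, `restore_music_notes("j la la la j")` returns `♪ la la la ♪`, while a line such as "just a word" is left untouched because the j is part of a word.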

Bottom line is that all subtitles need some sort of human involvement during the conversion process for setup, and also afterwards for the verification process. The example subtitle you provided shows neither of those. Basic human omission.

andor1999
Posts: 25
Joined: Sat Sep 06, 2014 7:12 am

Re: Capitalization Mystery! Please help.

Fri Sep 15, 2023 3:39 pm

I agree with all you said, but how would that explain a line like this:
"At caxton hall, Robin gibb of the Bee Gees marries Molly hullis."
Are OCR errors responsible for it failing to detect the C in Caxton, the H in Hall, the G in Gibb, and the H in Hullis?
Or is that the way the Bluray sub actually was?
Or, let's assume it did OCR correctly and then some post process selectively converted them back to lower case.
I haven't been able to duplicate any such post process.
The simplest explanation seems to be that that's the way the Bluray subs actually are.
But I find that hard to believe. So, I opened it for discussion here.
Unless I come across both the PGS and the SRT together, I guess we'll never know.
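If both versions ever do turn up, a comparison like this sketch would settle it: OCR the PGS yourself, then list every word that differs from the circulating SRT only in its casing. The function name `case_only_differences` is made up, and the "reference" line in the usage note is simply the correctly cased version of the line quoted above, not real Bluray data.

```python
import difflib
import re


def case_only_differences(reference_text, suspect_text):
    """Return (reference_word, suspect_word) pairs that differ only in casing."""
    ref_words = re.findall(r"[A-Za-z']+", reference_text)
    sus_words = re.findall(r"[A-Za-z']+", suspect_text)

    # Align the two word lists case-insensitively so that genuine OCR
    # differences (missing or garbled words) do not shift the comparison.
    matcher = difflib.SequenceMatcher(
        a=[w.lower() for w in ref_words],
        b=[w.lower() for w in sus_words],
        autojunk=False,
    )

    differences = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            ref = ref_words[block.a + offset]
            sus = sus_words[block.b + offset]
            if ref != sus:
                differences.append((ref, sus))
    return differences
```

Feeding it "At Caxton Hall, Robin Gibb of the Bee Gees marries Molly Hullis." as the reference and the lower-cased line from the sub as the suspect returns the four pairs for Caxton, Hall, Gibb and Hullis. If a fresh OCR of the PGS already shows the lower case, the Bluray subs really are that way; if not, some post process changed them.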

0pener
Posts: 41
Joined: Sat Jul 08, 2023 9:37 am

Re: Capitalization Mystery! Please help.

Sat Sep 16, 2023 11:43 am

andor1999 wrote:
"I agree with all you said, but how would that explain a line like this:
"At caxton hall, Robin gibb of the Bee Gees marries Molly hullis."
...
Unless I come across both the PGS and the SRT together, I guess we'll never know."

Another thing, rather theoretical: I was experimenting with speech recognition (Vosk) and chose the wrong language by accident a few times... Could it be that some of these are botched speech recognition? The mistakes presented here look more like OCR though (speech would have no/few typos but many wrong words)...

PS: Thanks for the links, mrtinkles. Will look into those progs.
