Page 1 of 1

General question to hash algorithms

Posted: Fri Sep 28, 2012 6:38 pm
by kirill
Hi,

I noticed that similar sites like OpenSubtitles use different hashing algorithms.

For example,TheSubDB just does an MD5 hash over the first and last 64 KB. Filesize is not used. Their Python example is misleading.

http://thesubdb.com/api/

Is there an advantage of the OpenSubtitles hashing, which does a CRC64 (?) ?
Were there many hash collisions in the past, or why the additional "moviebytesize" ?
Moviebytesize is already added to the hash, why is it needed for the search?

Do you think the Hashing of the SubDB has advantages or is it flawed to create hash collisions in the future?

Sublight also does the hashing differently: http://www.sublight.si/Article/6/How-to ... -hash.aspx

Re: General question to hash algorithms

Posted: Tue Oct 16, 2012 7:01 pm
by oss
Hi Kirill,

good question, too bad I didnt see it before. Anyway, here is explanation. Read here about OpenSubtitles Hash: http://trac.opensubtitles.org/projects/ ... ourceCodes

First of all, I didn't develop this hash myself - it is based on Gabest MPC, he got already database back in time with hashes, so I thought it would be good idea to make it standard and not develop another hash, like others did and are doing.

So far, there are no collisions, the search itself works perfectly on hash, but we added filesize as required parameter (why not, if everybody got this info).

As you can see, now there are many hashes floating around, almost everybody uses owns hash. I didn't want to make this happen (even opensubtitles back in time could come with new hash, but thats another story), but I am not controlling other programmers. AFAIK there is not big comparison between older CRC64 and younger MD5, which is "stronger", if I would implement hash, it would use SHA1 at least. Problem with CRC64 might be implementation in programming languages, for example PHP API doesn't support it, but we now have source codes in almost every language...

Re: General question to hash algorithms

Posted: Sun Nov 04, 2012 5:36 pm
by eduo
Quick note:

What makes OS's hash impervious to collisions is, precisely, the addition of filesize. It makes sure several formats which can change in the middle of the file but not at the ends are, nonetheless, treated as different files. Adding it as a separate field ensures that the new hash doesn't ever collision.

The standard Capiscuas came up with for Opensubtitles is faster than MD5 hashing (bytesize is required to get the last 64 bytes, so it's no extra processing) and less prone to collisions. The filesize would be removed and collisions most likely wouldn't happen, but since it's always an automated process playing the safer card is better.

Sublight at the time did their own hashing different from OpenSubtitles because they were called out for ripping OS off. Sublight's hashing is a quick hashing mechanism that's ridiculously slower (a lot more of the file must be read and special video recognition libraries must be used). Also, video duration depends on local library used and seconds might get rounded off or the local machine may not have necessary codecs whereas Gabest's hash can be calculated just by having access to the file data itself.

SubDB's hash is not bad, just slower. I have in the past considered adding SubDB support to SolEol and refrained from doing so for other reasons. Sublight's hash is mediocre and, to my eyes, a reflection of the style and principles of its creator.

The problem with any hashing, though, is that it becomes too popular and you end up with a mediocre program supporting it and "tainting" your database. In OpenSubtitles this happened when SubPlayer (or BSPlayer, I can't recall) added automatic upload of subtitles without any control or quality confirmation from the user. To my eyes that was the worst possible use of the API and we've been paying for it since then (not to mention an action worthy of the "malware" description for any player that uses it).

Re: General question to hash algorithms

Posted: Mon Feb 25, 2013 12:20 am
by macofaco
Sublight has basically two versions of hash algorithms: old one and new one :)

New version is much much faster to calculate and very easy to implement. It is described in document http://www.sublight.si/Documents/API/Su ... rvices.pdf on page 6 (document is almost two years old).

Sublight 3 can operate with both versions but new version of Sublight 4 will probably have just newer one.