vendredi 24 juin 2016

Fuzzy logic matching

So, I'm looking at implementing Fuzzy logic matching in my company and having trouble getting good results. For starters, I'm trying to match up Company names with those on a list supplied by other companies. My first attempt was to use soundex, but it looks like soundex only compares the first few sounds in the company name, so longer company names were too easily confused for one another. I'm now working on my second attempt using the levenstein distance comparison. It looks promising, especially if I remove the punctuation first. However, I'm still having trouble finding duplicates without too many false positives. One of the issues I have is companies such as widgetsco vs widgets inc. So, if I compare the substring of the length of the shorter name, I also pickup things like BBC University and CBC University campus. I suspect that a score using a combination of distance and longest common substring may be the solution. Has anyone managed to build an algorithm that does such a matching with limited false positives?

Aucun commentaire:

Enregistrer un commentaire