jeudi 23 juin 2016

SQL: performantly find the percent overlap between two foreign key child tables

Assume I have the following tables/fields:

CREATE TABLE tbl_projects (
    prjc_id int PRIMARY KEY
)

CREATE TABLE tbl_project_requirements (
    preq_prjc_id int -- Foreign key to tbl_projects
    preq_type_id int -- A standardized requirement category
)

Given a specific project, I would like to find other projects that have nearly similar requirement categories... or let's say at least a 75% overlap on their requirements.

I could do the following:

DECLARE @prjc_id int = 1

CREATE TABLE #project_reqs (req_type_id int)
INSERT INTO #project_reqs
SELECT preq_req_type_id
FROM tbl_project_requirements
WHERE preq_prjc_id = @prjc_id

SELECT prjc_id
FROM tbl_projects
    CROSS APPLY (
        SELECT CASE 
                WHEN COUNT(*) = 0 THEN 0.0
                ELSE COALESCE(SUM(CASE WHEN type_id = prjc_type_id THEN 1.0 ELSE 0.0 END), 0.0)
                    / CONVERT(float, COUNT(*))
            END AS similarity
        FROM #project_reqs 
            FULL OUTER JOIN tbl_project_requirements
                ON preq_type_id = type_id
        WHERE preq_prjc_id = prjc_id
    ) reqs
WHERE prjc_id != @prjc_id
    AND similarity >= 0.75

In the above, I'm dividing the matched requirement categories by the total distinct requirement categories between each two projects to get the % overlap.

While this works, I sense code smells, and don't think this will scale very well. Is there any sort of method that exists to performantly calculate overlap of child records between two items? Maybe some sort of partial hash matching or...?

Aucun commentaire:

Enregistrer un commentaire