Assume I have the following tables/fields:
CREATE TABLE tbl_projects (
prjc_id int PRIMARY KEY
)
CREATE TABLE tbl_project_requirements (
preq_prjc_id int -- Foreign key to tbl_projects
preq_type_id int -- A standardized requirement category
)
Given a specific project, I would like to find other projects that have nearly similar requirement categories... or let's say at least a 75% overlap on their requirements.
I could do the following:
DECLARE @prjc_id int = 1
CREATE TABLE #project_reqs (req_type_id int)
INSERT INTO #project_reqs
SELECT preq_req_type_id
FROM tbl_project_requirements
WHERE preq_prjc_id = @prjc_id
SELECT prjc_id
FROM tbl_projects
CROSS APPLY (
SELECT CASE
WHEN COUNT(*) = 0 THEN 0.0
ELSE COALESCE(SUM(CASE WHEN type_id = prjc_type_id THEN 1.0 ELSE 0.0 END), 0.0)
/ CONVERT(float, COUNT(*))
END AS similarity
FROM #project_reqs
FULL OUTER JOIN tbl_project_requirements
ON preq_type_id = type_id
WHERE preq_prjc_id = prjc_id
) reqs
WHERE prjc_id != @prjc_id
AND similarity >= 0.75
In the above, I'm dividing the matched requirement categories by the total distinct requirement categories between each two projects to get the % overlap.
While this works, I sense code smells, and don't think this will scale very well. Is there any sort of method that exists to performantly calculate overlap of child records between two items? Maybe some sort of partial hash matching or...?
Aucun commentaire:
Enregistrer un commentaire