I have a SQL database with approximately 9 million observations spread across 5 tables. I have established ODBC connectivity in R using RODBC and have explored the data a bit. The end goal is to build a database that people with their own research questions can query to extract the variables relevant to their analyses.

To get there, I first need to create a matrix of missingness showing which variables exist for which observations. The tables cover 1994-2012, though 2 of the tables only run through 2011. For each year, I picture a matrix with each table in the database as a row and every variable in the database as a column, where each cell holds the number of observations and the % missing (see the example image below). I am open to other ideas, of course.

I am not sure how to go about making this happen, however, since each table has different columns, and some columns appear in more than one table (i.e. id, name, etc.). I am wondering if it is best to first create such a matrix for each table separately, with each year as a row and that table's variables as columns, with the cells containing the same information as in the end-goal matrix. I would then make each summary a permanent table to be manipulated using R alone. This would prevent my machine from crashing/taking forever by avoiding a join of all of those observations into one massive table. I would then still need to unite the per-table summaries by year. Any suggestions/insight/feedback are welcome. Thank you all in advance.
An individual year of the matrix of missingness:
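In case it helps, here is a minimal sketch of the per-table approach, assuming an ODBC DSN called "my_dsn", a year column named "year" in every table, and hypothetical table names. The counting is pushed into SQL with one GROUP BY query per table, so only a small summary crosses the ODBC connection rather than the 9 million rows:

```r
library(RODBC)

con <- odbcConnect("my_dsn")  # hypothetical DSN name

# One aggregate query per table: COUNT(*) plus a SUM(CASE ...) per column,
# grouped by year, so the counting happens in the database, not in R.
missingness_by_year <- function(channel, table, year_col = "year") {
  cols <- sqlColumns(channel, table)$COLUMN_NAME
  vars <- setdiff(cols, year_col)
  miss_exprs <- paste0(
    "SUM(CASE WHEN ", vars, " IS NULL THEN 1 ELSE 0 END) AS miss_", vars,
    collapse = ", "
  )
  qry <- paste0("SELECT ", year_col, ", COUNT(*) AS n_obs, ", miss_exprs,
                " FROM ", table, " GROUP BY ", year_col)
  agg <- sqlQuery(channel, qry, stringsAsFactors = FALSE)

  # Reshape the one-row-per-year result to long form:
  # one row per (year, variable), with the % missing computed in R.
  long <- reshape(agg,
                  varying   = paste0("miss_", vars),
                  v.names   = "n_missing",
                  timevar   = "variable",
                  times     = vars,
                  direction = "long")
  long$pct_missing <- round(100 * long$n_missing / long$n_obs, 1)
  long$table <- table
  long[, c("table", year_col, "variable", "n_obs", "pct_missing")]
}

# Run it once per table and stack the small summaries; shared columns
# (id, name, ...) simply show up once for every table that has them.
tables <- c("table1", "table2")  # hypothetical table names
summary_all <- do.call(rbind, lapply(tables, missingness_by_year, channel = con))
```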
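Once the per-table summaries are stacked, producing the envisioned year-by-year matrix is a small pivot in R. A sketch, again under the same assumptions about column names:

```r
# Pivot one year of the stacked summary into the envisioned matrix:
# tables as rows, variables as columns, cells as "n_obs (pct% missing)".
# Variables a table lacks are simply left as NA.
make_year_matrix <- function(long_summary, yr, year_col = "year") {
  sub <- long_summary[long_summary[[year_col]] == yr, ]
  cells <- paste0(sub$n_obs, " (", sub$pct_missing, "%)")
  tapply(cells, list(sub$table, sub$variable), function(x) x[1])
}

mat_1994 <- make_year_matrix(summary_all, 1994)
```

These matrices are small enough to write back as permanent tables with sqlSave(), so later users can inspect coverage without touching the raw observations.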