TY - GEN
T1 - Grouping of variables to facilitate SDL methods in multivariate data sets
AU - Oganian, Anna
AU - Iacob, Ionut
AU - Lesaja, Goran
N1 - Publisher Copyright: © 2018, Springer Verlag. All rights reserved.
PY - 2018
Y1 - 2018
AB - Data sets that are subject to Statistical Disclosure Limitation (SDL) often have many variables of different types that need to be altered for disclosure limitation. To produce a good-quality public data set, the data protector needs to account for the relationships between the variables. Hence, SDL methods should ideally be multivariate, handling many variables at the same time, rather than univariate, treating each variable independently of the others. However, if a data set has many variables, as most government survey data do, developing and implementing a multivariate approach to SDL becomes difficult. In this paper we propose a pre-masking data processing procedure that clusters the variables of high-dimensional data sets so that different groups of variables can be masked independently, thus reducing the complexity of SDL. We consider different hierarchical clustering methods, including our own version of a hierarchical clustering algorithm, which we call K-Link, and outline how the data protector can define an appropriate number of clusters for these methods. We implemented these methods and applied them to two genuine multivariate data sets. The experimental results show that K-Link has the potential to solve this problem efficiently. The success of the method, however, depends on the correlation structure of the data. For data sets in which most of the variables are correlated, clustering the variables and then applying SDL methods independently to different clusters may attenuate correlations in the masked data, even with efficient clustering methods. Thus, the proposed approach is a trade-off between the computational complexity of multivariate SDL methods and the loss of data utility caused by treating different clusters independently.
KW - Dimensionality reduction
KW - Hierarchical clustering
KW - Statistical Disclosure Limitation (SDL)
UR - http://www.scopus.com/inward/record.url?scp=85053919840&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-99771-1_13
DO - 10.1007/978-3-319-99771-1_13
M3 - Conference article
AN - SCOPUS:85053919840
SN - 9783319997704
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 187
EP - 199
BT - Privacy in Statistical Databases - UNESCO Chair in Data Privacy, International Conference, PSD 2018, Proceedings
A2 - Montes, Francisco
A2 - Domingo-Ferrer, Josep
PB - Springer Verlag
T2 - International Conference on Privacy in Statistical Databases, PSD 2018
Y2 - 26 September 2018 through 28 September 2018
ER -