Authors: Daniela Zöller, Stefan Lenz, Harald Binder
ArXiv: 1803.00422
Abstract URL: http://arxiv.org/abs/1803.00422v2
Data protection constraints frequently require a distributed analysis of
data, i.e., individual-level data remains at many different sites, but analysis
nevertheless has to be performed jointly. The corresponding aggregated data is
often exchanged manually, requiring explicit permission before transfer, so
the number of data calls and the amount of data should be limited. Thus, only
simple aggregated summary statistics are typically transferred with just a
single call. This does not allow for more complex tasks such as variable
selection. As an alternative, we propose a multivariable regression approach
for identifying important markers by automatic variable selection based on
aggregated data from different locations in iterative calls. To minimize the
amount of transferred data and the number of calls, we also provide a heuristic
variant of the approach. When performing a global data standardization, the
proposed method yields the same results as pooling individual-level data.
In a simulation study, the information loss introduced by a local
standardization is seen to be minimal. In a typical scenario, the heuristic
decreases the number of data calls from more than 10 to 3, rendering manual
data releases feasible. To make our approach widely available for application,
we provide an implementation on top of the DataSHIELD framework.
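
The equivalence between global standardization and pooling can be illustrated with a minimal sketch. It assumes each site releases only three aggregates per variable (count, sum, and sum of squares); the function names `site_summary` and `global_mean_sd` and the toy data are hypothetical and are not taken from the authors' DataSHIELD implementation. Under that assumption, the combined aggregates reproduce the mean and standard deviation of the pooled individual-level data.

```python
import math
import statistics

def site_summary(values):
    """Aggregates a site would release: count, sum, and sum of squares."""
    return len(values), sum(values), sum(v * v for v in values)

def global_mean_sd(summaries):
    """Combine per-site aggregates into the pooled mean and sample SD."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    # Sample variance from sums: (sum of squares - n * mean^2) / (n - 1)
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, math.sqrt(variance)

# Hypothetical toy data: one variable measured at three sites.
sites = [[1.0, 2.0, 4.0], [3.0, 5.0], [6.0, 8.0, 9.0, 10.0]]

# Only the aggregates leave the sites.
mean, sd = global_mean_sd([site_summary(s) for s in sites])

# Reference values computed on pooled individual-level data, for comparison only.
pooled = [v for s in sites for v in s]
print(math.isclose(mean, statistics.mean(pooled)))
print(math.isclose(sd, statistics.stdev(pooled)))
```

A local standardization would instead scale each site's values by that site's own mean and standard deviation, which requires no exchange of aggregates but no longer matches the pooled result exactly.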