Big Data Discussion Group
Meets Thursdays, 3pm to 4pm, in SEO 636
Tentative Schedule:
On September 10, Introduction (Led by Min)
Reading material: A Survey of Statistical Methods and Computing for Big Data
On September 17 and 24, Leveraging sub-sampling approach (Led by Min). Slides
Reading materials:
Main paper: A Statistical Perspective on Algorithmic Leveraging
Supplemental material: Leveraging for big data regression
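As background for this session, here is a minimal R sketch of the leverage-based subsampling idea discussed in these papers: rows are sampled with probabilities proportional to their leverage scores, and a weighted least-squares fit is computed on the subsample. The model, sample sizes, and weighting below are purely illustrative assumptions, not the papers' exact algorithm.

# Illustrative sketch: subsample rows with probabilities proportional to
# their leverage scores (hat values), then fit weighted least squares.
set.seed(1)
n <- 1e5; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- rnorm(p)
y <- X %*% beta + rnorm(n)

# Leverage scores: diagonal of the hat matrix H = X (X'X)^{-1} X'
h <- rowSums((X %*% solve(crossprod(X))) * X)
prob <- h / sum(h)                    # sampling probabilities

r <- 1000                             # subsample size (illustrative)
idx <- sample(n, r, replace = TRUE, prob = prob)

# Reweight by 1/prob so the subsample estimator stays (approximately) unbiased
w <- 1 / prob[idx]
fit <- lm(y[idx] ~ X[idx, ] - 1, weights = w)
coef(fit)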
On October 01, Big Data and Optimal Design (Led by Min)
Abstract:
When data volume is too large to be fully analyzed with limited computing
resources, one way to extract useful information from massive data is to reduce
the data size by carefully selecting a sub-sample. Selecting the subset with
maximum information is closely related to the optimal design of experiments. We shall
discuss how these two problems connect to each other. Some ongoing research
results demonstrate the advantages of a newly proposed approach based on optimal
design methods.
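To make the connection in this abstract concrete, here is a toy R sketch (assuming a simple linear model; this is not the proposed method) that greedily selects subsample points to approximately maximize the determinant of the information matrix, in the spirit of D-optimal design.

# Toy sketch: greedy D-optimal-style subsample selection for a linear model.
# At each step, add the candidate row x that maximizes
#   det(M + x x') = det(M) * (1 + x' M^{-1} x),
# i.e. the row with the largest prediction variance under the current design.
set.seed(1)
n <- 5000; p <- 5; k <- 50           # pool size, dimension, subsample size
X <- matrix(rnorm(n * p), n, p)

M <- diag(1e-6, p)                   # small ridge so M is invertible at the start
chosen <- integer(0)
for (step in seq_len(k)) {
  Minv <- solve(M)
  score <- rowSums((X %*% Minv) * X) # x' M^{-1} x for every candidate row
  score[chosen] <- -Inf              # do not pick the same row twice
  i <- which.max(score)
  chosen <- c(chosen, i)
  M <- M + tcrossprod(X[i, ])        # update the information matrix
}
chosen                               # indices of the selected sub-sample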
On October 08, Biomedical Big Data and Precision Medicine (Led by Jie). Slides
Abstract:
The explosion in the availability of biomedical data is creating both great
opportunities and challenges for collaborative research among clinicians,
genomics and proteomics scientists, molecular biologists, and statisticians. On
one hand, electronic medical records and genomic data for large cohorts of
individuals are being assembled and becoming available for health research. On
the other hand, the combined data are extremely high-dimensional and continue
to grow, especially the genomic part. For example, first-generation genomic
data from genotyping chips target genetic variants in about 20,000 genes, while
genome sequencing (about 10-fold more expensive than chips) can provide
1000-fold more data. As one of the most critical application areas for
biomedical big data, precision medicine refers to precisely classifying
individuals into subpopulations according to their susceptibility to a
particular disease and precisely tailoring medical treatments to subcategories
of the disease. Achieving the goals of precision medicine requires combining
data across multiple formats and developing novel, sophisticated mathematical,
statistical, and computational methods for high-dimensional classification and
clustering.
On October 15, Big Data and Supercomputing Using R (Led by Jie). Slides
Abstract:
Handling big data requires high-performance computing, which is undergoing
rapid change. As the most popular open-source statistical software, R and its
add-on packages provide a wide range of high-performance computing tools. R is
free and frequently updated, which makes it an ideal option for big data
researchers. In this talk we introduce different sources of R and the packages
available for big data management, high-speed numerical calculation, parallel
computing, and the associated random number generation. We also use a case
study to compare different approaches to handling big data. For computationally
intensive R jobs, we use the Argo Cluster at UIC to illustrate how to use R for
supercomputing.
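As a small taste of the material, the following sketch uses only R's base parallel package to run a bootstrap across several worker processes with reproducible parallel random numbers; the talk's own examples and the Argo Cluster setup may of course differ.

# Minimal sketch: a parallel bootstrap using the base 'parallel' package.
library(parallel)

set.seed(1)
x <- rexp(1e5)                         # example data

cl <- makeCluster(4)                   # 4 worker processes
clusterSetRNGStream(cl, 123)           # reproducible parallel random numbers
clusterExport(cl, "x")                 # ship the data to the workers

# Each worker computes a share of the bootstrap replicates of the median
boot_meds <- parSapply(cl, 1:1000,
                       function(i) median(sample(x, replace = TRUE)))
stopCluster(cl)

sd(boot_meds)                          # bootstrap standard error of the median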
On October 22, On combining information from multiple sources (Led by Ryan)
Abstract: In applications where the data are too large to store or analyze all
at once, one idea is to split the data into a number of smaller chunks, analyze
the chunks individually, and then combine the results of these individual
analyses somehow. This problem boils down to one of combining information about
a common parameter coming from multiple sources, and I will present some of my
own (naive?) ideas about how this can be handled. In addition to the big-data
applications, the ideas presented here should also be of use in other
meta-analysis problems, something I hope to pursue further.
Reading materials:
Confidence Distribution, the Frequentist Distribution Estimator of a Parameter: A Review
Confidence Distributions and a Unifying Framework for Meta-Analysis
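Here is a minimal R sketch of the split-and-combine idea in the abstract, assuming each chunk reports an estimate of a common parameter and its standard error; the inverse-variance weighting used here is just one simple combination rule, not necessarily the one discussed in the talk.

# Sketch: combine information about a common parameter from K chunks.
# Each chunk reports an estimate and a standard error; the estimates are
# combined with inverse-variance (fixed-effect meta-analysis style) weights.
set.seed(1)
mu <- 1.5
chunks <- replicate(10, rnorm(sample(500:2000, 1), mean = mu, sd = 3),
                    simplify = FALSE)

est <- sapply(chunks, mean)                      # per-chunk estimates
se  <- sapply(chunks, function(x) sd(x) / sqrt(length(x)))

w <- 1 / se^2                                    # inverse-variance weights
combined    <- sum(w * est) / sum(w)
combined_se <- sqrt(1 / sum(w))
c(estimate = combined, se = combined_se)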
Call for volunteers to lead the following papers, or topics of your choice:
Bag of Little Bootstraps
Reading material: A Scalable Bootstrap for Massive Data
Mean Log-likelihood
Reading material: A Resampling-Based Stochastic Approximation Method for Analysis of Large Geostatistical Data
Divide and Conquer
Reading materials:
Aggregated estimating equation estimation
A split-and-conquer approach for analysis of extraordinarily large data
Or other topics of your choice