Big Data Discussion Group
Meets Thursdays, 3pm to 4pm, in SEO 636
Tentative Schedule:
On September 10, Introduction (Led by Min)
Reading material: A Survey of Statistical Methods and Computing for Big Data
On September 17 and 24, Leveraging sub-sampling approach (Led by Min). Slides
Reading materials:
Main paper: A Statistical Perspective on Algorithmic Leveraging
Supplemental material: Leveraging for big data regression
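As background for this session, here is a minimal R sketch of the leverage-based subsampling idea discussed in these papers: rows are sampled with probabilities proportional to their leverage scores, and a weighted least-squares fit is computed on the subsample. The model, sample sizes, and weighting below are purely illustrative assumptions, not the papers' exact algorithm.

# Illustrative sketch: subsample rows with probabilities proportional to
# their leverage scores (hat values), then fit weighted least squares.
set.seed(1)
n <- 1e5; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- rnorm(p)
y <- X %*% beta + rnorm(n)

# Leverage scores: diagonal of the hat matrix H = X (X'X)^{-1} X'
h <- rowSums((X %*% solve(crossprod(X))) * X)
prob <- h / sum(h)                    # sampling probabilities

r <- 1000                             # subsample size (illustrative)
idx <- sample(n, r, replace = TRUE, prob = prob)

# Reweight by 1/prob so the subsample estimator stays (approximately) unbiased
w <- 1 / prob[idx]
fit <- lm(y[idx] ~ X[idx, ] - 1, weights = w)
coef(fit)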
On October 01, Big Data and Optimal Design (Led by Min)
Abstract:
When data volume is too large to be fully analyzed with limited computing
resources, one way to extract useful information from massive data is to reduce
the data size by carefully selecting a sub-sample. Selecting the subset with
maximum information is closely related to the optimal design of experiments. We shall
discuss how these two problems connect to each other. Some ongoing research
results demonstrate the advantages of a newly proposed approach based on optimal
design methods.
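To make the connection in this abstract concrete, here is a toy R sketch (assuming a simple linear model; this is not the proposed method) that greedily selects subsample points to approximately maximize the determinant of the information matrix, in the spirit of D-optimal design.

# Toy sketch: greedy D-optimal-style subsample selection for a linear model.
# At each step, add the candidate row x that maximizes
#   det(M + x x') = det(M) * (1 + x' M^{-1} x),
# i.e. the row with the largest prediction variance under the current design.
set.seed(1)
n <- 5000; p <- 5; k <- 50           # pool size, dimension, subsample size
X <- matrix(rnorm(n * p), n, p)

M <- diag(1e-6, p)                   # small ridge so M is invertible at the start
chosen <- integer(0)
for (step in seq_len(k)) {
  Minv <- solve(M)
  score <- rowSums((X %*% Minv) * X) # x' M^{-1} x for every candidate row
  score[chosen] <- -Inf              # do not pick the same row twice
  i <- which.max(score)
  chosen <- c(chosen, i)
  M <- M + tcrossprod(X[i, ])        # update the information matrix
}
chosen                               # indices of the selected sub-sample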
On October 08, Biomedical Big Data and Precision Medicine (Led by Jie). Slides
Abstract:
The explosion in the availability of biomedical data is creating both great
opportunities and challenges for collaborative research among clinicians,
genomics and proteomics scientists, molecular biologists, and statisticians. On
one hand, electronic medical records and genomic data for large cohorts of
individuals are being assembled and becoming available for health research. On
the other hand, the combined data are extremely high-dimensional and continue
to grow, especially the genomic part. For example, first-generation genomic
data from genotyping chips target genetic variants in about 20,000 genes, while
genome sequencing (about 10-fold more expensive than chips) can provide
1000-fold more data. As one of the most critical application areas for
biomedical big data, precision medicine refers to precisely classifying
individuals into subpopulations according to their susceptibility to a
particular disease and precisely tailoring medical treatments to subcategories
of the disease. Achieving the goals of precision medicine requires combining
data across multiple formats and developing novel, sophisticated mathematical,
statistical, and computational methods for high-dimensional classification and
clustering.
On October 15, Big Data and Supercomputing Using R (Led by Jie). Slides
Abstract:
Handling big data requires high-performance computing, which is undergoing
rapid change. As the most popular open-source statistical software, R and its
add-on packages provide a wide range of high-performance computing tools. R is
free and frequently updated, which makes it an ideal option for big data
researchers. In this talk we introduce different sources of R and the packages
available for big data management, high-speed numerical calculation, parallel
computing, and the associated random number generation. We also use a case
study to compare different approaches to handling big data. For computationally
intensive R jobs, we use the Argo Cluster at UIC to illustrate how to use R for
supercomputing.
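As a small taste of the material, the following sketch uses only R's base parallel package to run a bootstrap across several worker processes with reproducible parallel random numbers; the talk's own examples and the Argo Cluster setup may of course differ.

# Minimal sketch: a parallel bootstrap using the base 'parallel' package.
library(parallel)

set.seed(1)
x <- rexp(1e5)                         # example data

cl <- makeCluster(4)                   # 4 worker processes
clusterSetRNGStream(cl, 123)           # reproducible parallel random numbers
clusterExport(cl, "x")                 # ship the data to the workers

# Each worker computes a share of the bootstrap replicates of the median
boot_meds <- parSapply(cl, 1:1000,
                       function(i) median(sample(x, replace = TRUE)))
stopCluster(cl)

sd(boot_meds)                          # bootstrap standard error of the median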
On October 22, On combining information from multiple sources (Led by Ryan)
Abstract: In applications where the data are too large to store or analyze all
at once, one idea is to split the data into a number of smaller chunks, analyze
the chunks individually, and then combine the results of these individual
analyses somehow. This problem boils down to one of combining information about
a common parameter coming from multiple sources, and I will present some of my
own (naive?) ideas about how this can be handled. In addition to the big-data
applications, the ideas presented here should also be of use in other
meta-analysis problems, something I hope to pursue further.
Reading materials:
Confidence Distribution, the Frequentist Distribution Estimator of a Parameter: A Review
Confidence Distributions and a Unifying Framework for Meta-Analysis
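Here is a minimal R sketch of the split-and-combine idea in the abstract, assuming each chunk reports an estimate of a common parameter and its standard error; the inverse-variance weighting used here is just one simple combination rule, not necessarily the one discussed in the talk.

# Sketch: combine information about a common parameter from K chunks.
# Each chunk reports an estimate and a standard error; the estimates are
# combined with inverse-variance (fixed-effect meta-analysis style) weights.
set.seed(1)
mu <- 1.5
chunks <- replicate(10, rnorm(sample(500:2000, 1), mean = mu, sd = 3),
                    simplify = FALSE)

est <- sapply(chunks, mean)                      # per-chunk estimates
se  <- sapply(chunks, function(x) sd(x) / sqrt(length(x)))

w <- 1 / se^2                                    # inverse-variance weights
combined    <- sum(w * est) / sum(w)
combined_se <- sqrt(1 / sum(w))
c(estimate = combined, se = combined_se)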
Call for volunteers to lead the following papers, or topics of your choice:
Bag of Little Bootstraps
Reading material: A Scalable Bootstrap for Massive Data
Mean Log-likelihood
Reading material: A Resampling-Based Stochastic Approximation Method for Analysis of Large Geostatistical Data
Divide and Conquer
Reading materials:
Aggregated estimating equation estimation
A split-and-conquer approach for analysis of extraordinarily large data
Or other topics of your choice