Publications

An up-to-date list is available on Google Scholar.

2021

  1. CaPC Learning: Confidential and Private Collaborative Learning
    Christopher A. Choquette-Choo, Natalie Dullerud, Adam Dziedzic, Yunxiang Zhang, Somesh Jha, Nicolas Papernot, Xiao Wang
    In ICLR (International Conference on Learning Representations) 2021

    Paper Slides Video Code Blog Post

    Machine learning benefits from large training datasets, which may not always be possible to collect by any single entity, especially when using privacy-sensitive data. In many contexts, such as healthcare and finance, separate parties may wish to collaborate and learn from each other’s data but are prevented from doing so due to privacy regulations. Some regulations prevent explicit sharing of data between parties by joining datasets in a central location (confidentiality). Others also limit implicit sharing of data, e.g., through model predictions (privacy). There is currently no method that enables machine learning in such a setting, where both confidentiality and privacy need to be preserved, to prevent both explicit and implicit sharing of data. Federated learning only provides confidentiality, not privacy, since gradients shared still contain private information. Differentially private learning assumes unreasonably large datasets. Furthermore, both of these learning paradigms produce a central model whose architecture was previously agreed upon by all parties rather than enabling collaborative learning where each party learns and improves their own local model. We introduce Confidential and Private Collaborative (CaPC) learning, the first method provably achieving both confidentiality and privacy in a collaborative setting. We leverage secure multi-party computation (MPC), homomorphic encryption (HE), and other techniques in combination with privately aggregated teacher models. We demonstrate how CaPC allows participants to collaborate without having to explicitly join their training sets or train a central model. Each party is able to improve the accuracy and fairness of their model, even in settings where each party has a model that performs well on their own dataset or when datasets are not IID and model architectures are heterogeneous across parties.
    @inproceedings{capc2021iclr,
      title = {CaPC Learning: Confidential and Private Collaborative Learning},
      author = {Choquette-Choo, Christopher A. and Dullerud, Natalie and Dziedzic, Adam and Zhang, Yunxiang and Jha, Somesh and Papernot, Nicolas and Wang, Xiao},
      booktitle = {ICLR (International Conference on Learning Representations)},
      year = {2021}
    }
    
  1. Preoperative paraspinal neck muscle characteristics predict early onset adjacent segment degeneration in anterior cervical fusion patients: A machine-learning modeling analysis
    Arnold Y. L. Wong, Garrett Harada, Remy Lee, Sapan D. Gandhi, Adam Dziedzic, Alejandro Espinoza-Orias, Mohamad Parnianpour, Philip K. Louie, Bryce Basques, Howard S. An, Dino Samartzis
    Journal of Orthopaedic Research 2021

    Paper

    Early onset adjacent segment degeneration (ASD) can be found within six months after anterior cervical discectomy and fusion (ACDF). Deficits in deep paraspinal neck muscles may be related to early onset ASD. This study aimed to determine whether the morphometry of preoperative deep neck muscles (multifidus and semispinalis cervicis) predicted early onset ASD in patients with ACDF. Thirty-two cases of early onset ASD after a two-level ACDF and 30 matched non-ASD cases were identified from a large-scale cohort. The preoperative total cross-sectional area (CSA) of bilateral deep neck muscles and the lean muscle CSAs from C3 to C7 levels were measured manually on T2-weighted magnetic resonance imaging. Paraspinal muscle CSA asymmetry at each level was calculated. A support vector machine (SVM) algorithm was used to identify demographic, radiographic, and/or muscle parameters that predicted proximal/distal ASD development. No significant between-group differences in demographic or preoperative radiographic data were noted (mean age: 52.4 ± 10.9 years). ACDFs comprised C3 to C5 (n = 9), C4 to C6 (n = 20), and C5 to C7 (n = 32) cases. Eighteen, eight, and six patients had proximal, distal, or both ASD, respectively. The SVM model achieved high accuracy (96.7%) and an area under the curve (AUC = 0.97) for predicting early onset ASD. Asymmetry of fat at C5 (coefficient: 0.06), and standardized measures of C7 lean (coefficient: 0.05) and total CSA measures (coefficient: 0.05) were the strongest predictors of early onset ASD. This is the first study to show that preoperative deep neck muscle CSA, composition, and asymmetry at C5 to C7 independently predicted postoperative early onset ASD in patients with ACDF. Paraspinal muscle assessments are recommended to identify high-risk patients for personalized intervention.
    @article{wong2021ML,
      author = {Wong, Arnold Y. L. and Harada, Garrett and Lee, Remy and Gandhi, Sapan D. and Dziedzic, Adam and Espinoza-Orias, Alejandro and Parnianpour, Mohamad and Louie, Philip K. and Basques, Bryce and An, Howard S. and Samartzis, Dino},
      title = {Preoperative paraspinal neck muscle characteristics predict early onset adjacent segment degeneration in anterior cervical fusion patients: A machine-learning modeling analysis},
      journal = {Journal of Orthopaedic Research},
      volume = {39},
      number = {8},
      pages = {1732--1744},
      keywords = {adjacent segment, cervical, degeneration, disc, disease, muscles, paraspinal, spine},
      doi = {10.1002/jor.24829},
      eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/jor.24829},
      year = {2021}
    }
    
  2. On the Exploitability of Audio Machine Learning Pipelines to Surreptitious Adversarial Examples
    Adelin Travers, Lorna Licollari, Guanghan Wang, Varun Chandrasekaran, Adam Dziedzic, David Lie, Nicolas Papernot
    2021 Preprint

    Paper

    Machine learning (ML) models are known to be vulnerable to adversarial examples. Applications of ML to voice biometrics authentication are no exception. Yet, the implications of audio adversarial examples on these real-world systems remain poorly understood given that most research targets limited defenders who can only listen to the audio samples. Conflating detectability of an attack with human perceptibility, research has focused on methods that aim to produce imperceptible adversarial examples which humans cannot distinguish from the corresponding benign samples. We argue that this perspective is coarse for two reasons: 1. Imperceptibility is impossible to verify; it would require an experimental process that encompasses variations in listener training, equipment, volume, ear sensitivity, types of background noise, etc., and 2. It disregards pipeline-based detection clues that realistic defenders leverage. This results in adversarial examples that are ineffective in the presence of knowledgeable defenders. Thus, an adversary only needs an audio sample to be plausible to a human. We thus introduce surreptitious adversarial examples, a new class of attacks that evades both human and pipeline controls. In the white-box setting, we instantiate this class with a joint, multi-stage optimization attack. Using an Amazon Mechanical Turk user study, we show that this attack produces audio samples that are more surreptitious than previous attacks that aim solely for imperceptibility. Lastly, we show that surreptitious adversarial examples are challenging to develop in the black-box setting.
    @misc{travers2021exploitability,
      title = {On the Exploitability of Audio Machine Learning Pipelines to Surreptitious Adversarial Examples},
      author = {Travers, Adelin and Licollari, Lorna and Wang, Guanghan and Chandrasekaran, Varun and Dziedzic, Adam and Lie, David and Papernot, Nicolas},
      year = {2021},
      eprint = {2108.02010},
      archiveprefix = {arXiv},
      primaryclass = {cs.SD},
      journal = {preprint arXiv:2108.02010}
    }
    
  3. When the Curious Abandon Honesty: Federated Learning Is Not Private
    Franziska Boenisch, Adam Dziedzic, Roei Schuster, Ali Shahin Shamsabadi, Ilia Shumailov, Nicolas Papernot
    2021 Preprint

    Paper

    In federated learning (FL), data does not leave personal devices when they are jointly training a machine learning model. Instead, these devices share gradients with a central party (e.g., a company). Because data never "leaves" personal devices, FL is presented as privacy-preserving. Yet, recently it was shown that this protection is but a thin facade, as even a passive attacker observing gradients can reconstruct data of individual users. In this paper, we argue that prior work still largely underestimates the vulnerability of FL. This is because prior efforts exclusively consider passive attackers that are honest-but-curious. Instead, we introduce an active and dishonest attacker acting as the central party, who is able to modify the shared model’s weights before users compute model gradients. We call the modified weights "trap weights". Our active attacker is able to recover user data perfectly and at near zero costs: the attack requires no complex optimization objectives. Instead, it exploits inherent data leakage from model gradients and amplifies this effect by maliciously altering the weights of the shared model. These specificities enable our attack to scale to models trained with large mini-batches of data. Where attackers from prior work require hours to recover a single data point, our method needs milliseconds to capture the full mini-batch of data from both fully-connected and convolutional deep neural networks. Finally, we consider mitigations. We observe that current implementations of differential privacy (DP) in FL are flawed, as they explicitly trust the central party with the crucial task of adding DP noise, and thus provide no protection against a malicious central party. We also consider other defenses and explain why they are similarly inadequate. A significant redesign of FL is required for it to provide any meaningful form of data privacy to users.
    @misc{boenisch2021curious,
      title = {When the Curious Abandon Honesty: Federated Learning Is Not Private},
      author = {Boenisch, Franziska and Dziedzic, Adam and Schuster, Roei and Shamsabadi, Ali Shahin and Shumailov, Ilia and Papernot, Nicolas},
      year = {2021},
      eprint = {2112.02918},
      archiveprefix = {arXiv},
      primaryclass = {cs.LG},
      journal = {preprint arXiv:2112.02918}
    }
    

2020

  1. Pretrained Transformers Improve Out-of-Distribution Robustness
    Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, Dawn Song
    In ACL (Association for Computational Linguistics) 2020

    Paper Slides Video Code

    Although pretrained Transformers such as BERT achieve high accuracy on in-distribution examples, do they generalize to new distributions? We systematically measure out-of-distribution (OOD) generalization for seven NLP datasets by constructing a new robustness benchmark with realistic distribution shifts. We measure the generalization of previous models including bag-of-words models, ConvNets, and LSTMs, and we show that pretrained Transformers’ performance declines are substantially smaller. Pretrained Transformers are also more effective at detecting anomalous or OOD examples, while many previous models are frequently worse than chance. We examine which factors affect robustness, finding that larger models are not necessarily more robust, distillation can be harmful, and more diverse pretraining data can enhance robustness. Finally, we show where future work can improve OOD robustness.
    @inproceedings{hendrycks-etal-2020-pretrained,
      title = {Pretrained Transformers Improve Out-of-Distribution Robustness},
      author = {Hendrycks, Dan and Liu, Xiaoyuan and Wallace, Eric and Dziedzic, Adam and Krishnan, Rishabh and Song, Dawn},
      booktitle = {ACL (Association for Computational Linguistics)},
      month = jul,
      year = {2020},
      address = {Online},
      publisher = {ACL (Association for Computational Linguistics)},
      doi = {10.18653/v1/2020.acl-main.244},
      pages = {2744--2751}
    }
    
  1. Machine Learning based detection of multiple Wi-Fi BSSs for LTE-U CSAT
    Vanlin Sathya, Adam Dziedzic, Monisha Ghosh, Sanjay Krishnan
    In ICNC (International Conference on Computing, Networking and Communications) 2020

    Paper

    According to the LTE-U Forum specification, an LTE-U base-station (BS) reduces its duty cycle from 50% to 33% when it senses an increase in the number of co-channel Wi-Fi basic service sets (BSSs) from one to two. The detection of the number of Wi-Fi BSSs that are operating on the channel in real-time, without decoding the Wi-Fi packets, still remains a challenge. In this paper, we present a novel machine learning (ML) approach that solves the problem by using energy values observed during LTE-U OFF duration. Observing the energy values (at LTE-U BS OFF time) is a much simpler operation than decoding the entire Wi-Fi packets. In this work, we implement and validate the proposed ML based approach in real-time experiments, and demonstrate that there are two distinct patterns between one and two Wi-Fi APs. This approach delivers an accuracy close to 100% compared to auto-correlation (AC) and energy detection (ED) approaches.
    @inproceedings{sathya2020machine,
      title = {Machine Learning based detection of multiple Wi-Fi BSSs for LTE-U CSAT},
      author = {Sathya, Vanlin and Dziedzic, Adam and Ghosh, Monisha and Krishnan, Sanjay},
      booktitle = {ICNC (International Conference on Computing, Networking and Communications)},
      year = {2020},
      organization = {IEEE}
    }
    
  2. An Empirical Evaluation of Perturbation-based Defenses
    Adam Dziedzic, Sanjay Krishnan
    preprint arXiv:2002.03080 2020 Preprint

    Paper

    Recent work has extensively shown that randomized perturbations of a neural network can improve its robustness to adversarial attacks. The literature is, however, lacking a detailed compare-and-contrast of the latest proposals to understand what classes of perturbations work, when they work, and why they work. We contribute a detailed experimental evaluation that elucidates these questions and benchmarks perturbation defenses in a consistent way. In particular, we show five main results: (1) all input perturbation defenses, whether random or deterministic, are essentially equivalent in their efficacy, (2) such defenses offer almost no robustness to adaptive attacks unless these perturbations are observed during training, (3) a tuned sequence of noise layers across a network provides the best empirical robustness, (4) attacks transfer between perturbation defenses so the attackers need not know the specific type of defense only that it involves perturbations, and (5) adversarial examples very close to original images show an elevated sensitivity to perturbation in a first-order analysis. Based on these insights, we demonstrate a new robust model built on noise injection and adversarial training that achieves state-of-the-art robustness.
    @article{dziedzic2020empirical,
      title = {An Empirical Evaluation of Perturbation-based Defenses},
      author = {Dziedzic, Adam and Krishnan, Sanjay},
      journal = {preprint arXiv:2002.03080},
      year = {2020}
    }
    
  3. Machine Learning enabled Spectrum Sharing in Dense LTE-U/Wi-Fi Coexistence Scenarios
    Adam Dziedzic, Vanlin Sathya, Muhammad Rochman, Monisha Ghosh, Sanjay Krishnan
    OJVT (IEEE Open Journal of Vehicular Technology) 2020

    The application of Machine Learning (ML) techniques to complex engineering problems has proved to be an attractive and efficient solution. ML has been successfully applied to several practical tasks like image recognition, automating industrial operations, etc. The promise of ML techniques in solving non-linear problems influenced this work which aims to apply known ML techniques and develop new ones for wireless spectrum sharing between Wi-Fi and LTE in the unlicensed spectrum. In this work, we focus on the LTE-Unlicensed (LTE-U) specification developed by the LTE-U Forum, which uses the duty-cycle approach for fair coexistence. The specification suggests reducing the duty cycle at the LTE-U base-station (BS) when the number of co-channel Wi-Fi basic service sets (BSSs) increases from one to two or more. However, without decoding the Wi-Fi packets, detecting the number of Wi-Fi BSSs operating on the channel in real-time is a challenging problem. In this work, we demonstrate a novel ML-based approach which solves this problem by using energy values observed during the LTE-U OFF duration. It is relatively straightforward to observe only the energy values during the LTE-U BS OFF time compared to decoding the entire Wi-Fi packet, which would require a full Wi-Fi receiver at the LTE-U base-station. We implement and validate the proposed ML-based approach through real-time experiments and demonstrate that there exist distinct patterns in the energy distributions between one and many Wi-Fi AP transmissions. The proposed ML-based approach results in a higher accuracy (close to 99% in all cases) as compared to the existing auto-correlation (AC) and energy detection (ED) approaches.
    @article{dziedzic2020machine,
      title = {Machine Learning enabled Spectrum Sharing in Dense LTE-U/Wi-Fi Coexistence Scenarios},
      author = {Dziedzic, Adam and Sathya, Vanlin and Rochman, Muhammad and Ghosh, Monisha and Krishnan, Sanjay},
      journal = {OJVT (IEEE Open Journal of Vehicular Technology)},
      year = {2020},
      publisher = {IEEE}
    }
    
  4. Input and Model Compression for Adaptive and Robust Neural Networks
    Adam Dziedzic
    2020 Thesis

    @phdthesis{dziedzic2020input,
      title = {Input and Model Compression for Adaptive and Robust Neural Networks},
      author = {Dziedzic, Adam},
      year = {2020},
      school = {The University of Chicago}
    }
    

2019

  1. Band-limited Training and Inference for Convolutional Neural Networks
    Adam Dziedzic, Ioannis Paparrizos, Sanjay Krishnan, Aaron Elmore, Michael Franklin
    In ICML (International Conference on Machine Learning) 2019

    Paper Slides Video

    The convolutional layers are core building blocks of neural network architectures. In general, a convolutional filter applies to the entire frequency spectrum of the input data. We explore artificially constraining the frequency spectra of these filters and data, called band-limiting, during training. The frequency domain constraints apply to both the feed-forward and back-propagation steps. Experimentally, we observe that Convolutional Neural Networks (CNNs) are resilient to this compression scheme and results suggest that CNNs learn to leverage lower-frequency components. In particular, we found: (1) band-limited training can effectively control the resource usage (GPU and memory); (2) models trained with band-limited layers retain high prediction accuracy; and (3) band-limited training requires no modification to existing training algorithms or neural network architectures, unlike other compression schemes.
    @inproceedings{dziedzic2019band,
      title = {Band-limited Training and Inference for Convolutional Neural Networks},
      author = {Dziedzic, Adam and Paparrizos, Ioannis and Krishnan, Sanjay and Elmore, Aaron and Franklin, Michael},
      booktitle = {ICML (International Conference on Machine Learning)},
      year = {2019}
    }
    
  1. Artificial intelligence in resource-constrained and shared environments
    Sanjay Krishnan, Aaron J Elmore, Michael Franklin, John Paparrizos, Zechao Shang, Adam Dziedzic, Rui Liu
    ACM SIGOPS Operating Systems Review 2019

    Paper

    The computational demands of modern AI techniques are immense, and as the number of practical applications grows, there will be an increasing burden on shared computing infrastructure. We envision a forthcoming era of "AI Systems" research where reducing resource consumption, reasoning about transient resource availability, trading off resource consumption for accuracy, and managing contention on specialized hardware will become the community’s main research focus. This paper overviews the history of AI systems research, a vision for the future, and the open challenges ahead.
    @article{krishnan2019artificial,
      title = {Artificial intelligence in resource-constrained and shared environments},
      author = {Krishnan, Sanjay and Elmore, Aaron J and Franklin, Michael and Paparrizos, John and Shang, Zechao and Dziedzic, Adam and Liu, Rui},
      journal = {ACM SIGOPS Operating Systems Review},
      volume = {53},
      number = {1},
      pages = {1--6},
      year = {2019},
      publisher = {ACM New York, NY, USA}
    }
    

2018

  1. Columnstore and B+ Tree - Are Hybrid Physical Designs Important?
    Adam Dziedzic, Jingjing Wang, Sudipto Das, Bolin Ding, Vivek R. Narasayya, Manoj Syamala
    In SIGMOD (ACM Special Interest Group on Management of Data) 2018

    Paper Slides

    Commercial DBMSs, such as Microsoft SQL Server, cater to diverse workloads including transaction processing, decision support, and operational analytics. They also support variety in physical design structures such as B+ tree and columnstore. The benefits of B+ tree for OLTP workloads and columnstore for decision support workloads are well-understood. However, the importance of hybrid physical designs, consisting of both columnstore and B+ tree indexes on the same database, is not well-studied — a focus of this paper. We first quantify the trade-offs using carefully-crafted micro-benchmarks. This micro-benchmarking indicates that hybrid physical designs can result in orders of magnitude better performance depending on the workload. For complex real-world applications, choosing an appropriate combination of columnstore and B+ tree indexes for a database workload is challenging. We extend the Database Engine Tuning Advisor for Microsoft SQL Server to recommend a suitable combination of B+ tree and columnstore indexes for a given workload. Through extensive experiments using industry-standard benchmarks and several real-world customer workloads, we quantify how a physical design tool capable of recommending hybrid physical designs can result in orders of magnitude better execution costs compared to approaches that rely either on columnstore-only or B+ tree-only designs.
    @inproceedings{dziedzic2018index,
      title = {Columnstore and B+ Tree - Are Hybrid Physical Designs Important?},
      author = {Dziedzic, Adam and Wang, Jingjing and Das, Sudipto and Ding, Bolin and Narasayya, Vivek R. and Syamala, Manoj},
      booktitle = {SIGMOD (ACM Special Interest Group on Management of Data)},
      year = {2018}
    }
    
  1. DeepLens: Towards a visual data management system
    Sanjay Krishnan, Adam Dziedzic, Aaron J Elmore
    CIDR (Conference on Innovative Data Systems Research) 2018

    Paper

    Advances in deep learning have greatly widened the scope of automatic computer vision algorithms and enable users to ask questions directly about the content in images and video. This paper explores the necessary steps towards a future Visual Data Management System (VDMS), where the predictions of such deep learning models are stored, managed, queried, and indexed. We propose a query and data model that disentangles the neural network models used, the query workload, and the data source semantics from the query processing layer. Our system, DeepLens, is based on dataflow query processing systems and this research prototype presents initial experiments to elicit important open research questions in visual analytics systems. One of our main conclusions is that any future "declarative" VDMS will have to revisit query optimization and automated physical design from a unified perspective of performance and accuracy tradeoffs. Physical design and query optimization choices can not only change performance by orders of magnitude, they can potentially affect the accuracy of results.
    @inproceedings{krishnan2018deeplens,
      title = {DeepLens: Towards a visual data management system},
      author = {Krishnan, Sanjay and Dziedzic, Adam and Elmore, Aaron J},
      booktitle = {CIDR (Conference on Innovative Data Systems Research)},
      year = {2018}
    }
    

2017

  1. Demonstrating the BigDAWG Polystore System for Ocean Metagenomics Analysis
    Tim Mattson, Vijay Gadepally, Zuohao She, Adam Dziedzic, Jeff Parkhurst
    In CIDR (Conference on Innovative Data Systems Research) 2017

    Paper

    In most Big Data applications, the data is heterogeneous. As we have been arguing in a series of papers, storage engines should be well suited to the data they hold. Therefore, a system supporting Big Data applications should be able to expose multiple storage engines through a single interface. We call such systems polystore systems. Our reference implementation of the polystore concept is called BigDAWG (short for the Big Data Analytics Working Group). In this demonstration, we will show the BigDAWG system and a number of polystore applications built to help ocean metagenomics researchers handle their heterogeneous Big Data.
    @inproceedings{mattson2017demonstrating,
      title = {Demonstrating the BigDAWG Polystore System for Ocean Metagenomics Analysis},
      author = {Mattson, Tim and Gadepally, Vijay and She, Zuohao and Dziedzic, Adam and Parkhurst, Jeff},
      booktitle = {CIDR (Conference on Innovative Data Systems Research)},
      year = {2017}
    }
    
  1. BigDAWG polystore release and demonstration
    Kyle O’Brien, Vijay Gadepally, Jennie Duggan, Adam Dziedzic, Aaron Elmore, Jeremy Kepner, Samuel Madden, Tim Mattson, Zuohao She, Michael Stonebraker
    preprint arXiv:1701.05799 2017 Preprint

    Paper

    @article{obrien2017bigdawg,
      title = {BigDAWG polystore release and demonstration},
      author = {O'Brien, Kyle and Gadepally, Vijay and Duggan, Jennie and Dziedzic, Adam and Elmore, Aaron and Kepner, Jeremy and Madden, Samuel and Mattson, Tim and She, Zuohao and Stonebraker, Michael},
      journal = {preprint arXiv:1701.05799},
      year = {2017}
    }
    
  2. Version 0.1 of the BigDAWG polystore system
    Vijay Gadepally, Kyle O’Brien, Adam Dziedzic, Aaron Elmore, Jeremy Kepner, Samuel Madden, Tim Mattson, Jennie Rogers, Zuohao She, Michael Stonebraker
    preprint arXiv:1707.00721 2017 Preprint

    Paper

    @article{gadepally2017version,
      title = {Version 0.1 of the BigDAWG polystore system},
      author = {Gadepally, Vijay and O'Brien, Kyle and Dziedzic, Adam and Elmore, Aaron and Kepner, Jeremy and Madden, Samuel and Mattson, Tim and Rogers, Jennie and She, Zuohao and Stonebraker, Michael},
      journal = {preprint arXiv:1707.00721},
      year = {2017}
    }
    
  3. BigDAWG version 0.1
    Vijay Gadepally, Kyle O’Brien, Adam Dziedzic, Aaron Elmore, Jeremy Kepner, Samuel Madden, Tim Mattson, Jennie Rogers, Zuohao She, Michael Stonebraker
    In HPEC (IEEE High Performance Extreme Computing) 2017

    @inproceedings{gadepally2017bigdawg,
      title = {BigDAWG version 0.1},
      author = {Gadepally, Vijay and O'Brien, Kyle and Dziedzic, Adam and Elmore, Aaron and Kepner, Jeremy and Madden, Samuel and Mattson, Tim and Rogers, Jennie and She, Zuohao and Stonebraker, Michael},
      booktitle = {HPEC (IEEE High Performance Extreme Computing)},
      pages = {1--7},
      year = {2017},
      organization = {IEEE}
    }
    
  4. Data Loading, Transformation and Migration for Database Management Systems
    Adam Dziedzic
    2017 Thesis

    @misc{dziedzic2017data,
      title = {Data Loading, Transformation and Migration for Database Management Systems},
      author = {Dziedzic, Adam},
      year = {2017},
      howpublished = {Thesis, The University of Chicago}
    }
    

2016

  1. DBMS Data Loading: An Analysis on Modern Hardware
    Adam Dziedzic, Manos Karpathiotakis, Ioannis Alagiannis, Raja Appuswamy, Anastasia Ailamaki
    In ADMS (Accelerating analytics and Data Management Systems) 2016

    Paper Slides

    Data loading has traditionally been considered a one-time deal - an offline process out of the critical path of query execution. The architecture of DBMS is aligned with this assumption. Nevertheless, the rate at which data is produced and gathered nowadays has nullified the one-off assumption, and has turned data loading into a major bottleneck of the data analysis pipeline. This paper analyzes the behavior of modern DBMSs in order to quantify their ability to fully exploit multicore processors and modern storage hardware during data loading. We examine multiple state-of-the-art DBMSs, a variety of hardware configurations, and a combination of synthetic and real-world datasets to identify bottlenecks in the data loading process and to provide guidelines on how to accelerate data loading. Our findings show that modern DBMSs are unable to saturate the available hardware resources. We therefore identify opportunities to accelerate data loading.
    @inproceedings{dziedzic2016dbms,
      title = {DBMS Data Loading: An Analysis on Modern Hardware},
      author = {Dziedzic, Adam and Karpathiotakis, Manos and Alagiannis, Ioannis and Appuswamy, Raja and Ailamaki, Anastasia},
      booktitle = {ADMS (Accelerating analytics and Data Management Systems)},
      year = {2016}
    }
    
  2. Data Transformation and Migration in Polystores
    Adam Dziedzic, Aaron Elmore, Michael Stonebraker
    In HPEC (IEEE High Performance Extreme Computing) 2016

    Paper Poster Slides

    Ever-increasing data size and new requirements in data processing have fostered the development of many new database systems. The result is that many data-intensive applications are underpinned by different engines. To enable data mobility there is a need to transfer data between systems easily and efficiently. We analyze the state-of-the-art of data migration and outline research opportunities for a rapid data transfer. Our experiments explore data migration between a diverse set of databases, including PostgreSQL, SciDB, S-Store and Accumulo. Each of the systems excels at specific application requirements, such as transactional processing, numerical computation, streaming data, and large scale text processing. Providing an efficient data migration tool is essential to take advantage of superior processing from these specialized databases. Our goal is to build such a data migration framework that will take advantage of recent advancement in hardware and software.
    @inproceedings{dziedzic2016transformation,
      title = {Data Transformation and Migration in Polystores},
      author = {Dziedzic, Adam and Elmore, Aaron and Stonebraker, Michael},
      booktitle = {HPEC (IEEE High Performance Extreme Computing)},
      year = {2016},
      organization = {IEEE}
    }
    
  1. Integrating Real-Time and Batch Processing in a Polystore
    John Meehan, Stan Zdonik, Shaobo Tian, Yulong Tian, Nesime Tatbul, Adam Dziedzic, Aaron Elmore
    In HPEC (IEEE High Performance Extreme Computing) 2016

    Paper

    This paper describes a stream processing engine called S-Store and its role in the BigDAWG polystore. Fundamentally, S-Store acts as a frontend processor that accepts input from multiple sources, and massages it into a form that has eliminated errors (data cleaning) and translates that input into a form that can be efficiently ingested into BigDAWG. S-Store also acts as an intelligent router that sends input tuples to the appropriate components of BigDAWG. All updates to S-Store’s shared memory are done in a transactionally consistent (ACID) way, thereby eliminating new errors caused by non-synchronized reads and writes. The ability to migrate data from component to component of BigDAWG is crucial. We have described a migrator from S-Store to Postgres that we have implemented as a first proof of concept. We report some interesting results using this migrator that impact the evaluation of query plans.
    @inproceedings{meehan2016integrating,
      title = {Integrating Real-Time and Batch Processing in a Polystore},
      author = {Meehan, John and Zdonik, Stan and Tian, Shaobo and Tian, Yulong and Tatbul, Nesime and Dziedzic, Adam and Elmore, Aaron},
      booktitle = {HPEC (IEEE High Performance Extreme Computing)},
      year = {2016}
    }
    

2015

  1. BigDAWG: a Polystore for Diverse Interactive Applications
    Adam Dziedzic, Jennie Duggan, Aaron J. Elmore, Vijay Gadepally, Michael Stonebraker
    In DSIA (IEEE Viz Data Systems for Interactive Analysis) 2015

    Paper

    Interactive analytics requires low-latency queries in the presence of diverse, complex, and constantly evolving workloads. To address these challenges, we introduce a polystore, BigDAWG, that tightly couples diverse database systems, data models, and query languages through the use of semantically grouped Islands of Information. BigDAWG, which stands for the Big Data Working Group, seeks to provide location transparency by matching the right system for each workload using a black-box model of query and system performance. In this paper we introduce BigDAWG as a solution to diverse web-based interactive applications and motivate our key challenges in building BigDAWG. BigDAWG continues to evolve and, where applicable, we have noted the current status of its implementation.
    @inproceedings{dziedzic2015bigdawg,
      title = {BigDAWG: a Polystore for Diverse Interactive Applications},
      author = {Dziedzic, Adam and Duggan, Jennie and Elmore, Aaron J. and Gadepally, Vijay and Stonebraker, Michael},
      booktitle = {DSIA (IEEE Viz Data Systems for Interactive Analysis)},
      year = {2015}
    }
    

2014

  1. Analysis and comparison of NoSQL databases with an introduction to consistent references in Big Data storage systems
    Adam Dziedzic, Jan Mulawka
    In Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2014

    Paper

    NoSQL is a new approach to data storage and manipulation. The aim of this paper is to gain more insight into NoSQL databases, as we are still in the early stages of understanding when to use them and how to use them in an appropriate way. In this submission, descriptions of selected NoSQL databases are presented. Each of the databases is analysed with primary focus on its data model, data access, architecture and practical usage in real applications. Furthermore, the NoSQL databases are compared with respect to data references. Relational databases offer foreign keys, whereas NoSQL databases provide only limited references. An intermediate model between graph theory and relational algebra that can address this problem should be created. Finally, a proposal for a new approach to the problem of inconsistent references in Big Data storage systems is introduced.
    @inproceedings{dziedzic2014analysis,
      title = {Analysis and comparison of NoSQL databases with an introduction to consistent references in Big Data storage systems},
      author = {Dziedzic, Adam and Mulawka, Jan},
      booktitle = {Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2014},
      volume = {9290},
      pages = {92902V},
      year = {2014},
      organization = {International Society for Optics and Photonics}
    }