Adam Dziedzic

I am a postdoctoral researcher at the Vector Institute and the University of Toronto, and a member of the CleverHans Lab, advised by Prof. Nicolas Papernot. I earned my PhD at the University of Chicago, where I was advised by Prof. Sanjay Krishnan and worked on band-limited convolutional neural networks, the DeepLens project, and the out-of-distribution robustness of pre-trained transformers. My earlier research focused on data loading and migration between diverse database systems within the framework of the BigDAWG project. I obtained my Bachelor's and Master's degrees from the Warsaw University of Technology in Poland. I also studied at DTU (Technical University of Denmark) and carried out research on databases in the DIAS group at EPFL, Switzerland. I was a PhD intern at Microsoft Research in the Data Management, Exploration and Mining (DMX) group, advised by Vivek Narasayya, where I worked on the recommendation of hybrid physical designs (B+ trees and columnstores) for SQL Server. I also had internships at CERN (Geneva, Switzerland), Barclays Investment Bank (London, UK), Microsoft Research (Redmond, USA), and Google (Madison, USA). More information about my research can be found in my research statement.

Selected Publications

Please also see the Full List of my publications and my Google Scholar profile.

  1. CaPC Learning: Confidential and Private Collaborative Learning
    Christopher A. Choquette-Choo, Natalie Dullerud, Adam Dziedzic, Yunxiang Zhang, Somesh Jha, Nicolas Papernot, Xiao Wang
    In ICLR (International Conference on Learning Representations) 2021

    Paper Slides Video Code Blog Post

    Machine learning benefits from large training datasets, which may not always be possible to collect by any single entity, especially when using privacy-sensitive data. In many contexts, such as healthcare and finance, separate parties may wish to collaborate and learn from each other’s data but are prevented from doing so due to privacy regulations. Some regulations prevent explicit sharing of data between parties by joining datasets in a central location (confidentiality). Others also limit implicit sharing of data, e.g., through model predictions (privacy). There is currently no method that enables machine learning in such a setting, where both confidentiality and privacy need to be preserved, to prevent both explicit and implicit sharing of data. Federated learning only provides confidentiality, not privacy, since gradients shared still contain private information. Differentially private learning assumes unreasonably large datasets. Furthermore, both of these learning paradigms produce a central model whose architecture was previously agreed upon by all parties rather than enabling collaborative learning where each party learns and improves their own local model. We introduce Confidential and Private Collaborative (CaPC) learning, the first method provably achieving both confidentiality and privacy in a collaborative setting. We leverage secure multi-party computation (MPC), homomorphic encryption (HE), and other techniques in combination with privately aggregated teacher models. We demonstrate how CaPC allows participants to collaborate without having to explicitly join their training sets or train a central model. Each party is able to improve the accuracy and fairness of their model, even in settings where each party has a model that performs well on their own dataset or when datasets are not IID and model architectures are heterogeneous across parties.
    @inproceedings{capc2021iclr,
      title = {CaPC Learning: Confidential and Private Collaborative Learning},
      author = {Choquette-Choo, Christopher A. and Dullerud, Natalie and Dziedzic, Adam and Zhang, Yunxiang and Jha, Somesh and Papernot, Nicolas and Wang, Xiao},
      booktitle = {ICLR (International Conference on Learning Representations)},
      year = {2021}
    }
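
    To give a flavor of one building block of CaPC, here is a minimal Python sketch of PATE-style private aggregation of teacher votes. The full protocol additionally wraps each query in secure multi-party computation and homomorphic encryption, which this sketch omits; the vote matrix and noise scale below are illustrative placeholders.

    import numpy as np

    def noisy_aggregate(teacher_votes, sigma, rng=None):
        """Differentially private plurality vote over one-hot teacher votes."""
        rng = rng or np.random.default_rng()
        # teacher_votes: (num_teachers, num_classes) one-hot predictions.
        histogram = teacher_votes.sum(axis=0).astype(float)
        # Gaussian noise calibrated by sigma makes the released label
        # differentially private (the GNMax aggregator from PATE).
        noisy_histogram = histogram + rng.normal(0.0, sigma, size=histogram.shape)
        return int(np.argmax(noisy_histogram))

    # Example: 10 teachers, 3 classes, most teachers vote for class 2.
    votes = np.eye(3)[np.array([2, 2, 2, 2, 2, 1, 2, 0, 2, 2])]
    label = noisy_aggregate(votes, sigma=2.0)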
    
  2. Pretrained Transformers Improve Out-of-Distribution Robustness
    Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, Dawn Song
    In ACL (Association for Computational Linguistics) 2020

    Paper Slides Video Code

    Although pretrained Transformers such as BERT achieve high accuracy on in-distribution examples, do they generalize to new distributions? We systematically measure out-of-distribution (OOD) generalization for seven NLP datasets by constructing a new robustness benchmark with realistic distribution shifts. We measure the generalization of previous models including bag-of-words models, ConvNets, and LSTMs, and we show that pretrained Transformers’ performance declines are substantially smaller. Pretrained transformers are also more effective at detecting anomalous or OOD examples, while many previous models are frequently worse than chance. We examine which factors affect robustness, finding that larger models are not necessarily more robust, distillation can be harmful, and more diverse pretraining data can enhance robustness. Finally, we show where future work can improve OOD robustness.
    @inproceedings{hendrycks-etal-2020-pretrained,
      title = {Pretrained Transformers Improve Out-of-Distribution Robustness},
      author = {Hendrycks, Dan and Liu, Xiaoyuan and Wallace, Eric and Dziedzic, Adam and Krishnan, Rishabh and Song, Dawn},
      booktitle = {ACL (Association for Computational Linguistics)},
      month = jul,
      year = {2020},
      address = {Online},
      publisher = {ACL (Association for Computational Linguistics)},
      doi = {10.18653/v1/2020.acl-main.244},
      pages = {2744--2751}
    }
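
    The paper's OOD detection experiments build on the maximum softmax probability (MSP) baseline: in-distribution inputs tend to receive higher max softmax scores than OOD inputs. The NumPy sketch below shows only this scoring logic, with model outputs left as placeholders; it is a minimal illustration, not the paper's evaluation code.

    import numpy as np

    def msp_scores(logits):
        """Max softmax probability per example; higher = more in-distribution."""
        z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
        return probs.max(axis=1)

    def fpr_at_tpr(in_logits, ood_logits, tpr=0.95):
        """False-positive rate on OOD inputs at the threshold that keeps
        a fraction `tpr` of in-distribution inputs."""
        in_s, ood_s = msp_scores(in_logits), msp_scores(ood_logits)
        threshold = np.quantile(in_s, 1.0 - tpr)    # keep 95% of in-dist
        return float((ood_s >= threshold).mean())   # OOD wrongly accepted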
    
  3. Band-limited Training and Inference for Convolutional Neural Networks
    Adam Dziedzic, Ioannis Paparrizos, Sanjay Krishnan, Aaron Elmore, Michael Franklin
    In ICML (International Conference on Machine Learning) 2019

    Paper Slides Video

    The convolutional layers are core building blocks of neural network architectures. In general, a convolutional filter applies to the entire frequency spectrum of the input data. We explore artificially constraining the frequency spectra of these filters and data, called band-limiting, during training. The frequency domain constraints apply to both the feed-forward and back-propagation steps. Experimentally, we observe that Convolutional Neural Networks (CNNs) are resilient to this compression scheme, and the results suggest that CNNs learn to leverage lower-frequency components. In particular, we found that: (1) band-limited training can effectively control resource usage (GPU and memory); (2) models trained with band-limited layers retain high prediction accuracy; and (3) unlike other compression schemes, band-limiting requires no modification to existing training algorithms or neural network architectures.
    @inproceedings{dziedzic2019band,
      title = {Band-limited Training and Inference for Convolutional Neural Networks},
      author = {Dziedzic, Adam and Paparrizos, Ioannis and Krishnan, Sanjay and Elmore, Aaron and Franklin, Michael},
      booktitle = {ICML (International Conference on Machine Learning)},
      year = {2019}
    }
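
    For intuition, here is a minimal NumPy sketch of band-limiting for 1D convolution: both the signal and filter spectra are truncated to the lowest `keep` frequency coefficients before pointwise multiplication, which is the source of the compute and memory savings. The sizes are illustrative, and the paper's actual implementation operates on 2D convolutional layers.

    import numpy as np

    def band_limited_conv1d(signal, filt, keep):
        """Circular 1D convolution with both spectra truncated to `keep` bins."""
        n = len(signal)
        S = np.fft.rfft(signal)
        F = np.fft.rfft(filt, n=n)       # zero-pad filter to signal length
        S[keep:] = 0.0                   # discard high-frequency coefficients
        F[keep:] = 0.0
        return np.fft.irfft(S * F, n=n)  # convolution theorem

    x = np.random.randn(64)
    w = np.random.randn(5)
    y = band_limited_conv1d(x, w, keep=16)  # keep 16 of the 33 rfft bins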
    
  4. Columnstore and B+ Tree - Are Hybrid Physical Designs Important?
    Adam Dziedzic, Jingjing Wang, Sudipto Das, Bolin Ding, Vivek R. Narasayya, Manoj Syamala
    In SIGMOD (ACM Special Interest Group on Management of Data) 2018

    Paper Slides

    Commercial DBMSs, such as Microsoft SQL Server, cater to diverse workloads including transaction processing, decision support, and operational analytics. They also support variety in physical design structures such as B+ tree and columnstore. The benefits of B+ tree for OLTP workloads and columnstore for decision support workloads are well-understood. However, the importance of hybrid physical designs, consisting of both columnstore and B+ tree indexes on the same database, is not well-studied — a focus of this paper. We first quantify the trade-offs using carefully-crafted micro-benchmarks. This micro-benchmarking indicates that hybrid physical designs can result in orders of magnitude better performance depending on the workload. For complex real-world applications, choosing an appropriate combination of columnstore and B+ tree indexes for a database workload is challenging. We extend the Database Engine Tuning Advisor for Microsoft SQL Server to recommend a suitable combination of B+ tree and columnstore indexes for a given workload. Through extensive experiments using industry-standard benchmarks and several real-world customer workloads, we quantify how a physical design tool capable of recommending hybrid physical designs can result in orders of magnitude better execution costs compared to approaches that rely either on columnstore-only or B+ tree-only designs.
    @inproceedings{dziedzic2018index,
      title = {Columnstore and B+ Tree - Are Hybrid Physical Designs Important?},
      author = {Dziedzic, Adam and Wang, Jingjing and Das, Sudipto and Ding, Bolin and Narasayya, Vivek R. and Syamala, Manoj},
      booktitle = {SIGMOD (ACM Special Interest Group on Management of Data)},
      year = {2018}
    }
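
    To make the notion of a hybrid design concrete, the sketch below creates both index types on a single (made-up) table from Python via pyodbc. The T-SQL statements are standard SQL Server syntax, but the table, columns, and DSN are hypothetical; the paper's tool recommends such combinations automatically rather than requiring them by hand.

    import pyodbc

    HYBRID_DESIGN = [
        # B+ tree index: selective point lookups and range seeks (OLTP).
        "CREATE INDEX ix_orders_id ON dbo.Orders (OrderId);",
        # Columnstore index: compressed, batch-mode scans (analytics).
        "CREATE NONCLUSTERED COLUMNSTORE INDEX cs_orders "
        "ON dbo.Orders (OrderDate, Quantity, Price);",
    ]

    conn = pyodbc.connect("DSN=sqlserver", autocommit=True)  # placeholder DSN
    cursor = conn.cursor()
    for statement in HYBRID_DESIGN:
        cursor.execute(statement)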
    
  5. Demonstrating the BigDAWG Polystore System for Ocean Metagenomics Analysis
    Tim Mattson, Vijay Gadepally, Zuohao She, Adam Dziedzic, Jeff Parkhurst
    In CIDR (Conference on Innovative Data Systems Research) 2017

    Paper

    In most Big Data applications, the data is heterogeneous. As we have been arguing in a series of papers, storage engines should be well suited to the data they hold. Therefore, a system supporting Big Data applications should be able to expose multiple storage engines through a single interface. We call such systems polystore systems. Our reference implementation of the polystore concept is called BigDAWG (short for the Big Data Analytics Working Group). In this demonstration, we will show the BigDAWG system and a number of polystore applications built to help ocean metagenomics researchers handle their heterogeneous Big Data.
    @inproceedings{mattson2017demonstrating,
      title = {Demonstrating the BigDAWG Polystore System for Ocean Metagenomics Analysis},
      author = {Mattson, Tim and Gadepally, Vijay and She, Zuohao and Dziedzic, Adam and Parkhurst, Jeff},
      booktitle = {CIDR (Conference on Innovative Data Systems Research)},
      year = {2017}
    }
    
  6. DBMS Data Loading: An Analysis on Modern Hardware
    Adam Dziedzic, Manos Karpathiotakis, Ioannis Alagiannis, Raja Appuswamy, Anastasia Ailamaki
    In ADMS (Accelerating analytics and Data Management Systems) 2016

    Paper Slides

    Data loading has traditionally been considered a one-time deal - an offline process out of the critical path of query execution. The architecture of DBMSs is aligned with this assumption. Nevertheless, the rate at which data is produced and gathered nowadays has nullified the one-off assumption, and has turned data loading into a major bottleneck of the data analysis pipeline. This paper analyzes the behavior of modern DBMSs in order to quantify their ability to fully exploit multicore processors and modern storage hardware during data loading. We examine multiple state-of-the-art DBMSs, a variety of hardware configurations, and a combination of synthetic and real-world datasets to identify bottlenecks in the data loading process and to provide guidelines on how to accelerate data loading. Our findings show that modern DBMSs are unable to saturate the available hardware resources. We therefore identify opportunities to accelerate data loading.
    @inproceedings{dziedzic2016dbms,
      title = {DBMS Data Loading: An Analysis on Modern Hardware},
      author = {Dziedzic, Adam and Karpathiotakis, Manos and Alagiannis, Ioannis and Appuswamy, Raja and Ailamaki, Anastasia},
      booktitle = {ADMS (Accelerating analytics and Data Management Systems)},
      year = {2016}
    }
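
    A minimal sketch of the kind of measurement behind this study: timing a bulk CSV load into PostgreSQL via COPY while varying the number of concurrent loading sessions. The connection string, table name, and file paths are placeholders; `copy_expert` is the real psycopg2 API.

    import time
    from concurrent.futures import ThreadPoolExecutor
    import psycopg2

    def load_chunk(path):
        conn = psycopg2.connect("dbname=bench")  # placeholder connection
        with conn, conn.cursor() as cur, open(path) as f:
            cur.copy_expert("COPY lineitem FROM STDIN WITH (FORMAT csv)", f)
        conn.close()

    def timed_parallel_load(paths, workers):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(load_chunk, paths))
        return time.perf_counter() - start  # wall-clock seconds for the load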
    
  7. Data Transformation and Migration in Polystores
    Adam Dziedzic, Aaron Elmore, Michael Stonebraker
    In HPEC (IEEE High Performance Extreme Computing) 2016

    Paper Poster Slides

    Ever-increasing data sizes and new requirements in data processing have fostered the development of many new database systems. The result is that many data-intensive applications are underpinned by different engines. To enable data mobility, data must be transferred between systems easily and efficiently. We analyze the state of the art in data migration and outline research opportunities for rapid data transfer. Our experiments explore data migration between a diverse set of databases, including PostgreSQL, SciDB, S-Store, and Accumulo. Each of these systems excels at specific application requirements, such as transactional processing, numerical computation, streaming data, and large-scale text processing. An efficient data migration tool is essential to take advantage of the superior processing offered by these specialized databases. Our goal is to build a data migration framework that takes advantage of recent advancements in hardware and software.
    @inproceedings{dziedzic2016transformation,
      title = {Data Transformation and Migration in Polystores},
      author = {Dziedzic, Adam and Elmore, Aaron and Stonebraker, Michael},
      booktitle = {HPEC (IEEE High Performance Extreme Computing)},
      year = {2016},
      organization = {IEEE}
    }
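
    As an illustration of the CSV-based migration path analyzed in the paper, the sketch below streams a PostgreSQL table out with COPY TO so it can be fed to another engine's bulk loader without materializing an intermediate file; the connection string and table name are placeholders.

    import io
    import psycopg2

    def export_table_csv(table):
        """Stream `table` out of PostgreSQL as CSV into an in-memory buffer."""
        buf = io.BytesIO()
        conn = psycopg2.connect("dbname=source")  # placeholder connection
        with conn, conn.cursor() as cur:
            cur.copy_expert(f"COPY {table} TO STDOUT WITH (FORMAT csv)", buf)
        conn.close()
        buf.seek(0)
        return buf  # hand this stream to the target system's bulk loader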
    
  8. BigDAWG: a Polystore for Diverse Interactive Applications
    Adam Dziedzic, Jennie Duggan, Aaron J. Elmore, Vijay Gadepally, Michael Stonebraker
    In DSIA (IEEE Viz Data Systems for Interactive Analysis) 2015

    Paper

    Interactive analytics requires low-latency queries in the presence of diverse, complex, and constantly evolving workloads. To address these challenges, we introduce a polystore, BigDAWG, that tightly couples diverse database systems, data models, and query languages through the use of semantically grouped Islands of Information. BigDAWG, which stands for the Big Data Working Group, seeks to provide location transparency by matching each workload to the right system using a black-box model of query and system performance. In this paper, we introduce BigDAWG as a solution for diverse web-based interactive applications and motivate the key challenges in building it. BigDAWG continues to evolve and, where applicable, we have noted the current status of its implementation.
    @inproceedings{dziedzic2015bigdawg,
      title = {BigDAWG: a Polystore for Diverse Interactive Applications},
      author = {Dziedzic, Adam and Duggan, Jennie and Elmore, Aaron J. and Gadepally, Vijay and Stonebraker, Michael},
      booktitle = {DSIA (IEEE Viz Data Systems for Interactive Analysis)},
      year = {2015}
    }
    

Experience

University of Toronto & Vector Institute

September 2020 - current: Postdoctoral researcher

Research on collaborative, private, and robust Machine Learning.

University of Chicago

July 2015 - August 2020: PhD Student

Research at the intersection of robust machine learning and database management systems (DBMSs).

Google

June - September 2017: PhD Software Engineering Intern on the Data Infrastructure and Analysis team

Research on graceful degradation and avoidance of performance cliffs in the F1 system.

Microsoft Research

March - June 2015: Research Intern in the Data Management, Exploration and Mining (DMX) group

Carried out research on hybrid physical designs for diverse workloads.

EPFL

October 2014 - June 2015: Research Intern

Research on data loading to diverse database management systems.

Warsaw University of Technology

October 2007 - September 2014: Bachelor's and Master's Student

Awarded the academic scholarship for the faculty's best students (based on GPA).

Barclays Investment Bank

June - August 2013: Intern Analyst

Created a system for validating and suggesting underlyings for complex financial products.

CERN

April - December 2012: Technical Student at IT Department

Designed a system to store information on the configuration and management of devices in the computer center.

Mobile Startup

March 2012: Udarnik

Worked on an application supporting social interaction around music.

Technical University of Denmark

August 2010 - January 2011: Erasmus Student

Coursework: applied statistics, Web 2.0 and mobile interactions, spatial databases, and logic programming.

Tekten

July 2010: Designer and Software Engineer

Designed a database and developed an application for a telecom company in Java and PL/SQL.

Torn

July - September 2009: Software Engineer

Worked on a financial and accounting system project in Java and Oracle 10g.

Projects

Collaborative Learning in ML

Confidential and Private Collaborative (CaPC) learning is the first method to provably achieve both confidentiality and privacy in a collaborative setting, using techniques from the cryptography and differential privacy literature.

Paper Slides Talk BibTeX
Band-limited Training and Inference for Convolutional Neural Networks

The convolutional layers are core building blocks of neural network architectures. In general, a convolutional filter applies to the entire frequency spectrum of the input data. We explore artificially constraining the frequency spectra of these filters and data, called band-limiting, during training. The frequency domain constraints apply to both the feed-forward and back-propagation steps. Experimentally, we observe that Convolutional Neural Networks (CNNs) are resilient to this compression scheme, and the results suggest that CNNs learn to leverage lower-frequency components. In particular, we found that: (1) band-limited training can effectively control resource usage (GPU and memory); (2) models trained with band-limited layers retain high prediction accuracy; and (3) unlike other compression schemes, band-limiting requires no modification to existing training algorithms or neural network architectures.

Paper Slides Talk BibTeX
Auto-recommendation of hybrid physical designs

We extend the Database Engine Tuning Advisor for Microsoft SQL Server to recommend a suitable combination of B+ tree and columnstore indexes for a given workload. Through extensive experiments using industry-standard benchmarks and several real-world customer workloads, we quantify how a physical design tool capable of recommending hybrid physical designs can result in orders of magnitude better execution costs compared to approaches that rely either on columnstore-only or B+ tree-only designs.

Paper Slides BibTeX
BigDAWG

An open source project from researchers within the Intel Science and Technology Center for Big Data (ISTC). BigDAWG is a reference implementation of a polystore database. A polystore system is any database management system (DBMS) that is built on top of multiple, heterogeneous, integrated storage engines. I worked on the scaffolding of the system and then implemented a cast operator to move data between diverse DBMSs.

Paper Slides BibTeX
Data Loading

We built an automated testing infrastructure to benchmark the loading performance of several commercial and open-source databases, performed an in-depth analysis to identify bottlenecks in the data loading process, and investigated novel techniques to accelerate DBMS data loading.

Paper Slides BibTeX

Contact

Adam Dziedzic

Postdoctoral Fellow at the Vector Institute and the University of Toronto.

Office: Vector Institute, Toronto

The best way to contact me is through email.

© 2022 Adam Dziedzic