Publications

An up-to-date list is available on Google Scholar

2024

  1. Memorization in Self-Supervised Learning Improves Downstream Generalization
    Wenhao Wang, Muhammad Ahmad Kaleem, Adam Dziedzic, Michael Backes, Nicolas Papernot, Franziska Boenisch
    In The Twelfth International Conference on Learning Representations (ICLR) 2024

    Paper Code

    Self-supervised learning (SSL) has recently received significant attention due to its ability to train high-performance encoders purely on unlabeled data, often scraped from the internet. This data can still be sensitive, and empirical evidence suggests that SSL encoders memorize private information of their training data and can disclose it at inference time. Since existing theoretical definitions of memorization from supervised learning rely on labels, they do not transfer to SSL. To address this gap, we propose SSLMem, a framework for defining memorization within SSL. Our definition compares the difference in alignment of representations for data points and their augmented views returned by both encoders that were trained on these data points and encoders that were not. Through comprehensive empirical analysis on diverse encoder architectures and datasets, we highlight that even though SSL relies on large datasets and strong augmentations (both known in supervised learning as regularization techniques that reduce overfitting), significant fractions of training data points still experience high memorization. Through our empirical results, we show that this memorization is essential for encoders to achieve higher generalization performance on different downstream tasks.
    @inproceedings{wang2024memorization,
      title = {Memorization in Self-Supervised Learning Improves Downstream Generalization},
      author = {Wang, Wenhao and Kaleem, Muhammad Ahmad and Dziedzic, Adam and Backes, Michael and Papernot, Nicolas and Boenisch, Franziska},
      booktitle = {The Twelfth International Conference on Learning Representations (ICLR)},
      year = {2024}
    }
    
  2. Decentralised, Collaborative, and Privacy-preserving Machine Learning for Multi-Hospital Data
    Congyu Fang, Adam Dziedzic, Lin Zhang, Laura Oliva, Amol Verma, Fahad Razak, Nicolas Papernot, Bo Wang
    In eBioMedicine 2024

    Paper

    Machine Learning (ML) has demonstrated great potential in medical data analysis. Large datasets collected from diverse sources and settings are essential for ML models in healthcare to achieve better accuracy and generalizability. Sharing data across different healthcare institutions is challenging because of complex and varying privacy and regulatory requirements. Hence, it is hard but crucial to allow multiple parties to collaboratively train an ML model leveraging the private datasets available at each party without the need for direct sharing of those datasets or compromising the privacy of the datasets through collaboration. In this paper, we address this challenge by proposing Decentralized, Collaborative, and Privacy-preserving ML for Multi-Hospital Data (DeCaPH). It offers the following key benefits: (1) it allows different parties to collaboratively train an ML model without transferring their private datasets; (2) it safeguards patient privacy by limiting the potential privacy leakage arising from any contents shared across the parties during the training process; and (3) it facilitates the ML model training without relying on a centralized server. We demonstrate the generalizability and power of DeCaPH on three distinct tasks using real-world distributed medical datasets: patient mortality prediction using electronic health records, cell-type classification using single-cell human genomes, and pathology identification using chest radiology images. We demonstrate that the ML models trained with the DeCaPH framework have an improved utility-privacy trade-off, showing that it enables the models to have good performance while preserving the privacy of the training data points. In addition, the ML models trained with the DeCaPH framework in general outperform those trained solely with the private datasets from individual parties, showing that DeCaPH enhances the model generalizability.
    @article{fang2024collaborative,
      title = {Decentralised, Collaborative, and Privacy-preserving Machine Learning for Multi-Hospital Data},
      author = {Fang, Congyu and Dziedzic, Adam and Zhang, Lin and Oliva, Laura and Verma, Amol and Razak, Fahad and Papernot, Nicolas and Wang, Bo},
      journal = {eBioMedicine},
      year = {2024}
    }
    
  3. Open LLMs are Necessary for Current Private Adaptations and Outperform their Closed Alternatives
    Vincent Hanke, Tom Blanchard, Franziska Boenisch, Iyiola Emmanuel Olatunji, Michael Backes, Adam Dziedzic
    In Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS) 2024

    Paper

    While open Large Language Models (LLMs) have made significant progress, they still fall short of matching the performance of their closed, proprietary counterparts, making the latter attractive even for use on highly private data. Recently, various new methods have been proposed to adapt closed LLMs to private data without leaking private information to third parties and/or the LLM provider. In this work, we analyze the privacy protection and performance of the four most recent methods for private adaptation of closed LLMs. By examining their threat models and thoroughly comparing their performance under different privacy levels according to differential privacy (DP), various LLM architectures, and multiple datasets for classification and generation tasks, we find that: (1) all the methods leak query data, i.e., the (potentially sensitive) user data that is queried at inference time, to the LLM provider, (2) three out of four methods also leak large fractions of private training data to the LLM provider while the method that protects private data requires a local open LLM, (3) all the methods exhibit lower performance compared to three private gradient-based adaptation methods for local open LLMs, and (4) the private adaptation methods for closed LLMs incur higher monetary training and query costs than running the alternative methods on local open LLMs. This yields the conclusion that, to achieve truly privacy-preserving LLM adaptations that yield high performance and more privacy at lower costs, taking into account current methods and models, one should use open LLMs.
    @inproceedings{hanke2024openLLMs,
      title = {Open LLMs are Necessary for Current Private Adaptations and Outperform their Closed Alternatives},
      author = {Hanke, Vincent and Blanchard, Tom and Boenisch, Franziska and Olatunji, Iyiola Emmanuel and Backes, Michael and Dziedzic, Adam},
      year = {2024},
      booktitle = {Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS)}
    }
    
  4. LLM Dataset Inference: Did you train on my dataset?
    Pratyush Maini, Hengrui Jia, Nicolas Papernot, Adam Dziedzic
    In Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS) 2024

    Paper

    The proliferation of large language models (LLMs) in the real world has come with a rise in copyright cases against companies for training their models on unlicensed data from the internet. Recent works have presented methods to identify if individual text sequences were members of the model’s training data, known as membership inference attacks (MIAs). We demonstrate that the apparent success of these MIAs is confounded by selecting non-members (text sequences not used for training) belonging to a different distribution from the members (e.g., temporally shifted recent Wikipedia articles compared with ones used to train the model). This distribution shift makes membership inference appear successful. However, most MIA methods perform no better than random guessing when discriminating between members and non-members from the same distribution (e.g., in this case, the same period of time). Even when MIAs work, we find that different MIAs succeed at inferring membership of samples from different distributions. Instead, we propose a new dataset inference method to accurately identify the datasets used to train large language models. This paradigm sits realistically in the modern-day copyright landscape, where authors claim that an LLM is trained over multiple documents (such as a book) written by them, rather than one particular paragraph. While dataset inference shares many of the challenges of membership inference, we solve it by selectively combining the MIAs that provide positive signal for a given distribution, and aggregating them to perform a statistical test on a given dataset. Our approach successfully distinguishes the train and test sets of different subsets of the Pile with statistically significant p-values < 0.1, without any false positives.
    @inproceedings{maini2024LLMDatasetInference,
      title = {LLM Dataset Inference: Did you train on my dataset?},
      author = {Maini, Pratyush and Jia, Hengrui and Papernot, Nicolas and Dziedzic, Adam},
      year = {2024},
      booktitle = {Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS)}
    }
    
  5. Localizing Memorization in SSL Vision Encoders
    Wenhao Wang, Adam Dziedzic, Michael Backes, Franziska Boenisch
    In Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS) 2024

    Paper

    Recent work on studying memorization in self-supervised learning (SSL) suggests that even though SSL encoders are trained on millions of images, they still memorize individual data points. While effort has been put into characterizing the memorized data and linking encoder memorization to downstream utility, little is known about where the memorization happens inside SSL encoders. To close this gap, we propose two metrics for localizing memorization in SSL encoders on a per-layer (layermem) and per-unit basis (unitmem). Our localization methods are independent of the downstream task, do not require any label information, and can be performed in a forward pass. By localizing memorization in various encoder architectures (convolutional and transformer-based) trained on diverse datasets with contrastive and non-contrastive SSL frameworks, we find that (1) while SSL memorization increases with layer depth, highly memorizing units are distributed across the entire encoder, (2) a significant fraction of units in SSL encoders experiences surprisingly high memorization of individual data points, which is in contrast to models trained under supervision, (3) atypical (or outlier) data points cause much higher layer and unit memorization than standard data points, and (4) in vision transformers, most memorization happens in the fully-connected layers. Finally, we show that localizing memorization in SSL has the potential to improve fine-tuning and to inform pruning strategies.
    @inproceedings{wang2024LocalizeMemorizationSSL,
      title = {Localizing Memorization in SSL Vision Encoders},
      author = {Wang, Wenhao and Dziedzic, Adam and Backes, Michael and Boenisch, Franziska},
      year = {2024},
      booktitle = {Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS)}
    }
    
  6. Finding NeMo: Localizing Neurons Responsible For Memorization in Diffusion Models
    Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, Franziska Boenisch
    In Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS) 2024

    Paper

    Diffusion models (DMs) produce very detailed and high-quality images. Their power results from extensive training on large amounts of data, usually scraped from the internet without proper attribution or consent from content creators. Unfortunately, this practice raises privacy and intellectual property concerns, as DMs can memorize and later reproduce their potentially sensitive or copyrighted training images at inference time. Prior efforts prevent this issue by either changing the input to the diffusion process, thereby preventing the DM from generating memorized samples during inference, or removing the memorized data from training altogether. While those are viable solutions when the DM is developed and deployed in a secure and constantly monitored environment, they hold the risk of adversaries circumventing the safeguards and are not effective when the DM itself is publicly released. To solve the problem, we introduce NeMo, the first method to localize memorization of individual data samples down to the level of neurons in DMs’ cross-attention layers. Through our experiments, we make the intriguing finding that in many cases, single neurons are responsible for memorizing particular training samples. By deactivating these memorization neurons, we can avoid the replication of training data at inference time, increase the diversity in the generated outputs, and mitigate the leakage of private and copyrighted data. In this way, our NeMo contributes to a more responsible deployment of DMs.
    @inproceedings{hintersdorf2024MemorizationDiffusionModels,
      title = {Finding NeMo: Localizing Neurons Responsible For Memorization in Diffusion Models},
      author = {Hintersdorf, Dominik and Struppek, Lukas and Kersting, Kristian and Dziedzic, Adam and Boenisch, Franziska},
      year = {2024},
      booktitle = {Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS)}
    }
    

2023

  1. Individualized PATE: Differentially Private Machine Learning with Individual Privacy Guarantees
    Franziska Boenisch, Christopher Mühl, Roy Rinberg, Jannis Ihrig, Adam Dziedzic
    In Privacy Enhancing Technologies Symposium (PETS) 2023

    Paper

    Applying machine learning (ML) to sensitive domains requires privacy protection of the underlying training data through formal privacy frameworks, such as differential privacy (DP). Yet, usually, the privacy of the training data comes at the cost of the resulting ML models’ utility. One reason for this is that DP uses one uniform privacy budget epsilon for all training data points, which has to align with the strictest privacy requirement encountered among all data holders. In practice, different data holders have different privacy requirements and data points of data holders with lower requirements can contribute more information to the training process of the ML models. To account for this need, we propose two novel methods based on the Private Aggregation of Teacher Ensembles (PATE) framework to support the training of ML models with individualized privacy guarantees. We formally describe the methods, provide a theoretical analysis of their privacy bounds, and experimentally evaluate their effect on the final model’s utility using the MNIST, SVHN, and Adult income datasets. Our empirical results show that the individualized privacy methods yield ML models of higher accuracy than the non-individualized baseline. Thereby, we improve the privacy-utility trade-off in scenarios in which different data holders consent to contribute their sensitive data at different individual privacy levels.
    @inproceedings{pate2023pets,
      author = {Boenisch, Franziska and Mühl, Christopher and Rinberg, Roy and Ihrig, Jannis and Dziedzic, Adam},
      title = {Individualized PATE: Differentially Private Machine Learning with Individual Privacy Guarantees},
      booktitle = {Privacy Enhancing Technologies Symposium (PETS)},
      year = {2023}
    }
    
  2. Private Multi-Winner Voting for Machine Learning
    Adam Dziedzic, Christopher A Choquette-Choo, Natalie Dullerud, Vinith Menon Suriyakumar, Ali Shahin Shamsabadi, Muhammad Ahmad Kaleem, Somesh Jha, Nicolas Papernot, Xiao Wang
    In Privacy Enhancing Technologies Symposium (PETS) 2023

    Paper Slides Video

    Private multi-winner voting is the task of revealing k-hot binary vectors satisfying a bounded differential privacy (DP) guarantee. This task has been understudied in machine learning literature despite its prevalence in many domains such as healthcare. We propose three new DP multi-winner mechanisms: Binary, τ, and Powerset voting. Binary voting operates independently per label through composition. τ voting bounds votes optimally in their ℓ2 norm for tight data-independent guarantees. Powerset voting operates over the entire binary vector by viewing the possible outcomes as a power set. Our theoretical and empirical analysis shows that Binary voting can be a competitive mechanism on many tasks unless there are strong correlations between labels, in which case Powerset voting outperforms it. We use our mechanisms to enable privacy-preserving multi-label learning in the central setting by extending the canonical single-label technique: PATE. We find that our techniques outperform current state-of-the-art approaches on large, real-world healthcare data and standard multi-label benchmarks. We further enable multi-label confidential and private collaborative (CaPC) learning and show that model performance can be significantly improved in the multi-site setting.
    @inproceedings{multilabel2023pets,
      title = {Private Multi-Winner Voting for Machine Learning},
      author = {Dziedzic, Adam and Choquette-Choo, Christopher A and Dullerud, Natalie and Suriyakumar, Vinith Menon and Shamsabadi, Ali Shahin and Kaleem, Muhammad Ahmad and Jha, Somesh and Papernot, Nicolas and Wang, Xiao},
      booktitle = {Privacy Enhancing Technologies Symposium (PETS)},
      year = {2023}
    }
    
  3. Bucks for Buckets (B4B): Active Defenses Against Stealing Encoders
    Jan Dubiński, Stanisław Pawlak, Franziska Boenisch, Tomasz Trzcinski, Adam Dziedzic
    In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023

    Paper Poster Slides Video Code

    Machine Learning as a Service (MLaaS) APIs provide ready-to-use and high-utility encoders that generate vector representations for given inputs. Since these encoders are very costly to train, they become lucrative targets for model stealing attacks during which an adversary leverages query access to the API to replicate the encoder locally at a fraction of the original training costs. We propose *Bucks for Buckets (B4B)*, the first *active defense* that prevents stealing while the attack is happening without degrading representation quality for legitimate API users. Our defense relies on the observation that the representations returned to adversaries who try to steal the encoder’s functionality cover a significantly larger fraction of the embedding space than representations of legitimate users who utilize the encoder to solve a particular downstream task. B4B leverages this to adaptively adjust the utility of the returned representations according to a user’s coverage of the embedding space. To prevent adaptive adversaries from eluding our defense by simply creating multiple user accounts (sybils), B4B also individually transforms each user’s representations. This prevents the adversary from directly aggregating representations over multiple accounts to create their stolen encoder copy. Our active defense opens a new path towards securely sharing and democratizing encoders over public APIs.
    @inproceedings{dubinski2023bucks,
      title = {Bucks for Buckets (B4B): Active Defenses Against Stealing Encoders},
      author = {Dubiński, Jan and Pawlak, Stanisław and Boenisch, Franziska and Trzcinski, Tomasz and Dziedzic, Adam},
      booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
      year = {2023}
    }
    
  4. Robust and Actively Secure Serverless Collaborative Learning
    Nicholas Franzese, Adam Dziedzic, Christopher A. Choquette-Choo, Mark R. Thomas, Muhammad Ahmad Kaleem, Stephan Rabanser, Congyu Fang, Somesh Jha, Nicolas Papernot, Xiao Wang
    In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023

    Paper Poster

    Collaborative machine learning (ML) is widely used to enable institutions to learn better models from distributed data. While collaborative approaches to learning intuitively protect user data, they remain vulnerable to either the server, the clients, or both, deviating from the protocol. Indeed, because the protocol is asymmetric, a malicious server can abuse its power to reconstruct client data points. Conversely, malicious clients can corrupt learning with malicious updates. Thus, both clients and servers require a guarantee when the other cannot be trusted to fully cooperate. In this work, we propose a peer-to-peer (P2P) learning scheme that is secure against malicious servers and robust to malicious clients. Our core contribution is a generic framework that transforms any (compatible) algorithm for robust aggregation of model updates to the setting where servers and clients can act maliciously. Finally, we demonstrate the computational efficiency of our approach even with 1-million parameter models trained by 100s of peers on standard datasets.
    @inproceedings{franzeses2023p2pml,
      title = {Robust and Actively Secure Serverless Collaborative Learning},
      author = {Franzese, Nicholas and Dziedzic, Adam and Choquette-Choo, Christopher A. and Thomas, Mark R. and Kaleem, Muhammad Ahmad and Rabanser, Stephan and Fang, Congyu and Jha, Somesh and Papernot, Nicolas and Wang, Xiao},
      booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
      year = {2023}
    }
    
  5. Flocks of Stochastic Parrots: Differentially Private Prompt Learning for Large Language Models
    Haonan Duan, Adam Dziedzic, Nicolas Papernot, Franziska Boenisch
    In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023

    Paper Slides Video Code

    Large language models (LLMs) are excellent in-context learners. However, the sensitivity of data contained in prompts raises privacy concerns. Our work first shows that these concerns are valid: we instantiate a simple but highly effective membership inference attack against the data used to prompt LLMs. To address this vulnerability, one could forego prompting and resort to fine-tuning LLMs with known algorithms for private gradient descent. However, this comes at the expense of the practicality and efficiency offered by prompting. Therefore, we propose to privately learn to prompt. We first show that soft prompts can be obtained privately through gradient descent on downstream data. However, this is not the case for discrete prompts. Thus, we orchestrate a noisy vote among an ensemble of LLMs presented with different prompts, i.e., a flock of stochastic parrots. The vote privately transfers the flock’s knowledge into a single public prompt. We show that LLMs prompted with our private algorithms closely match the non-private baselines. For example, using GPT3 as the base model, we achieve a downstream accuracy of 92.7% on the sst2 dataset with strong differential privacy guarantees vs. 95.2% for the non-private baseline. Through our experiments, we also show that our prompt-based approach is easily deployed with existing commercial APIs.
    @inproceedings{duan2023flocks,
      title = {Flocks of Stochastic Parrots: Differentially Private Prompt Learning for Large Language Models},
      author = {Duan, Haonan and Dziedzic, Adam and Papernot, Nicolas and Boenisch, Franziska},
      booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
      year = {2023}
    }
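    The noisy vote described in the abstract above can be illustrated with a generic noisy-argmax aggregation. This is a minimal sketch and not the authors' implementation: the function name `noisy_vote`, the Gaussian noise, and the value of `sigma` are assumptions made for illustration, and the privacy guarantee that results depends on a noise calibration and accounting step that is not shown here.

```python
import random


def noisy_vote(votes, num_classes, sigma, rng):
    """Aggregate per-model label votes and release the noisy winner.

    votes: one predicted label (int in [0, num_classes)) per ensemble member.
    sigma: standard deviation of the Gaussian noise (illustrative assumption).
    rng:   a random.Random instance, passed in for reproducibility.
    """
    # Histogram of votes per label.
    counts = [0.0] * num_classes
    for v in votes:
        counts[v] += 1.0
    # Perturb every count independently, then release only the argmax.
    noisy = [c + rng.gauss(0.0, sigma) for c in counts]
    return max(range(num_classes), key=lambda k: noisy[k])


# Hypothetical votes from eight differently prompted models on one query.
votes = [1, 1, 1, 0, 1, 1, 2, 1]
released = noisy_vote(votes, num_classes=3, sigma=1.0, rng=random.Random(0))
```

    Only the winning label leaves the aggregator, so individual votes (and the prompts behind them) are never exposed directly; larger `sigma` gives more noise at the cost of more frequent wrong winners.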
    
  6. Have it your way: Individualized Privacy Assignment for DP-SGD
    Franziska Boenisch, Christopher Mühl, Adam Dziedzic, Roy Rinberg, Nicolas Papernot
    In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023

    Paper

    When training a machine learning model with differential privacy, one sets a privacy budget. This budget represents a maximal privacy violation that any user is willing to face by contributing their data to the training set. We argue that this approach is limited because different users may have different privacy expectations. Thus, setting a uniform privacy budget across all points may be overly conservative for some users or, conversely, not sufficiently protective for others. In this paper, we capture these preferences through individualized privacy budgets. To demonstrate their practicality, we introduce a variant of Differentially Private Stochastic Gradient Descent (DP-SGD) which supports such individualized budgets. DP-SGD is the canonical approach to training models with differential privacy. We modify its data sampling and gradient noising mechanisms to arrive at our approach, which we call Individualized DP-SGD (IDP-SGD). Because IDP-SGD provides privacy guarantees tailored to the preferences of individual users and their data points, we find it empirically improves privacy-utility trade-offs.
    @inproceedings{boenisch2023idpsgd,
      title = {Have it your way: Individualized Privacy Assignment for DP-SGD},
      author = {Boenisch, Franziska and Mühl, Christopher and Dziedzic, Adam and Rinberg, Roy and Papernot, Nicolas},
      year = {2023},
      booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
      eprint = {2303.17046},
      archiveprefix = {arXiv},
      primaryclass = {cs.LG}
    }
    
  7. On the privacy risk of in-context learning
    Haonan Duan, Adam Dziedzic, Mohammad Yaghini, Nicolas Papernot, Franziska Boenisch
    In The 61st Annual Meeting Of The Association For Computational Linguistics 2023

    Paper

    Large language models (LLMs) are excellent few-shot learners. They can perform a wide variety of tasks purely based on natural language prompts provided to them. These prompts contain data of a specific downstream task, often the private dataset of a party, e.g., a company that wants to leverage the LLM for their purposes. We show that deploying prompted models presents a significant privacy risk for the data used within the prompt by instantiating a highly effective membership inference attack. We also observe that the privacy risk of prompted models exceeds that of fine-tuned models at the same utility levels. After identifying the models' sensitivity to their prompts, in the form of a significantly higher prediction confidence on the prompted data, as a cause for the increased risk, we propose ensembling as a mitigation strategy. By aggregating over multiple different versions of a prompted model, membership inference risk can be decreased.
    @inproceedings{duan2023privacyICL,
      title = {On the privacy risk of in-context learning},
      author = {Duan, Haonan and Dziedzic, Adam and Yaghini, Mohammad and Papernot, Nicolas and Boenisch, Franziska},
      booktitle = {The 61st Annual Meeting Of The Association For Computational Linguistics},
      year = {2023}
    }
    

2022

  1. Increasing the Cost of Model Extraction with Calibrated Proof of Work
    Adam Dziedzic, Muhammad Ahmad Kaleem, Yu Shen Lu, Nicolas Papernot
    In ICLR (International Conference on Learning Representations) [SPOTLIGHT] 2022

    Paper Slides Video Code Blog Post

    In model extraction attacks, adversaries can steal a machine learning model exposed via a public API by repeatedly querying it and adjusting their own model based on obtained predictions. To prevent model stealing, existing defenses focus on detecting malicious queries, truncating, or distorting outputs, thus necessarily introducing a tradeoff between robustness and model utility for legitimate users. Instead, we propose to impede model extraction by requiring users to complete a proof-of-work before they can read the model’s predictions. This deters attackers by greatly increasing (even up to 100x) the computational effort needed to leverage query access for model extraction. Since we calibrate the effort required to complete the proof-of-work to each query, this only introduces a slight overhead for regular users (up to 2x). To achieve this, our calibration applies tools from differential privacy to measure the information revealed by a query. Our method requires no modification of the victim model and can be applied by machine learning practitioners to guard their publicly exposed models against being easily stolen.
    @inproceedings{pow2022iclr,
      title = {Increasing the Cost of Model Extraction with Calibrated Proof of Work},
      author = {Dziedzic, Adam and Kaleem, Muhammad Ahmad and Lu, Yu Shen and Papernot, Nicolas},
      booktitle = {ICLR (International Conference on Learning Representations) [SPOTLIGHT]},
      year = {2022}
    }
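    The idea of gating predictions behind a per-query proof-of-work can be sketched with a generic hashcash-style puzzle. This is an illustration under stated assumptions, not the paper's implementation: the SHA-256 puzzle, the names `solve`, `verify`, and `difficulty_for`, and the linear mapping from a leakage score to difficulty bits are all hypothetical. The abstract only states that the effort is calibrated per query using tools from differential privacy.

```python
import hashlib


def _digest_value(challenge: bytes, nonce: int) -> int:
    """Interpret SHA-256(challenge || nonce) as a 256-bit integer."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")


def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Search for a nonce whose digest has `difficulty_bits` leading zero bits.

    Expected work doubles with every additional bit.
    """
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while _digest_value(challenge, nonce) >= target:
        nonce += 1
    return nonce


def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Check a submitted nonce with a single hash evaluation."""
    return _digest_value(challenge, nonce) < (1 << (256 - difficulty_bits))


def difficulty_for(leakage_score: float, base_bits: int = 4, max_bits: int = 20) -> int:
    """Hypothetical calibration: more revealing queries get harder puzzles."""
    return min(max_bits, base_bits + int(leakage_score))


# A client solves the puzzle before the server releases the prediction.
bits = difficulty_for(leakage_score=4.0)
nonce = solve(b"query-1", bits)
accepted = verify(b"query-1", nonce, bits)
```

    Solving takes exponentially more work as the difficulty grows, while verification costs the server one hash, which is why such puzzles can add little overhead for low-leakage queries and much more for queries that look like extraction attempts.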
    
  2. On the Difficulty of Defending Self-Supervised Learning against Model Extraction
    Adam Dziedzic, Nikita Dhawan, Muhammad Ahmad Kaleem, Jonas Guan, Nicolas Papernot
    In ICML (International Conference on Machine Learning) 2022

    Paper

    Self-Supervised Learning (SSL) is an increasingly popular ML paradigm that trains models to transform complex inputs into representations without relying on explicit labels. These representations encode similarity structures that enable efficient learning of multiple downstream tasks. Recently, ML-as-a-Service providers have commenced offering trained SSL models over inference APIs, which transform user inputs into useful representations for a fee. However, the high cost involved to train these models and their exposure over APIs both make black-box extraction a realistic security threat. We thus explore model stealing attacks against SSL. Unlike traditional model extraction on classifiers that output labels, the victim models here output representations; these representations are of significantly higher dimensionality compared to the low-dimensional prediction scores output by classifiers. We construct several novel attacks and find that approaches that train directly on a victim’s stolen representations are query efficient and enable high accuracy for downstream models. We then show that existing defenses against model extraction are inadequate and not easily retrofitted to the specificities of SSL.
    @inproceedings{sslextractions2022icml,
      title = {On the Difficulty of Defending Self-Supervised Learning against Model Extraction},
      author = {Dziedzic, Adam and Dhawan, Nikita and Kaleem, Muhammad Ahmad and Guan, Jonas and Papernot, Nicolas},
      booktitle = {ICML (International Conference on Machine Learning)},
      year = {2022}
    }
    
  3. Dataset Inference for Self-Supervised Models
    Adam Dziedzic, Haonan Duan, Muhammad Ahmad Kaleem, Nikita Dhawan, Jonas Guan, Yannis Cattan, Franziska Boenisch, Nicolas Papernot
    In NeurIPS (Neural Information Processing Systems) 2022

    Paper Slides Video

    Self-supervised models are increasingly prevalent in machine learning (ML) since they reduce the need for expensively labeled data. Because of their versatility in downstream applications, they are increasingly used as a service exposed via public APIs. At the same time, these encoder models are particularly vulnerable to model stealing attacks due to the high dimensionality of vector representations they output. Yet, encoders remain undefended: existing mitigation strategies for stealing attacks focus on supervised learning. We introduce a new dataset inference defense, which uses the private training set of the victim encoder model to attribute its ownership in the event of stealing. The intuition is that the log-likelihood of an encoder’s output representations is higher on the victim’s training data than on test data if it is stolen from the victim, but not if it is independently trained. We compute this log-likelihood using density estimation models. As part of our evaluation, we also propose measuring the fidelity of stolen encoders and quantifying the effectiveness of the theft detection without involving downstream tasks; instead, we leverage mutual information and distance measurements. Our extensive empirical results in the vision domain demonstrate that dataset inference is a promising direction for defending self-supervised models against model stealing.
    @inproceedings{datasetinference2022neurips,
      title = {Dataset Inference for Self-Supervised Models},
      author = {Dziedzic, Adam and Duan, Haonan and Kaleem, Muhammad Ahmad and Dhawan, Nikita and Guan, Jonas and Cattan, Yannis and Boenisch, Franziska and Papernot, Nicolas},
      booktitle = {NeurIPS (Neural Information Processing Systems)},
      year = {2022}
    }
    

2021

  1. CaPC Learning: Confidential and Private Collaborative Learning
    Christopher A. Choquette-Choo, Natalie Dullerud, Adam Dziedzic, Yunxiang Zhang, Somesh Jha, Nicolas Papernot, Xiao Wang
    In ICLR (International Conference on Learning Representations) 2021

    Paper Slides Video Code Blog Post

    Machine learning benefits from large training datasets, which may not always be possible to collect by any single entity, especially when using privacy-sensitive data. In many contexts, such as healthcare and finance, separate parties may wish to collaborate and learn from each other’s data but are prevented from doing so due to privacy regulations. Some regulations prevent explicit sharing of data between parties by joining datasets in a central location (confidentiality). Others also limit implicit sharing of data, e.g., through model predictions (privacy). There is currently no method that enables machine learning in such a setting, where both confidentiality and privacy need to be preserved, to prevent both explicit and implicit sharing of data. Federated learning only provides confidentiality, not privacy, since gradients shared still contain private information. Differentially private learning assumes unreasonably large datasets. Furthermore, both of these learning paradigms produce a central model whose architecture was previously agreed upon by all parties rather than enabling collaborative learning where each party learns and improves their own local model. We introduce Confidential and Private Collaborative (CaPC) learning, the first method provably achieving both confidentiality and privacy in a collaborative setting. We leverage secure multi-party computation (MPC), homomorphic encryption (HE), and other techniques in combination with privately aggregated teacher models. We demonstrate how CaPC allows participants to collaborate without having to explicitly join their training sets or train a central model. Each party is able to improve the accuracy and fairness of their model, even in settings where each party has a model that performs well on their own dataset or when datasets are not IID and model architectures are heterogeneous across parties.
    @inproceedings{capc2021iclr,
      title = {CaPC Learning: Confidential and Private Collaborative Learning},
      author = {Choquette-Choo, Christopher A. and Dullerud, Natalie and Dziedzic, Adam and Zhang, Yunxiang and Jha, Somesh and Papernot, Nicolas and Wang, Xiao},
      booktitle = {ICLR (International Conference on Learning Representations)},
      year = {2021}
    }
    
  2. Preoperative paraspinal neck muscle characteristics predict early onset adjacent segment degeneration in anterior cervical fusion patients: A machine-learning modeling analysis
    Arnold Y. L. Wong, Garrett Harada, Remy Lee, Sapan D. Gandhi, Adam Dziedzic, Alejandro Espinoza-Orias, Mohamad Parnianpour, Philip K. Louie, Bryce Basques, Howard S. An, Dino Samartzis
    Journal of Orthopaedic Research 2021

    Paper

    Early onset adjacent segment degeneration (ASD) can be found within six months after anterior cervical discectomy and fusion (ACDF). Deficits in deep paraspinal neck muscles may be related to early onset ASD. This study aimed to determine whether the morphometry of preoperative deep neck muscles (multifidus and semispinalis cervicis) predicted early onset ASD in patients with ACDF. Thirty-two cases of early onset ASD after a two-level ACDF and 30 matched non-ASD cases were identified from a large-scale cohort. The preoperative total cross-sectional area (CSA) of bilateral deep neck muscles and the lean muscle CSAs from C3 to C7 levels were measured manually on T2-weighted magnetic resonance imaging. Paraspinal muscle CSA asymmetry at each level was calculated. A support vector machine (SVM) algorithm was used to identify demographic, radiographic, and/or muscle parameters that predicted proximal/distal ASD development. No significant between-group differences in demographic or preoperative radiographic data were noted (mean age: 52.4 ± 10.9 years). ACDFs comprised C3 to C5 (n = 9), C4 to C6 (n = 20), and C5 to C7 (n = 32) cases. Eighteen, eight, and six patients had proximal, distal, or both ASD, respectively. The SVM model achieved high accuracy (96.7%) and an area under the curve (AUC = 0.97) for predicting early onset ASD. Asymmetry of fat at C5 (coefficient: 0.06), and standardized measures of C7 lean (coefficient: 0.05) and total CSA measures (coefficient: 0.05) were the strongest predictors of early onset ASD. This is the first study to show that preoperative deep neck muscle CSA, composition, and asymmetry at C5 to C7 independently predicted postoperative early onset ASD in patients with ACDF. Paraspinal muscle assessments are recommended to identify high-risk patients for personalized intervention.
    @article{wong2021ML,
      author = {Wong, Arnold Y. L. and Harada, Garrett and Lee, Remy and Gandhi, Sapan D. and Dziedzic, Adam and Espinoza-Orias, Alejandro and Parnianpour, Mohamad and Louie, Philip K. and Basques, Bryce and An, Howard S. and Samartzis, Dino},
      title = {Preoperative paraspinal neck muscle characteristics predict early onset adjacent segment degeneration in anterior cervical fusion patients: A machine-learning modeling analysis},
      journal = {Journal of Orthopaedic Research},
      volume = {39},
      number = {8},
      pages = {1732--1744},
      keywords = {adjacent segment, cervical, degeneration, disc, disease, muscles, paraspinal, spine},
      doi = {10.1002/jor.24829},
      eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/jor.24829},
      year = {2021}
    }
    
  3. On the Exploitability of Audio Machine Learning Pipelines to Surreptitious Adversarial Examples
    Adelin Travers, Lorna Licollari, Guanghan Wang, Varun Chandrasekaran, Adam Dziedzic, David Lie, Nicolas Papernot
    2021 Preprint

    Paper

    Machine learning (ML) models are known to be vulnerable to adversarial examples. Applications of ML to voice biometrics authentication are no exception. Yet, the implications of audio adversarial examples on these real-world systems remain poorly understood given that most research targets limited defenders who can only listen to the audio samples. Conflating detectability of an attack with human perceptibility, research has focused on methods that aim to produce imperceptible adversarial examples which humans cannot distinguish from the corresponding benign samples. We argue that this perspective is coarse for two reasons: (1) imperceptibility is impossible to verify; it would require an experimental process that encompasses variations in listener training, equipment, volume, ear sensitivity, types of background noise, etc., and (2) it disregards pipeline-based detection clues that realistic defenders leverage. This results in adversarial examples that are ineffective in the presence of knowledgeable defenders. Thus, an adversary only needs an audio sample to be plausible to a human. We thus introduce surreptitious adversarial examples, a new class of attacks that evades both human and pipeline controls. In the white-box setting, we instantiate this class with a joint, multi-stage optimization attack. Using an Amazon Mechanical Turk user study, we show that this attack produces audio samples that are more surreptitious than previous attacks that aim solely for imperceptibility. Lastly, we show that surreptitious adversarial examples are challenging to develop in the black-box setting.
    @misc{travers2021exploitability,
      title = {On the Exploitability of Audio Machine Learning Pipelines to Surreptitious Adversarial Examples},
      author = {Travers, Adelin and Licollari, Lorna and Wang, Guanghan and Chandrasekaran, Varun and Dziedzic, Adam and Lie, David and Papernot, Nicolas},
      year = {2021},
      eprint = {2108.02010},
      archiveprefix = {arXiv},
      primaryclass = {cs.SD},
      journal = {preprint arXiv:2108.02010}
    }
    
  4. When the Curious Abandon Honesty: Federated Learning Is Not Private
    Franziska Boenisch, Adam Dziedzic, Roei Schuster, Ali Shahin Shamsabadi, Ilia Shumailov, Nicolas Papernot
    2021 Preprint

    Paper

    In federated learning (FL), data does not leave personal devices when they are jointly training a machine learning model. Instead, these devices share gradients with a central party (e.g., a company). Because data never "leaves" personal devices, FL is presented as privacy-preserving. Yet, recently it was shown that this protection is but a thin facade, as even a passive attacker observing gradients can reconstruct data of individual users. In this paper, we argue that prior work still largely underestimates the vulnerability of FL. This is because prior efforts exclusively consider passive attackers that are honest-but-curious. Instead, we introduce an active and dishonest attacker acting as the central party, who is able to modify the shared model’s weights before users compute model gradients. We call the modified weights "trap weights". Our active attacker is able to recover user data perfectly and at near zero costs: the attack requires no complex optimization objectives. Instead, it exploits inherent data leakage from model gradients and amplifies this effect by maliciously altering the weights of the shared model. These specificities enable our attack to scale to models trained with large mini-batches of data. Where attackers from prior work require hours to recover a single data point, our method needs milliseconds to capture the full mini-batch of data from both fully-connected and convolutional deep neural networks. Finally, we consider mitigations. We observe that current implementations of differential privacy (DP) in FL are flawed, as they explicitly trust the central party with the crucial task of adding DP noise, and thus provide no protection against a malicious central party. We also consider other defenses and explain why they are similarly inadequate. A significant redesign of FL is required for it to provide any meaningful form of data privacy to users.
    @misc{boenisch2021curious,
      title = {When the Curious Abandon Honesty: Federated Learning Is Not Private},
      author = {Boenisch, Franziska and Dziedzic, Adam and Schuster, Roei and Shamsabadi, Ali Shahin and Shumailov, Ilia and Papernot, Nicolas},
      year = {2021},
      eprint = {2112.02918},
      archiveprefix = {arXiv},
      primaryclass = {cs.LG},
      journal = {preprint arXiv:2112.02918}
    }
    
  5. Private AI Collaborative Research Institute: Vision, Challenges, and Opportunities
    Ahmad-Reza Sadeghi, Ferdinand Brasser, Markus Miettinen, Thien Duc Nguyen, Thomas Given-Wilson, Axel Legay, Murali Annavaram, Salman Avestimehr, Alexandra Dmitrienko, Farinaz Koushanfar, Buse Gul Atli, Florian Kerschbaum, Lachlan J. Gunn, N. Asokan, Matthias Schunter, Rosario Cammarota, Adam Dziedzic, Nicolas Papernot, Virginia Smith, Reza Shokri
    2021

    Paper

    This document outlines the research vision of the collaborative research center for Privacy-preserving Machine Learning. While federated machine learning starts to be deployed, its security and privacy implications are not well understood today. Our goal is to conduct research enabling the future of decentralized machine learning: (1) underpinning federated ML with robust privacy guarantees and efficient algorithms to achieve those guarantees; (2) exploring knowledge transfer and collaborative ML beyond federated machine learning, which suffers from a central controller as its root of trust; (3) exploring Graph Neural Networks and their privacy implications; and (4) ensuring robustness against malicious participants that may steal models or try to poison machine learning models during training. This research will then be validated in case studies and deployed in open source frameworks to allow further experimentation and deployment on a wider scale.
    @misc{IntelPrivateAIVision2021,
      title = {Private AI Collaborative Research Institute: Vision, Challenges, and Opportunities},
      author = {Sadeghi, Ahmad-Reza and Brasser, Ferdinand and Miettinen, Markus and Nguyen, Thien Duc and Given-Wilson, Thomas and Legay, Axel and Annavaram, Murali and Avestimehr, Salman and Dmitrienko, Alexandra and Koushanfar, Farinaz and Atli, Buse Gul and Kerschbaum, Florian and Gunn, Lachlan J. and Asokan, N. and Schunter, Matthias and Cammarota, Rosario and Dziedzic, Adam and Papernot, Nicolas and Smith, Virginia and Shokri, Reza},
      year = {2021}
    }
    
  6. Private Multi-Winner Voting For Machine Learning
    Adam Dziedzic, Christopher A Choquette-Choo, Natalie Dullerud, Vinith Menon Suriyakumar, Ali Shahin Shamsabadi, Muhammad Ahmad Kaleem, Somesh Jha, Nicolas Papernot, Xiao Wang
    2021

    Paper

    Private multi-winner voting is the task of revealing k-hot binary vectors that satisfy a bounded differential privacy guarantee. This task has been understudied in the machine learning literature despite its prevalence in many domains such as healthcare. We propose three new privacy-preserving multi-label mechanisms: Binary, τ, and Powerset voting. Binary voting operates independently per label through composition. τ voting bounds votes optimally in their ℓ2 norm. Powerset voting operates over the entire binary vector by viewing the possible outcomes as a power set. We theoretically analyze tradeoffs showing that Powerset voting requires strong correlations between labels to outperform Binary voting. We use these mechanisms to enable privacy-preserving multi-label learning by extending the canonical single-label technique: PATE. We empirically compare our techniques with DPSGD on large real-world healthcare data and standard multi-label benchmarks. We find that our techniques outperform all others in the centralized setting. We enable multi-label CaPC and show that our mechanisms can be used to collaboratively improve models in a multi-site (distributed) setting.
    @misc{PrivateMultiWinnerVoting2021,
      title = {Private Multi-Winner Voting For Machine Learning},
      author = {Dziedzic, Adam and Choquette-Choo, Christopher A and Dullerud, Natalie and Suriyakumar, Vinith Menon and Shamsabadi, Ali Shahin and Kaleem, Muhammad Ahmad and Jha, Somesh and Papernot, Nicolas and Wang, Xiao},
      year = {2021}
    }
    

2020

  1. Pretrained Transformers Improve Out-of-Distribution Robustness
    Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, Dawn Song
    In ACL (Association for Computational Linguistics) 2020

    Paper Slides Video Code

    Although pretrained Transformers such as BERT achieve high accuracy on in-distribution examples, do they generalize to new distributions? We systematically measure out-of-distribution (OOD) generalization for seven NLP datasets by constructing a new robustness benchmark with realistic distribution shifts. We measure the generalization of previous models including bag-of-words models, ConvNets, and LSTMs, and we show that pretrained Transformers’ performance declines are substantially smaller. Pretrained transformers are also more effective at detecting anomalous or OOD examples, while many previous models are frequently worse than chance. We examine which factors affect robustness, finding that larger models are not necessarily more robust, distillation can be harmful, and more diverse pretraining data can enhance robustness. Finally, we show where future work can improve OOD robustness.
    @inproceedings{hendrycks-etal-2020-pretrained,
      title = {Pretrained Transformers Improve Out-of-Distribution Robustness},
      author = {Hendrycks, Dan and Liu, Xiaoyuan and Wallace, Eric and Dziedzic, Adam and Krishnan, Rishabh and Song, Dawn},
      booktitle = {ACL (Association for Computational Linguistics)},
      month = jul,
      year = {2020},
      address = {Online},
      publisher = {ACL (Association for Computational Linguistics)},
      doi = {10.18653/v1/2020.acl-main.244},
      pages = {2744--2751}
    }
    
  2. Machine Learning based detection of multiple Wi-Fi BSSs for LTE-U CSAT
    Vanlin Sathya, Adam Dziedzic, Monisha Ghosh, Sanjay Krishnan
    In ICNC (International Conference on Computing, Networking and Communications) 2020

    Paper

    According to the LTE-U Forum specification, an LTE-U base-station (BS) reduces its duty cycle from 50% to 33% when it senses an increase in the number of co-channel Wi-Fi basic service sets (BSSs) from one to two. Detecting the number of Wi-Fi BSSs operating on the channel in real-time, without decoding the Wi-Fi packets, remains a challenge. In this paper, we present a novel machine learning (ML) approach that solves the problem by using energy values observed during the LTE-U OFF duration. Observing the energy values (at LTE-U BS OFF time) is a much simpler operation than decoding entire Wi-Fi packets. In this work, we implement and validate the proposed ML-based approach in real-time experiments, and demonstrate that there are two distinct patterns between one and two Wi-Fi APs. This approach delivers an accuracy close to 100% compared to auto-correlation (AC) and energy detection (ED) approaches.
    @inproceedings{sathya2020machine,
      title = {Machine Learning based detection of multiple Wi-Fi BSSs for LTE-U CSAT},
      author = {Sathya, Vanlin and Dziedzic, Adam and Ghosh, Monisha and Krishnan, Sanjay},
      booktitle = {ICNC (International Conference on Computing, Networking and Communications)},
      year = {2020},
      organization = {IEEE}
    }
    
  3. An Empirical Evaluation of Perturbation-based Defenses
    Adam Dziedzic, Sanjay Krishnan
    preprint arXiv:2002.03080 2020 Preprint

    Paper

    Recent work has extensively shown that randomized perturbations of a neural network can improve its robustness to adversarial attacks. The literature is, however, lacking a detailed compare-and-contrast of the latest proposals to understand what classes of perturbations work, when they work, and why they work. We contribute a detailed experimental evaluation that elucidates these questions and benchmarks perturbation defenses in a consistent way. In particular, we show five main results: (1) all input perturbation defenses, whether random or deterministic, are essentially equivalent in their efficacy, (2) such defenses offer almost no robustness to adaptive attacks unless these perturbations are observed during training, (3) a tuned sequence of noise layers across a network provides the best empirical robustness, (4) attacks transfer between perturbation defenses so the attackers need not know the specific type of defense only that it involves perturbations, and (5) adversarial examples very close to original images show an elevated sensitivity to perturbation in a first-order analysis. Based on these insights, we demonstrate a new robust model built on noise injection and adversarial training that achieves state-of-the-art robustness.
    @article{dziedzic2020empirical,
      title = {An Empirical Evaluation of Perturbation-based Defenses},
      author = {Dziedzic, Adam and Krishnan, Sanjay},
      journal = {preprint arXiv:2002.03080},
      year = {2020}
    }
    
  4. Machine Learning enabled Spectrum Sharing in Dense LTE-U/Wi-Fi Coexistence Scenarios
    Adam Dziedzic, Vanlin Sathya, Muhammad Rochman, Monisha Ghosh, Sanjay Krishnan
    OJVT (IEEE Open Journal of Vehicular Technology) 2020

    The application of Machine Learning (ML) techniques to complex engineering problems has proved to be an attractive and efficient solution. ML has been successfully applied to several practical tasks like image recognition, automating industrial operations, etc. The promise of ML techniques in solving non-linear problems influenced this work, which aims to apply known ML techniques and develop new ones for wireless spectrum sharing between Wi-Fi and LTE in the unlicensed spectrum. In this work, we focus on the LTE-Unlicensed (LTE-U) specification developed by the LTE-U Forum, which uses the duty-cycle approach for fair coexistence. The specification suggests reducing the duty cycle at the LTE-U base-station (BS) when the number of co-channel Wi-Fi basic service sets (BSSs) increases from one to two or more. However, without decoding the Wi-Fi packets, detecting the number of Wi-Fi BSSs operating on the channel in real-time is a challenging problem. In this work, we demonstrate a novel ML-based approach which solves this problem by using energy values observed during the LTE-U OFF duration. It is relatively straightforward to observe only the energy values during the LTE-U BS OFF time compared to decoding the entire Wi-Fi packet, which would require a full Wi-Fi receiver at the LTE-U base-station. We implement and validate the proposed ML-based approach through real-time experiments and demonstrate that the energy distributions show distinct patterns for one and many Wi-Fi AP transmissions. The proposed ML-based approach results in a higher accuracy (close to 99% in all cases) compared to the existing auto-correlation (AC) and energy detection (ED) approaches.
    @article{dziedzic2020machine,
      title = {Machine Learning enabled Spectrum Sharing in Dense LTE-U/Wi-Fi Coexistence Scenarios},
      author = {Dziedzic, Adam and Sathya, Vanlin and Rochman, Muhammad and Ghosh, Monisha and Krishnan, Sanjay},
      journal = {OJVT (IEEE Open Journal of Vehicular Technology)},
      year = {2020},
      publisher = {IEEE}
    }
    
  5. Input and Model Compression for Adaptive and Robust Neural Networks
    Adam Dziedzic
    2020 Thesis

    Paper

    @article{dziedzic2020input,
      title = {Input and Model Compression for Adaptive and Robust Neural Networks},
      author = {Dziedzic, Adam},
      year = {2020},
      publisher = {The University of Chicago}
    }
    

2019

  1. Band-limited Training and Inference for Convolutional Neural Networks
    Adam Dziedzic, Ioannis Paparrizos, Sanjay Krishnan, Aaron Elmore, Michael Franklin
    In ICML (International Conference on Machine Learning) 2019

    Paper Slides Video

    Convolutional layers are core building blocks of neural network architectures. In general, a convolutional filter applies to the entire frequency spectrum of the input data. We explore artificially constraining the frequency spectra of these filters and data, called band-limiting, during training. The frequency domain constraints apply to both the feed-forward and back-propagation steps. Experimentally, we observe that Convolutional Neural Networks (CNNs) are resilient to this compression scheme, and results suggest that CNNs learn to leverage lower-frequency components. In particular, we found: (1) band-limited training can effectively control the resource usage (GPU and memory); (2) models trained with band-limited layers retain high prediction accuracy; and (3) unlike other compression schemes, band-limiting requires no modification to existing training algorithms or neural network architectures.
    @inproceedings{dziedzic2019band,
      title = {Band-limited Training and Inference for Convolutional Neural Networks},
      author = {Dziedzic, Adam and Paparrizos, Ioannis and Krishnan, Sanjay and Elmore, Aaron and Franklin, Michael},
      booktitle = {ICML (International Conference on Machine Learning)},
      year = {2019}
    }
    
  2. Artificial intelligence in resource-constrained and shared environments
    Sanjay Krishnan, Aaron J Elmore, Michael Franklin, John Paparrizos, Zechao Shang, Adam Dziedzic, Rui Liu
    ACM SIGOPS Operating Systems Review 2019

    Paper

    The computational demands of modern AI techniques are immense, and as the number of practical applications grows, there will be an increasing burden on shared computing infrastructure. We envision a forthcoming era of "AI Systems" research where reducing resource consumption, reasoning about transient resource availability, trading off resource consumption for accuracy, and managing contention on specialized hardware will become the community’s main research focus. This paper overviews the history of AI systems research, a vision for the future, and the open challenges ahead.
    @article{krishnan2019artificial,
      title = {Artificial intelligence in resource-constrained and shared environments},
      author = {Krishnan, Sanjay and Elmore, Aaron J and Franklin, Michael and Paparrizos, John and Shang, Zechao and Dziedzic, Adam and Liu, Rui},
      journal = {ACM SIGOPS Operating Systems Review},
      volume = {53},
      number = {1},
      pages = {1--6},
      year = {2019},
      publisher = {ACM New York, NY, USA}
    }
    

2018

  1. Columnstore and B+ Tree - Are Hybrid Physical Designs Important?
    Adam Dziedzic, Jingjing Wang, Sudipto Das, Bolin Ding, Vivek R. Narasayya, Manoj Syamala
    In SIGMOD (ACM Special Interest Group on Management of Data) 2018

    Paper Slides

    Commercial DBMSs, such as Microsoft SQL Server, cater to diverse workloads including transaction processing, decision support, and operational analytics. They also support variety in physical design structures such as B+ tree and columnstore. The benefits of B+ tree for OLTP workloads and columnstore for decision support workloads are well-understood. However, the importance of hybrid physical designs, consisting of both columnstore and B+ tree indexes on the same database, is not well-studied — a focus of this paper. We first quantify the trade-offs using carefully-crafted micro-benchmarks. This micro-benchmarking indicates that hybrid physical designs can result in orders of magnitude better performance depending on the workload. For complex real-world applications, choosing an appropriate combination of columnstore and B+ tree indexes for a database workload is challenging. We extend the Database Engine Tuning Advisor for Microsoft SQL Server to recommend a suitable combination of B+ tree and columnstore indexes for a given workload. Through extensive experiments using industry-standard benchmarks and several real-world customer workloads, we quantify how a physical design tool capable of recommending hybrid physical designs can result in orders of magnitude better execution costs compared to approaches that rely either on columnstore-only or B+ tree-only designs.
    @inproceedings{dziedzic2018index,
      title = {Columnstore and B+ Tree - Are Hybrid Physical Designs Important?},
      author = {Dziedzic, Adam and Wang, Jingjing and Das, Sudipto and Ding, Bolin and Narasayya, Vivek R. and Syamala, Manoj},
      booktitle = {SIGMOD (ACM Special Interest Group on Management of Data)},
      year = {2018}
    }
    
  2. Deeplens: Towards a visual data management system
    Sanjay Krishnan, Adam Dziedzic, Aaron J Elmore
    CIDR (Conference on Innovative Data Systems Research) 2018

    Paper

    Advances in deep learning have greatly widened the scope of automatic computer vision algorithms and enable users to ask questions directly about the content in images and video. This paper explores the necessary steps towards a future Visual Data Management System (VDMS), where the predictions of such deep learning models are stored, managed, queried, and indexed. We propose a query and data model that disentangles the neural network models used, the query workload, and the data source semantics from the query processing layer. Our system, DeepLens, is based on dataflow query processing systems and this research prototype presents initial experiments to elicit important open research questions in visual analytics systems. One of our main conclusions is that any future "declarative" VDMS will have to revisit query optimization and automated physical design from a unified perspective of performance and accuracy tradeoffs. Physical design and query optimization choices can not only change performance by orders of magnitude, they can potentially affect the accuracy of results.
    @article{krishnan2018deeplens,
      title = {Deeplens: Towards a visual data management system},
      author = {Krishnan, Sanjay and Dziedzic, Adam and Elmore, Aaron J},
      journal = {CIDR (Conference on Innovative Data Systems Research)},
      year = {2018}
    }
    

2017

  1. Demonstrating the BigDAWG Polystore System for Ocean Metagenomics Analysis
    Tim Mattson, Vijay Gadepally, Zuohao She, Adam Dziedzic, Jeff Parkhurst
    In CIDR (Conference on Innovative Data Systems Research) 2017

    Paper

    In most Big Data applications, the data is heterogeneous. As we have been arguing in a series of papers, storage engines should be well suited to the data they hold. Therefore, a system supporting Big Data applications should be able to expose multiple storage engines through a single interface. We call such systems polystore systems. Our reference implementation of the polystore concept is called BigDAWG (short for the Big Data Analytics Working Group). In this demonstration, we will show the BigDAWG system and a number of polystore applications built to help ocean metagenomics researchers handle their heterogeneous Big Data.
    @inproceedings{mattson2017demonstrating,
      title = {Demonstrating the BigDAWG Polystore System for Ocean Metagenomics Analysis},
      author = {Mattson, Tim and Gadepally, Vijay and She, Zuohao and Dziedzic, Adam and Parkhurst, Jeff},
      booktitle = {CIDR (Conference on Innovative Data Systems Research)},
      year = {2017}
    }
    
  2. Bigdawg polystore release and demonstration
    Kyle O'Brien, Vijay Gadepally, Jennie Duggan, Adam Dziedzic, Aaron Elmore, Jeremy Kepner, Samuel Madden, Tim Mattson, Zuohao She, Michael Stonebraker
    preprint arXiv:1701.05799 2017 Preprint

    Paper

    @article{obrien2017bigdawg,
      title = {Bigdawg polystore release and demonstration},
      author = {O'Brien, Kyle and Gadepally, Vijay and Duggan, Jennie and Dziedzic, Adam and Elmore, Aaron and Kepner, Jeremy and Madden, Samuel and Mattson, Tim and She, Zuohao and Stonebraker, Michael},
      journal = {preprint arXiv:1701.05799},
      year = {2017}
    }
    
  3. Version 0.1 of the bigdawg polystore system
    Vijay Gadepally, Kyle O'Brien, Adam Dziedzic, Aaron Elmore, Jeremy Kepner, Samuel Madden, Tim Mattson, Jennie Rogers, Zuohao She, Michael Stonebraker
    preprint arXiv:1707.00721 2017 Preprint

    Paper

    @article{gadepally2017version,
      title = {Version 0.1 of the bigdawg polystore system},
      author = {Gadepally, Vijay and O'Brien, Kyle and Dziedzic, Adam and Elmore, Aaron and Kepner, Jeremy and Madden, Samuel and Mattson, Tim and Rogers, Jennie and She, Zuohao and Stonebraker, Michael},
      journal = {preprint arXiv:1707.00721},
      year = {2017}
    }
    
  4. BigDAWG version 0.1
    Vijay Gadepally, Kyle O’Brien, Adam Dziedzic, Aaron Elmore, Jeremy Kepner, Samuel Madden, Tim Mattson, Jennie Rogers, Zuohao She, Michael Stonebraker
    In HPEC (IEEE High Performance Extreme Computing) 2017

    @inproceedings{gadepally2017bigdawg,
      title = {BigDAWG version 0.1},
      author = {Gadepally, Vijay and O'Brien, Kyle and Dziedzic, Adam and Elmore, Aaron and Kepner, Jeremy and Madden, Samuel and Mattson, Tim and Rogers, Jennie and She, Zuohao and Stonebraker, Michael},
      booktitle = {HPEC (IEEE High Performance Extreme Computing)},
      pages = {1--7},
      year = {2017},
      organization = {IEEE}
    }
    
  5. Data Loading, Transformation and Migration for Database Management Systems
    Adam Dziedzic
    2017 Thesis

    Paper

    @article{dziedzic2017data,
      title = {Data Loading, Transformation and Migration for Database Management Systems},
      author = {Dziedzic, Adam},
      year = {2017},
      publisher = {The University of Chicago}
    }
    

2016

  1. DBMS Data Loading: An Analysis on Modern Hardware
    Adam Dziedzic, Manos Karpathiotakis, Ioannis Alagiannis, Raja Appuswamy, Anastasia Ailamaki
    In ADMS (Accelerating analytics and Data Management Systems) 2016

    Paper Slides

    Data loading has traditionally been considered a one-time deal - an offline process out of the critical path of query execution. The architecture of DBMSs is aligned with this assumption. Nevertheless, the rate at which data is produced and gathered nowadays has nullified the one-off assumption and has turned data loading into a major bottleneck of the data analysis pipeline. This paper analyzes the behavior of modern DBMSs in order to quantify their ability to fully exploit multicore processors and modern storage hardware during data loading. We examine multiple state-of-the-art DBMSs, a variety of hardware configurations, and a combination of synthetic and real-world datasets to identify bottlenecks in the data loading process and to provide guidelines on how to accelerate data loading. Our findings show that modern DBMSs are unable to saturate the available hardware resources. We therefore identify opportunities to accelerate data loading.
    @inproceedings{dziedzic2016dbms,
      title = {DBMS Data Loading: An Analysis on Modern Hardware},
      author = {Dziedzic, Adam and Karpathiotakis, Manos and Alagiannis, Ioannis and Appuswamy, Raja and Ailamaki, Anastasia},
      booktitle = {ADMS (Accelerating analytics and Data Management Systems)},
      year = {2016}
    }
    
  2. Data Transformation and Migration in Polystores
    Adam Dziedzic, Aaron Elmore, Michael Stonebraker
    In HPEC (IEEE High Performance Extreme Computing) 2016

    Paper Poster Slides

    Ever-increasing data size and new requirements in data processing have fostered the development of many new database systems. The result is that many data-intensive applications are underpinned by different engines. To enable data mobility, there is a need to transfer data between systems easily and efficiently. We analyze the state of the art of data migration and outline research opportunities for rapid data transfer. Our experiments explore data migration between a diverse set of databases, including PostgreSQL, SciDB, S-Store and Accumulo. Each of the systems excels at specific application requirements, such as transactional processing, numerical computation, streaming data, and large-scale text processing. Providing an efficient data migration tool is essential to take advantage of the superior processing of these specialized databases. Our goal is to build a data migration framework that will take advantage of recent advancements in hardware and software.
    @inproceedings{dziedzic2016transformation,
      title = {Data Transformation and Migration in Polystores},
      author = {Dziedzic, Adam and Elmore, Aaron and Stonebraker, Michael},
      booktitle = {HPEC (IEEE High Performance Extreme Computing)},
      year = {2016},
      organization = {IEEE}
    }
    
  3. Integrating Real-Time and Batch Processing in a Polystore
    John Meehan, Stan Zdonik, Shaobo Tian, Yulong Tian, Nesime Tatbul, Adam Dziedzic, Aaron Elmore
    In HPEC (IEEE High Performance Extreme Computing) 2016

    Paper

    This paper describes a stream processing engine called S-Store and its role in the BigDAWG polystore. Fundamentally, S-Store acts as a frontend processor that accepts input from multiple sources, and massages it into a form that has eliminated errors (data cleaning) and translates that input into a form that can be efficiently ingested into BigDAWG. S-Store also acts as an intelligent router that sends input tuples to the appropriate components of BigDAWG. All updates to S-Store’s shared memory are done in a transactionally consistent (ACID) way, thereby eliminating new errors caused by non-synchronized reads and writes. The ability to migrate data from component to component of BigDAWG is crucial. We have described a migrator from S-Store to Postgres that we have implemented as a first proof of concept. We report some interesting results using this migrator that impact the evaluation of query plans.
    @inproceedings{meehan2016integrating,
      title = {Integrating Real-Time and Batch Processing in a Polystore},
      author = {Meehan, John and Zdonik, Stan and Tian, Shaobo and Tian, Yulong and Tatbul, Nesime and Dziedzic, Adam and Elmore, Aaron},
      booktitle = {HPEC (IEEE High Performance Extreme Computing)},
      year = {2016}
    }
    

2015

  1. BigDAWG: a Polystore for Diverse Interactive Applications
    Adam Dziedzic, Jennie Duggan, Aaron J. Elmore, Vijay Gadepally, Michael Stonebraker
    In DSIA (IEEE Viz Data Systems for Interactive Analysis) 2015

    Paper

    Interactive analytics requires low-latency queries in the presence of diverse, complex, and constantly evolving workloads. To address these challenges, we introduce a polystore, BigDAWG, that tightly couples diverse database systems, data models, and query languages through the use of semantically grouped Islands of Information. BigDAWG, which stands for the Big Data Working Group, seeks to provide location transparency by matching the right system for each workload using a black-box model of query and system performance. In this paper we introduce BigDAWG as a solution to diverse web-based interactive applications and motivate our key challenges in building BigDAWG. BigDAWG continues to evolve and, where applicable, we have noted the current status of its implementation.
    @inproceedings{dziedzic2015bigdawg,
      title = {BigDAWG: a Polystore for Diverse Interactive Applications},
      author = {Dziedzic, Adam and Duggan, Jennie and Elmore, Aaron J. and Gadepally, Vijay and Stonebraker, Michael},
      booktitle = {DSIA (IEEE Viz Data Systems for Interactive Analysis)},
      year = {2015}
    }
    

2014

  1. Analysis and comparison of NoSQL databases with an introduction to consistent references in Big Data storage systems
    Adam Dziedzic, Jan Mulawka
    In Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2014

    Paper

    NoSQL is a new approach to data storage and manipulation. The aim of this paper is to gain more insight into NoSQL databases, as we are still in the early stages of understanding when to use them and how to use them in an appropriate way. In this submission, descriptions of selected NoSQL databases are presented. Each of the databases is analysed with primary focus on its data model, data access, architecture and practical usage in real applications. Furthermore, the NoSQL databases are compared with respect to data references. Relational databases offer foreign keys, whereas NoSQL databases provide only limited references. An intermediate model between graph theory and relational algebra which can address the problem should be created. Finally, a proposal for a new approach to the problem of inconsistent references in Big Data storage systems is introduced.
    @inproceedings{dziedzic2014analysis,
      title = {Analysis and comparison of NoSQL databases with an introduction to consistent references in Big Data storage systems},
      author = {Dziedzic, Adam and Mulawka, Jan},
      booktitle = {Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments},
      volume = {9290},
      pages = {92902V},
      year = {2014},
      organization = {International Society for Optics and Photonics}
    }