Adam Dziedzic

I am a Tenure Track Faculty Member at CISPA, where I co-lead the SprintML group. My research is focused on secure and trustworthy Machine Learning as a Service (MLaaS). I design robust and reliable machine learning methods for training and inference of ML models while preserving data privacy and model confidentiality.

I was a Postdoctoral Fellow at the Vector Institute and the University of Toronto, where I was a member of the CleverHans Lab advised by Prof. Nicolas Papernot. I earned my PhD at the University of Chicago, where I was advised by Prof. Sanjay Krishnan and worked on input and model compression for adaptive and robust neural networks. I obtained my Bachelor's and Master's degrees from Warsaw University of Technology in Poland, studied at DTU (Technical University of Denmark), and carried out research at EPFL in Switzerland. I have also worked at CERN (Geneva, Switzerland), Barclays Investment Bank (London, UK), Microsoft Research (Redmond, USA), and Google (Madison, USA).

Hiring: we are looking for ambitious students who would like to work with us in our SprintML group at CISPA. Please feel free to email me if you are interested in this opportunity.

Email: adam.dziedzic@sprintml.com (my public PGP key)
Address: CISPA Helmholtz Center for Information Security, Stuhlsatzenhaus 5, 66123 Saarbrücken, Germany

Selected Publications

Please also check the Full List of my publications and my Google Scholar profile.

  1. Memorization in Self-Supervised Learning Improves Downstream Generalization
    Wenhao Wang, Muhammad Ahmad Kaleem, Adam Dziedzic, Michael Backes, Nicolas Papernot, Franziska Boenisch
    In The Twelfth International Conference on Learning Representations (ICLR) 2024

    Paper Poster Code

    Self-supervised learning (SSL) has recently received significant attention due to its ability to train high-performance encoders purely on unlabeled data—often scraped from the internet. This data can still be sensitive and empirical evidence suggests that SSL encoders memorize private information of their training data and can disclose them at inference time. Since existing theoretical definitions of memorization from supervised learning rely on labels, they do not transfer to SSL. To address this gap, we propose SSLMem, a framework for defining memorization within SSL. Our definition compares the difference in alignment of representations for data points and their augmented views returned by both encoders that were trained on these data points and encoders that were not. Through comprehensive empirical analysis on diverse encoder architectures and datasets we highlight that even though SSL relies on large datasets and strong augmentations—both known in supervised learning as regularization techniques that reduce overfitting—still significant fractions of training data points experience high memorization. Through our empirical results, we show that this memorization is essential for encoders to achieve higher generalization performance on different downstream tasks.
    @inproceedings{wang2024memorization,
      title = {Memorization in Self-Supervised Learning Improves Downstream Generalization},
      author = {Wang, Wenhao and Kaleem, Muhammad Ahmad and Dziedzic, Adam and Backes, Michael and Papernot, Nicolas and Boenisch, Franziska},
      booktitle = {The Twelfth International Conference on Learning Representations (ICLR)},
      year = {2024}
    }
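
    A minimal NumPy sketch of the alignment-difference idea from the abstract above. It is illustrative only, not the paper's SSLMem implementation: the random linear "encoders", the pairwise-L2 alignment measure, and the normalization are assumptions made for brevity.

    import numpy as np

    def alignment(encoder, views):
        """Mean pairwise L2 distance between (normalized) representations of
        augmented views of the same data point; lower means better aligned."""
        reps = np.stack([encoder(v) for v in views])
        reps = reps / np.linalg.norm(reps, axis=1, keepdims=True)
        return float(np.mean([np.linalg.norm(reps[i] - reps[j])
                              for i in range(len(reps))
                              for j in range(i + 1, len(reps))]))

    def memorization_score(enc_with_x, enc_without_x, views_of_x):
        """Positive when the encoder trained on x aligns x's augmented views much
        more tightly than an encoder that never saw x."""
        return alignment(enc_without_x, views_of_x) - alignment(enc_with_x, views_of_x)

    # toy usage with random linear "encoders" and random "augmented views"
    rng = np.random.default_rng(0)
    W_f, W_g = rng.normal(size=(2, 16, 32))
    views = [rng.normal(size=32) for _ in range(4)]
    print(memorization_score(lambda v: W_f @ v, lambda v: W_g @ v, views))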
    
  2. Finding NeMo: Localizing Neurons Responsible For Memorization in Diffusion Models
    Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, Franziska Boenisch
    In Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS) 2024

    Paper

    Diffusion models (DMs) produce very detailed and high-quality images. Their power results from extensive training on large amounts of data, usually scraped from the internet without proper attribution or consent from content creators. Unfortunately, this practice raises privacy and intellectual property concerns, as DMs can memorize and later reproduce their potentially sensitive or copyrighted training images at inference time. Prior efforts prevent this issue by either changing the input to the diffusion process, thereby preventing the DM from generating memorized samples during inference, or removing the memorized data from training altogether. While those are viable solutions when the DM is developed and deployed in a secure and constantly monitored environment, they hold the risk of adversaries circumventing the safeguards and are not effective when the DM itself is publicly released. To solve the problem, we introduce NeMo, the first method to localize memorization of individual data samples down to the level of neurons in DMs’ cross-attention layers. Through our experiments, we make the intriguing finding that in many cases, single neurons are responsible for memorizing particular training samples. By deactivating these memorization neurons, we can avoid the replication of training data at inference time, increase the diversity in the generated outputs, and mitigate the leakage of private and copyrighted data. In this way, our NeMo contributes to a more responsible deployment of DMs.
    @inproceedings{hintersdorf2024MemorizationDiffusionModels,
      title = {Finding NeMo: Localizing Neurons Responsible For Memorization in Diffusion Models},
      author = {Hintersdorf, Dominik and Struppek, Lukas and Kersting, Kristian and Dziedzic, Adam and Boenisch, Franziska},
      year = {2024},
      booktitle = {Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS)}
    }
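
    A short PyTorch sketch of the deactivation step only. It does not reproduce NeMo's localization procedure (which is the paper's contribution); it merely shows, for a hypothetical layer and a hypothetical list of already-identified neuron indices, how zeroing those units at inference time might look.

    import torch

    def deactivate_neurons(module: torch.nn.Module, neuron_indices: list[int]):
        """Register a forward hook that zeroes selected output units of a layer,
        e.g. units previously identified as memorizing a training image."""
        def hook(_module, _inputs, output):
            output = output.clone()
            output[..., neuron_indices] = 0.0
            return output
        return module.register_forward_hook(hook)

    # toy usage on a stand-in linear layer (not a diffusion model's cross-attention)
    layer = torch.nn.Linear(8, 8)
    handle = deactivate_neurons(layer, neuron_indices=[2, 5])
    print(layer(torch.randn(1, 8))[0, [2, 5]])  # the selected units are zeroed
    handle.remove()                             # restores the original behavior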
    
  3. Localizing Memorization in SSL Vision Encoders
    Wenhao Wang, Adam Dziedzic, Michael Backes, Franziska Boenisch
    In Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS) 2024

    Paper

    Recent work on studying memorization in self-supervised learning (SSL) suggests that even though SSL encoders are trained on millions of images, they still memorize individual data points. While effort has been put into characterizing the memorized data and linking encoder memorization to downstream utility, little is known about where the memorization happens inside SSL encoders. To close this gap, we propose two metrics for localizing memorization in SSL encoders on a per-layer (layermem) and per-unit basis (unitmem). Our localization methods are independent of the downstream task, do not require any label information, and can be performed in a forward pass. By localizing memorization in various encoder architectures (convolutional and transformer-based) trained on diverse datasets with contrastive and non-contrastive SSL frameworks, we find that (1) while SSL memorization increases with layer depth, highly memorizing units are distributed across the entire encoder, (2) a significant fraction of units in SSL encoders experiences surprisingly high memorization of individual data points, which is in contrast to models trained under supervision, (3) atypical (or outlier) data points cause much higher layer and unit memorization than standard data points, and (4) in vision transformers, most memorization happens in the fully-connected layers. Finally, we show that localizing memorization in SSL has the potential to improve fine-tuning and to inform pruning strategies.
    @inproceedings{wang2024LocalizeMemorizationSSL,
      title = {Localizing Memorization in SSL Vision Encoders},
      author = {Wang, Wenhao and Dziedzic, Adam and Backes, Michael and Boenisch, Franziska},
      year = {2024},
      booktitle = {Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS)}
    }
    
  4. LLM Dataset Inference: Did you train on my dataset?
    Pratyush Maini, Hengrui Jia, Nicolas Papernot, Adam Dziedzic
    In Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS) 2024

    Paper

    The proliferation of large language models (LLMs) in the real world has come with a rise in copyright cases against companies for training their models on unlicensed data from the internet. Recent works have presented methods to identify if individual text sequences were members of the model’s training data, known as membership inference attacks (MIAs). We demonstrate that the apparent success of these MIAs is confounded by selecting non-members (text sequences not used for training) belonging to a different distribution from the members (e.g., temporally shifted recent Wikipedia articles compared with ones used to train the model). This distribution shift makes membership inference appear successful. However, most MIA methods perform no better than random guessing when discriminating between members and non-members from the same distribution (e.g., in this case, the same period of time). Even when MIAs work, we find that different MIAs succeed at inferring membership of samples from different distributions. Instead, we propose a new dataset inference method to accurately identify the datasets used to train large language models. This paradigm sits realistically in the modern-day copyright landscape, where authors claim that an LLM is trained over multiple documents (such as a book) written by them, rather than one particular paragraph. While dataset inference shares many of the challenges of membership inference, we solve it by selectively combining the MIAs that provide positive signal for a given distribution, and aggregating them to perform a statistical test on a given dataset. Our approach successfully distinguishes the train and test sets of different subsets of the Pile with statistically significant p-values < 0.1, without any false positives.
    @inproceedings{maini2024LLMDatasetInference,
      title = {LLM Dataset Inference: Did you train on my dataset?},
      author = {Maini, Pratyush and Jia, Hengrui and Papernot, Nicolas and Dziedzic, Adam},
      year = {2024},
      booktitle = {Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS)}
    }
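
    A toy sketch of the final statistical-test step, assuming the per-example MIA signals have already been selected and aggregated into a single score per example. The synthetic scores and the one-sided Welch t-test below are illustrative; the paper's aggregation pipeline is more involved.

    import numpy as np
    from scipy import stats

    def dataset_inference(suspect_scores, nonmember_scores, alpha=0.1):
        """One-sided Welch t-test: are aggregated membership scores on the suspect
        dataset significantly higher than on known non-members drawn from the
        same distribution? Returns the p-value and the decision at level alpha."""
        t_stat, p_two_sided = stats.ttest_ind(suspect_scores, nonmember_scores,
                                              equal_var=False)
        p = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
        return p, p < alpha

    # toy usage with synthetic, already-aggregated MIA scores
    rng = np.random.default_rng(0)
    trained_on = rng.normal(loc=0.6, scale=0.2, size=500)   # members score higher
    not_trained = rng.normal(loc=0.5, scale=0.2, size=500)
    print(dataset_inference(trained_on, not_trained))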
    
  5. Open LLMs are Necessary for Current Private Adaptations and Outperform their Closed Alternatives
    Vincent Hanke, Tom Blanchard, Franziska Boenisch, Iyiola Emmanuel Olatunji, Michael Backes, Adam Dziedzic
    In Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS) 2024

    Paper

    While open Large Language Models (LLMs) have made significant progress, they still fall short of matching the performance of their closed, proprietary counterparts, making the latter attractive even for the use on highly private data. Recently, various new methods have been proposed to adapt closed LLMs to private data without leaking private information to third parties and/or the LLM provider. In this work, we analyze the privacy protection and performance of the four most recent methods for private adaptation of closed LLMs. By examining their threat models and thoroughly comparing their performance under different privacy levels according to differential privacy (DP), various LLM architectures, and multiple datasets for classification and generation tasks, we find that: (1) all the methods leak query data, i.e., the (potentially sensitive) user data that is queried at inference time, to the LLM provider, (2) three out of four methods also leak large fractions of private training data to the LLM provider while the method that protects private data requires a local open LLM, (3) all the methods exhibit lower performance compared to three private gradient-based adaptation methods for local open LLMs, and (4) the private adaptation methods for closed LLMs incur higher monetary training and query costs than running the alternative methods on local open LLMs. This yields the conclusion that, to achieve truly privacy-preserving LLM adaptations that yield high performance and more privacy at lower costs, taking into account current methods and models, one should use open LLMs.
    @inproceedings{hanke2024openLLMs,
      title = {Open LLMs are Necessary for Current Private Adaptations and Outperform their Closed Alternatives},
      author = {Hanke, Vincent and Blanchard, Tom and Boenisch, Franziska and Olatunji, Iyiola Emmanuel and Backes, Michael and Dziedzic, Adam},
      year = {2024},
      booktitle = {Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS)}
    }
    
  6. Decentralised, Collaborative, and Privacy-preserving Machine Learning for Multi-Hospital Data
    Congyu Fang, Adam Dziedzic, Lin Zhang, Laura Oliva, Amol Verma, Fahad Razak, Nicolas Papernot, Bo Wang
    In eBioMedicine 2024

    Paper

    Machine Learning (ML) has demonstrated its great potential on medical data analysis. Large datasets collected from diverse sources and settings are essential for ML models in healthcare to achieve better accuracy and generalizability. Sharing data across different healthcare institutions is challenging because of complex and varying privacy and regulatory requirements. Hence, it is hard but crucial to allow multiple parties to collaboratively train an ML model leveraging the private datasets available at each party without the need for direct sharing of those datasets or compromising the privacy of the datasets through collaboration. In this paper, we address this challenge by proposing Decentralized, Collaborative, and Privacy-preserving ML for Multi-Hospital Data (DeCaPH). It offers the following key benefits: (1) it allows different parties to collaboratively train an ML model without transferring their private datasets; (2) it safeguards patient privacy by limiting the potential privacy leakage arising from any contents shared across the parties during the training process; and (3) it facilitates the ML model training without relying on a centralized server. We demonstrate the generalizability and power of DeCaPH on three distinct tasks using real-world distributed medical datasets: patient mortality prediction using electronic health records, cell-type classification using single-cell human genomes, and pathology identification using chest radiology images. We demonstrate that the ML models trained with DeCaPH framework have an improved utility-privacy trade-off, showing it enables the models to have good performance while preserving the privacy of the training data points. In addition, the ML models trained with DeCaPH framework in general outperform those trained solely with the private datasets from individual parties, showing that DeCaPH enhances the model generalizability.
    @inproceedings{fang2024collaborative,
      title = {Decentralised, Collaborative, and Privacy-preserving Machine Learning for Multi-Hospital Data},
      author = {Fang, Congyu and Dziedzic, Adam and Zhang, Lin and Oliva, Laura and Verma, Amol and Razak, Fahad and Papernot, Nicolas and Wang, Bo},
      booktitle = {eBioMedicine},
      year = {2024}
    }
    
  7. Private Multi-Winner Voting for Machine Learning
    Adam Dziedzic, Christopher A Choquette-Choo, Natalie Dullerud, Vinith Menon Suriyakumar, Ali Shahin Shamsabadi, Muhammad Ahmad Kaleem, Somesh Jha, Nicolas Papernot, Xiao Wang
    In Privacy Enhancing Technologies Symposium (PETS) 2023

    Paper Slides Video Code

    Private multi-winner voting is the task of revealing k-hot binary vectors satisfying a bounded differential privacy (DP) guarantee. This task has been understudied in machine learning literature despite its prevalence in many domains such as healthcare. We propose three new DP multi-winner mechanisms: Binary, τ, and Powerset voting. Binary voting operates independently per label through composition. τ voting bounds votes optimally in their ℓ2 norm for tight data-independent guarantees. Powerset voting operates over the entire binary vector by viewing the possible outcomes as a power set. Our theoretical and empirical analysis shows that Binary voting can be a competitive mechanism on many tasks unless there are strong correlations between labels, in which case Powerset voting outperforms it. We use our mechanisms to enable privacy-preserving multi-label learning in the central setting by extending the canonical single-label technique: PATE. We find that our techniques outperform current state-of-the-art approaches on large, real-world healthcare data and standard multi-label benchmarks. We further enable multi-label confidential and private collaborative (CaPC) learning and show that model performance can be significantly improved in the multi-site setting.
    @inproceedings{multilabel2023pets,
      title = {Private Multi-Winner Voting for Machine Learning},
      author = {Dziedzic, Adam and Choquette-Choo, Christopher A and Dullerud, Natalie and Suriyakumar, Vinith Menon and Shamsabadi, Ali Shahin and Kaleem, Muhammad Ahmad and Jha, Somesh and Papernot, Nicolas and Wang, Xiao},
      booktitle = {Privacy Enhancing Technologies Symposium (PETS)},
      year = {2023}
    }
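
    A toy sketch of per-label (Binary-style) noisy voting. The Gaussian noise and the majority threshold below are assumptions made for illustration; the paper's exact mechanisms and privacy accounting differ in detail.

    import numpy as np

    def binary_voting(teacher_votes, sigma, rng=None):
        """Per-label noisy aggregation of k-hot teacher votes: for each label,
        add Gaussian noise to the count of positive votes and switch the label
        on if the noisy count exceeds half of the ensemble. Each label is
        released independently, so privacy composes across labels."""
        rng = rng or np.random.default_rng()
        counts = teacher_votes.sum(axis=0).astype(float)
        counts += rng.normal(scale=sigma, size=counts.shape)
        return (counts > teacher_votes.shape[0] / 2).astype(int)

    # toy usage: 50 teachers voting over 5 labels
    rng = np.random.default_rng(0)
    votes = rng.integers(0, 2, size=(50, 5))
    print(binary_voting(votes, sigma=4.0, rng=rng))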
    
  8. Individualized PATE: Differentially Private Machine Learning with Individual Privacy Guarantees
    Franziska Boenisch, Christopher Mühl, Roy Rinberg, Jannis Ihrig, Adam Dziedzic
    In Privacy Enhancing Technologies Symposium (PETS) 2023

    Paper

    Applying machine learning (ML) to sensitive domains requires privacy protection of the underlying training data through formal privacy frameworks, such as differential privacy (DP). Yet, usually, the privacy of the training data comes at the cost of the resulting ML models’ utility. One reason for this is that DP uses one uniform privacy budget epsilon for all training data points, which has to align with the strictest privacy requirement encountered among all data holders. In practice, different data holders have different privacy requirements and data points of data holders with lower requirements can contribute more information to the training process of the ML models. To account for this need, we propose two novel methods based on the Private Aggregation of Teacher Ensembles (PATE) framework to support the training of ML models with individualized privacy guarantees. We formally describe the methods, provide a theoretical analysis of their privacy bounds, and experimentally evaluate their effect on the final model’s utility using the MNIST, SVHN, and Adult income datasets. Our empirical results show that the individualized privacy methods yield ML models of higher accuracy than the non-individualized baseline. Thereby, we improve the privacy-utility trade-off in scenarios in which different data holders consent to contribute their sensitive data at different individual privacy levels.
    @inproceedings{pate2023pets,
      author = {Boenisch, Franziska and Mühl, Christopher and Rinberg, Roy and Ihrig, Jannis and Dziedzic, Adam},
      title = {Individualized PATE: Differentially Private Machine Learning with Individual Privacy Guarantees},
      booktitle = {Privacy Enhancing Technologies Symposium (PETS)},
      year = {2023}
    }
    
  9. Bucks for Buckets (B4B): Active Defenses Against Stealing Encoders
    Jan Dubiński, Stanisław Pawlak, Franziska Boenisch, Tomasz Trzcinski, Adam Dziedzic
    In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023

    Paper Poster Slides Video Code

    Machine Learning as a Service (MLaaS) APIs provide ready-to-use and high-utility encoders that generate vector representations for given inputs. Since these encoders are very costly to train, they become lucrative targets for model stealing attacks during which an adversary leverages query access to the API to replicate the encoder locally at a fraction of the original training costs. We propose *Bucks for Buckets (B4B)*, the first *active defense* that prevents stealing while the attack is happening without degrading representation quality for legitimate API users. Our defense relies on the observation that the representations returned to adversaries who try to steal the encoder’s functionality cover a significantly larger fraction of the embedding space than representations of legitimate users who utilize the encoder to solve a particular downstream task. B4B leverages this to adaptively adjust the utility of the returned representations according to a user’s coverage of the embedding space. To prevent adaptive adversaries from eluding our defense by simply creating multiple user accounts (sybils), B4B also individually transforms each user’s representations. This prevents the adversary from directly aggregating representations over multiple accounts to create their stolen encoder copy. Our active defense opens a new path towards securely sharing and democratizing encoders over public APIs.
    @inproceedings{dubinski2023bucks,
      title = {Bucks for Buckets (B4B): Active Defenses Against Stealing Encoders},
      author = {Dubiński, Jan and Pawlak, Stanisław and Boenisch, Franziska and Trzcinski, Tomasz and Dziedzic, Adam},
      booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
      year = {2023}
    }
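
    A toy sketch of the coverage idea: hash returned representations into LSH-style buckets and scale the added noise with the fraction of buckets a user has touched. The hashing scheme, noise schedule, and the omitted per-user transformations are simplified stand-ins, not B4B's actual design.

    import numpy as np

    class CoverageTracker:
        """Hash returned representations into buckets via random hyperplanes and
        add noise that grows with the fraction of buckets a user has covered."""

        def __init__(self, dim, n_planes=12, max_noise=1.0, seed=0):
            self.rng = np.random.default_rng(seed)
            self.planes = self.rng.normal(size=(n_planes, dim))
            self.buckets = set()
            self.n_buckets = 2 ** n_planes
            self.max_noise = max_noise

        def respond(self, representation):
            bits = (self.planes @ representation > 0).astype(int)
            self.buckets.add(int("".join(map(str, bits)), 2))
            coverage = len(self.buckets) / self.n_buckets
            noise = self.rng.normal(scale=self.max_noise * coverage,
                                    size=representation.shape)
            return representation + noise

    tracker = CoverageTracker(dim=64)
    rep = np.random.default_rng(1).normal(size=64)
    print(np.linalg.norm(tracker.respond(rep) - rep))  # tiny for a low-coverage user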
    
  10. Robust and Actively Secure Serverless Collaborative Learning
    Nicholas Franzese, Adam Dziedzic, Christopher A. Choquette-Choo, Mark R. Thomas, Muhammad Ahmad Kaleem, Stephan Rabanser, Congyu Fang, Somesh Jha, Nicolas Papernot, Xiao Wang
    In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023

    Paper Poster

    Collaborative machine learning (ML) is widely used to enable institutions to learn better models from distributed data. While collaborative approaches to learning intuitively protect user data, they remain vulnerable to either the server, the clients, or both, deviating from the protocol. Indeed, because the protocol is asymmetric, a malicious server can abuse its power to reconstruct client data points. Conversely, malicious clients can corrupt learning with malicious updates. Thus, both clients and servers require a guarantee when the other cannot be trusted to fully cooperate. In this work, we propose a peer-to-peer (P2P) learning scheme that is secure against malicious servers and robust to malicious clients. Our core contribution is a generic framework that transforms any (compatible) algorithm for robust aggregation of model updates to the setting where servers and clients can act maliciously. Finally, we demonstrate the computational efficiency of our approach even with 1-million parameter models trained by 100s of peers on standard datasets.
    @inproceedings{franzeses2023p2pml,
      title = {Robust and Actively Secure Serverless Collaborative Learning},
      author = {Franzese, Nicholas and Dziedzic, Adam and Choquette-Choo, Christopher A. and Thomas, Mark R. and Kaleem, Muhammad Ahmad and Rabanser, Stephan and Fang, Congyu and Jha, Somesh and Papernot, Nicolas and Wang, Xiao},
      booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
      year = {2023}
    }
    
  11. Flocks of Stochastic Parrots: Differentially Private Prompt Learning for Large Language Models
    Haonan Duan, Adam Dziedzic, Nicolas Papernot, Franziska Boenisch
    In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023

    Paper Slides Video Code Blog Post

    Large language models (LLMs) are excellent in-context learners. However, the sensitivity of data contained in prompts raises privacy concerns. Our work first shows that these concerns are valid: we instantiate a simple but highly effective membership inference attack against the data used to prompt LLMs. To address this vulnerability, one could forego prompting and resort to fine-tuning LLMs with known algorithms for private gradient descent. However, this comes at the expense of the practicality and efficiency offered by prompting. Therefore, we propose to privately learn to prompt. We first show that soft prompts can be obtained privately through gradient descent on downstream data. However, this is not the case for discrete prompts. Thus, we orchestrate a noisy vote among an ensemble of LLMs presented with different prompts, i.e., a flock of stochastic parrots. The vote privately transfers the flock’s knowledge into a single public prompt. We show that LLMs prompted with our private algorithms closely match the non-private baselines. For example, using GPT3 as the base model, we achieve a downstream accuracy of 92.7% on the sst2 dataset with strong differential privacy guarantees vs. 95.2% for the non-private baseline. Through our experiments, we also show that our prompt-based approach is easily deployed with existing commercial APIs.
    @inproceedings{duan2023flocks,
      title = {Flocks of Stochastic Parrots: Differentially Private Prompt Learning for Large Language Models},
      author = {Duan, Haonan and Dziedzic, Adam and Papernot, Nicolas and Boenisch, Franziska},
      booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
      year = {2023}
    }
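
    A toy sketch of the noisy vote among differently-prompted teachers for a single public query. Report-noisy-max with Gaussian noise is assumed here for illustration; the paper's mechanism and privacy accounting may differ.

    import numpy as np

    def noisy_vote(teacher_labels, num_classes, sigma, rng=None):
        """Report-noisy-max over the labels that an ensemble of differently
        prompted LLMs assigns to one public query; the winning label can then
        be used to build a single public prompt."""
        rng = rng or np.random.default_rng()
        counts = np.bincount(teacher_labels, minlength=num_classes).astype(float)
        counts += rng.normal(scale=sigma, size=num_classes)
        return int(np.argmax(counts))

    # toy usage: 20 prompted "teachers", binary sentiment labels
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=20)
    print(noisy_vote(labels, num_classes=2, sigma=2.0, rng=rng))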
    
  12. Have it your way: Individualized Privacy Assignment for DP-SGD
    Franziska Boenisch, Christopher Mühl, Adam Dziedzic, Roy Rinberg, Nicolas Papernot
    In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023

    Paper

    When training a machine learning model with differential privacy, one sets a privacy budget. This budget represents a maximal privacy violation that any user is willing to face by contributing their data to the training set. We argue that this approach is limited because different users may have different privacy expectations. Thus, setting a uniform privacy budget across all points may be overly conservative for some users or, conversely, not sufficiently protective for others. In this paper, we capture these preferences through individualized privacy budgets. To demonstrate their practicality, we introduce a variant of Differentially Private Stochastic Gradient Descent (DP-SGD) which supports such individualized budgets. DP-SGD is the canonical approach to training models with differential privacy. We modify its data sampling and gradient noising mechanisms to arrive at our approach, which we call Individualized DP-SGD (IDP-SGD). Because IDP-SGD provides privacy guarantees tailored to the preferences of individual users and their data points, we find it empirically improves privacy-utility trade-offs.
    @inproceedings{boenisch2023idpsgd,
      title = {Have it your way: Individualized Privacy Assignment for DP-SGD},
      author = {Boenisch, Franziska and Mühl, Christopher and Dziedzic, Adam and Rinberg, Roy and Papernot, Nicolas},
      year = {2023},
      booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
      eprint = {2303.17046},
      archiveprefix = {arXiv},
      primaryclass = {cs.LG}
    }
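
    A toy sketch of one half of the idea, individualized data sampling: each example's inclusion probability in a Poisson-sampled batch depends on the privacy level its owner chose. The per-group rates are hypothetical, and the complementary noise-scaling variant and the calibration from the paper are omitted.

    import numpy as np

    def individualized_poisson_batch(privacy_group, sample_rates, rng=None):
        """Poisson-sample a mini-batch where each example's inclusion probability
        depends on its owner's privacy level (a looser budget maps to a higher
        sampling rate, so the point contributes to more updates)."""
        rng = rng or np.random.default_rng()
        probs = np.array([sample_rates[int(g)] for g in privacy_group])
        return np.nonzero(rng.random(len(privacy_group)) < probs)[0]

    # toy usage: 10 points split across 3 hypothetical privacy groups
    groups = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
    rates = {0: 0.05, 1: 0.1, 2: 0.2}   # hypothetical per-group sampling rates
    print(individualized_poisson_batch(groups, rates, np.random.default_rng(0)))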
    
  13. On the privacy risk of in-context learning
    Haonan Duan, Adam Dziedzic, Mohammad Yaghini, Nicolas Papernot, Franziska Boenisch
    In The 61st Annual Meeting Of The Association For Computational Linguistics 2023

    Paper

Large language models (LLMs) are excellent few-shot learners. They can perform a wide variety of tasks purely based on natural language prompts provided to them. These prompts contain data of a specific downstream task—often the private dataset of a party, e.g., a company that wants to leverage the LLM for their purposes. We show that deploying prompted models presents a significant privacy risk for the data used within the prompt by instantiating a highly effective membership inference attack. We also observe that the privacy risk of prompted models exceeds that of fine-tuned models at the same utility levels. After identifying the model’s sensitivity to their prompts—in the form of a significantly higher prediction confidence on the prompted data—as a cause for the increased risk, we propose ensembling as a mitigation strategy. By aggregating over multiple different versions of a prompted model, membership inference risk can be decreased.
    @inproceedings{duan2023privacyICL,
      title = {On the privacy risk of in-context learning},
      author = {Duan, Haonan and Dziedzic, Adam and Yaghini, Mohammad and Papernot, Nicolas and Boenisch, Franziska},
      booktitle = {The 61st Annual Meeting Of The Association For Computational Linguistics},
      year = {2023}
    }
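
    A toy sketch of a confidence-thresholding membership test of the kind the abstract alludes to. The synthetic confidence distributions and the threshold value are illustrative stand-ins, not the paper's attack.

    import numpy as np

    def confidence_mia(confidence_on_true_label, threshold):
        """Flag a candidate as a prompt member when the prompted model assigns
        it a confidence above a threshold calibrated on reference data."""
        return confidence_on_true_label > threshold

    # toy usage: members of the prompt tend to receive higher confidence
    rng = np.random.default_rng(0)
    members = rng.beta(8, 2, size=100)
    nonmembers = rng.beta(4, 4, size=100)
    threshold = 0.8   # hypothetical calibrated value
    print(f"TPR={confidence_mia(members, threshold).mean():.2f}, "
          f"FPR={confidence_mia(nonmembers, threshold).mean():.2f}")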
    
  14. Dataset Inference for Self-Supervised Models
    Adam Dziedzic, Haonan Duan, Muhammad Ahmad Kaleem, Nikita Dhawan, Jonas Guan, Yannis Cattan, Franziska Boenisch, Nicolas Papernot
    In NeurIPS (Neural Information Processing Systems) 2022

    Paper Slides Video Code

    Self-supervised models are increasingly prevalent in machine learning (ML) since they reduce the need for expensively labeled data. Because of their versatility in downstream applications, they are increasingly used as a service exposed via public APIs. At the same time, these encoder models are particularly vulnerable to model stealing attacks due to the high dimensionality of vector representations they output. Yet, encoders remain undefended: existing mitigation strategies for stealing attacks focus on supervised learning. We introduce a new dataset inference defense, which uses the private training set of the victim encoder model to attribute its ownership in the event of stealing. The intuition is that the log-likelihood of an encoder’s output representations is higher on the victim’s training data than on test data if it is stolen from the victim, but not if it is independently trained. We compute this log-likelihood using density estimation models. As part of our evaluation, we also propose measuring the fidelity of stolen encoders and quantifying the effectiveness of the theft detection without involving downstream tasks; instead, we leverage mutual information and distance measurements. Our extensive empirical results in the vision domain demonstrate that dataset inference is a promising direction for defending self-supervised models against model stealing.
    @inproceedings{datasetinference2022neurips,
      title = {Dataset Inference for Self-Supervised Models},
      author = {Dziedzic, Adam and Duan, Haonan and Kaleem, Muhammad Ahmad and Dhawan, Nikita and Guan, Jonas and Cattan, Yannis and Boenisch, Franziska and Papernot, Nicolas},
      booktitle = {NeurIPS (Neural Information Processing Systems)},
      year = {2022}
    }
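
    A toy sketch of the log-likelihood comparison, using a Gaussian mixture as the density estimator. The paper's density models, representation extraction, and hypothesis test are more elaborate; the random linear "encoder" and synthetic data below are stand-ins.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def representation_likelihood_gap(suspect_encoder, train_data, test_data):
        """Fit a density model on the suspect encoder's representations of the
        victim's training data; a clearly higher mean log-likelihood on train
        than on test is evidence that the encoder was stolen from the victim."""
        train_reps = np.stack([suspect_encoder(x) for x in train_data])
        test_reps = np.stack([suspect_encoder(x) for x in test_data])
        density = GaussianMixture(n_components=8, random_state=0).fit(train_reps)
        return density.score(train_reps) - density.score(test_reps)

    # toy usage with a random linear "encoder" and synthetic data
    rng = np.random.default_rng(0)
    W = rng.normal(size=(16, 32))
    train = [rng.normal(size=32) for _ in range(200)]
    test = [rng.normal(size=32) for _ in range(200)]
    print(representation_likelihood_gap(lambda x: W @ x, train, test))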
    
  15. Increasing the Cost of Model Extraction with Calibrated Proof of Work
    Adam Dziedzic, Muhammad Ahmad Kaleem, Yu Shen Lu, Nicolas Papernot
In ICLR (International Conference on Learning Representations) [SPOTLIGHT] 2022

    Paper Slides Video Code Blog Post

    In model extraction attacks, adversaries can steal a machine learning model exposed via a public API by repeatedly querying it and adjusting their own model based on obtained predictions. To prevent model stealing, existing defenses focus on detecting malicious queries, truncating, or distorting outputs, thus necessarily introducing a tradeoff between robustness and model utility for legitimate users. Instead, we propose to impede model extraction by requiring users to complete a proof-of-work before they can read the model’s predictions. This deters attackers by greatly increasing (even up to 100x) the computational effort needed to leverage query access for model extraction. Since we calibrate the effort required to complete the proof-of-work to each query, this only introduces a slight overhead for regular users (up to 2x). To achieve this, our calibration applies tools from differential privacy to measure the information revealed by a query. Our method requires no modification of the victim model and can be applied by machine learning practitioners to guard their publicly exposed models against being easily stolen.
    @inproceedings{pow2022iclr,
      title = {Increasing the Cost of Model Extraction with Calibrated Proof of Work},
      author = {Dziedzic, Adam and Kaleem, Muhammad Ahmad and Lu, Yu Shen and Papernot, Nicolas},
booktitle = {ICLR (International Conference on Learning Representations) [SPOTLIGHT]},
      year = {2022}
    }
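
    A toy sketch of the mechanism: map an estimated per-query information cost to a hashcash-style puzzle the client must solve before reading the prediction. The mapping from cost to difficulty below is a made-up placeholder, not the paper's DP-based calibration.

    import hashlib
    import itertools

    def difficulty_for(information_cost, base_bits=8, scale=2.0):
        """Map an estimated per-query information cost (e.g. from a DP-style
        accountant) to the number of leading zero bits the client must find."""
        return base_bits + int(scale * information_cost)

    def solve_pow(challenge, bits):
        """Hashcash-style puzzle: find a nonce such that SHA-256(challenge||nonce)
        starts with `bits` zero bits; expected work doubles with each extra bit."""
        target = 1 << (256 - bits)
        for nonce in itertools.count():
            digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce

    # toy usage: a higher-information query receives a noticeably harder puzzle
    print(solve_pow(b"query-1", difficulty_for(information_cost=1.0)))
    print(solve_pow(b"query-2", difficulty_for(information_cost=4.0)))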
    
  16. On the Difficulty of Defending Self-Supervised Learning against Model Extraction
    Adam Dziedzic, Nikita Dhawan, Muhammad Ahmad Kaleem, Jonas Guan, Nicolas Papernot
    In ICML (International Conference on Machine Learning) 2022

    Paper Slides Video Code

    Self-Supervised Learning (SSL) is an increasingly popular ML paradigm that trains models to transform complex inputs into representations without relying on explicit labels. These representations encode similarity structures that enable efficient learning of multiple downstream tasks. Recently, ML-as-a-Service providers have commenced offering trained SSL models over inference APIs, which transform user inputs into useful representations for a fee. However, the high cost involved to train these models and their exposure over APIs both make black-box extraction a realistic security threat. We thus explore model stealing attacks against SSL. Unlike traditional model extraction on classifiers that output labels, the victim models here output representations; these representations are of significantly higher dimensionality compared to the low-dimensional prediction scores output by classifiers. We construct several novel attacks and find that approaches that train directly on a victim’s stolen representations are query efficient and enable high accuracy for downstream models. We then show that existing defenses against model extraction are inadequate and not easily retrofitted to the specificities of SSL.
    @inproceedings{sslextractions2022icml,
      title = {On the Difficulty of Defending Self-Supervised Learning against Model Extraction},
      author = {Dziedzic, Adam and Dhawan, Nikita and Kaleem, Muhammad Ahmad and Guan, Jonas and Papernot, Nicolas},
      booktitle = {ICML (International Conference on Machine Learning)},
      year = {2022}
    }
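
    A toy PyTorch sketch of the direct-regression attack the abstract mentions: train a local encoder to reproduce the representations returned by the victim API. The victim, the student architecture, and the plain MSE loss are simplified stand-ins for the attacks studied in the paper.

    import torch
    from torch import nn

    def steal_encoder(victim, queries, rep_dim, epochs=200, lr=1e-3):
        """Train a local encoder to reproduce the representations that the victim
        API returned for a set of query inputs (direct MSE regression)."""
        with torch.no_grad():
            targets = victim(queries)             # representations bought from the API
        student = nn.Sequential(nn.Linear(queries.shape[1], 256), nn.ReLU(),
                                nn.Linear(256, rep_dim))
        opt = torch.optim.Adam(student.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.mse_loss(student(queries), targets)
            loss.backward()
            opt.step()
        return student

    # toy usage with a random linear layer standing in for the victim encoder
    victim = nn.Linear(32, 64)
    stolen = steal_encoder(victim, torch.randn(512, 32), rep_dim=64)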
    
  17. CaPC Learning: Confidential and Private Collaborative Learning
    Christopher A. Choquette-Choo, Natalie Dullerud, Adam Dziedzic, Yunxiang Zhang, Somesh Jha, Nicolas Papernot, Xiao Wang
    In ICLR (International Conference on Learning Representations) 2021

    Paper Slides Video Code Blog Post

    Machine learning benefits from large training datasets, which may not always be possible to collect by any single entity, especially when using privacy-sensitive data. In many contexts, such as healthcare and finance, separate parties may wish to collaborate and learn from each other’s data but are prevented from doing so due to privacy regulations. Some regulations prevent explicit sharing of data between parties by joining datasets in a central location (confidentiality). Others also limit implicit sharing of data, e.g., through model predictions (privacy). There is currently no method that enables machine learning in such a setting, where both confidentiality and privacy need to be preserved, to prevent both explicit and implicit sharing of data. Federated learning only provides confidentiality, not privacy, since gradients shared still contain private information. Differentially private learning assumes unreasonably large datasets. Furthermore, both of these learning paradigms produce a central model whose architecture was previously agreed upon by all parties rather than enabling collaborative learning where each party learns and improves their own local model. We introduce Confidential and Private Collaborative (CaPC) learning, the first method provably achieving both confidentiality and privacy in a collaborative setting. We leverage secure multi-party computation (MPC), homomorphic encryption (HE), and other techniques in combination with privately aggregated teacher models. We demonstrate how CaPC allows participants to collaborate without having to explicitly join their training sets or train a central model. Each party is able to improve the accuracy and fairness of their model, even in settings where each party has a model that performs well on their own dataset or when datasets are not IID and model architectures are heterogeneous across parties.
    @inproceedings{capc2021iclr,
      title = {CaPC Learning: Confidential and Private Collaborative Learning},
      author = {Choquette-Choo, Christopher A. and Dullerud, Natalie and Dziedzic, Adam and Zhang, Yunxiang and Jha, Somesh and Papernot, Nicolas and Wang, Xiao},
      booktitle = {ICLR (International Conference on Learning Representations)},
      year = {2021}
    }
    
  18. Pretrained Transformers Improve Out-of-Distribution Robustness
    Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, Dawn Song
    In ACL (Association for Computational Linguistics) 2020

    Paper Slides Video Code

    Although pretrained Transformers such as BERT achieve high accuracy on in-distribution examples, do they generalize to new distributions? We systematically measure out-of-distribution (OOD) generalization for seven NLP datasets by constructing a new robustness benchmark with realistic distribution shifts. We measure the generalization of previous models including bag-of-words models, ConvNets, and LSTMs, and we show that pretrained Transformers’ performance declines are substantially smaller. Pretrained transformers are also more effective at detecting anomalous or OOD examples, while many previous models are frequently worse than chance. We examine which factors affect robustness, finding that larger models are not necessarily more robust, distillation can be harmful, and more diverse pretraining data can enhance robustness. Finally, we show where future work can improve OOD robustness.
    @inproceedings{hendrycks-etal-2020-pretrained,
      title = {Pretrained Transformers Improve Out-of-Distribution Robustness},
      author = {Hendrycks, Dan and Liu, Xiaoyuan and Wallace, Eric and Dziedzic, Adam and Krishnan, Rishabh and Song, Dawn},
      booktitle = { ACL (Association for Computational Linguistics)},
      month = jul,
      year = {2020},
      address = {Online},
      publisher = {ACL (Association for Computational Linguistics)},
      doi = {10.18653/v1/2020.acl-main.244},
      pages = {2744--2751}
    }
    
  19. Band-limited Training and Inference for Convolutional Neural Networks
Adam Dziedzic, Ioannis Paparrizos, Sanjay Krishnan, Aaron Elmore, Michael Franklin
    In ICML (International Conference on Machine Learning) 2019

    Paper Slides Video

The convolutional layers are core building blocks of neural network architectures. In general, a convolutional filter applies to the entire frequency spectrum of the input data. We explore artificially constraining the frequency spectra of these filters and data, called band-limiting, during training. The frequency domain constraints apply to both the feed-forward and back-propagation steps. Experimentally, we observe that Convolutional Neural Networks (CNNs) are resilient to this compression scheme, and the results suggest that CNNs learn to leverage lower-frequency components. In particular, we found that: (1) band-limited training can effectively control resource usage (GPU and memory); (2) models trained with band-limited layers retain high prediction accuracy; and (3) unlike other compression schemes, band-limited training requires no modification to existing training algorithms or neural network architectures.
    @inproceedings{dziedzic2019band,
      title = {Band-limited Training and Inference for Convolutional Neural Networks},
author = {Dziedzic, Adam and Paparrizos, Ioannis and Krishnan, Sanjay and Elmore, Aaron and Franklin, Michael},
      booktitle = {ICML (International Conference on Machine Learning)},
      year = {2019}
    }
    
  20. Columnstore and B+ Tree - Are Hybrid Physical Designs Important?
    Adam Dziedzic, Jingjing Wang, Sudipto Das, Bolin Ding, Vivek R. Narasayya, Manoj Syamala
    In SIGMOD (ACM Special Interest Group on Management of Data) 2018

    Paper Slides

    Commercial DBMSs, such as Microsoft SQL Server, cater to diverse workloads including transaction processing, decision support, and operational analytics. They also support variety in physical design structures such as B+ tree and columnstore. The benefits of B+ tree for OLTP workloads and columnstore for decision support workloads are well-understood. However, the importance of hybrid physical designs, consisting of both columnstore and B+ tree indexes on the same database, is not well-studied — a focus of this paper. We first quantify the trade-offs using carefully-crafted micro-benchmarks. This micro-benchmarking indicates that hybrid physical designs can result in orders of magnitude better performance depending on the workload. For complex real-world applications, choosing an appropriate combination of columnstore and B+ tree indexes for a database workload is challenging. We extend the Database Engine Tuning Advisor for Microsoft SQL Server to recommend a suitable combination of B+ tree and columnstore indexes for a given workload. Through extensive experiments using industry-standard benchmarks and several real-world customer workloads, we quantify how a physical design tool capable of recommending hybrid physical designs can result in orders of magnitude better execution costs compared to approaches that rely either on columnstore-only or B+ tree-only designs.
    @inproceedings{dziedzic2018index,
      title = {Columnstore and B+ Tree - Are Hybrid Physical Designs Important?},
      author = {Dziedzic, Adam and Wang, Jingjing and Das, Sudipto and Ding, Bolin and Narasayya, Vivek R. and Syamala, Manoj},
      booktitle = {SIGMOD (ACM Special Interest Group on Management of Data)},
      year = {2018}
    }
    
  21. Demonstrating the BigDAWG Polystore System for Ocean Metagenomics Analysis.
    Tim Mattson, Vijay Gadepally, Zuohao She, Adam Dziedzic, Jeff Parkhurst
    In CIDR (Conference on Innovative Data Systems Research) 2017

    Paper

In most Big Data applications, the data is heterogeneous. As we have been arguing in a series of papers, storage engines should be well suited to the data they hold. Therefore, a system supporting Big Data applications should be able to expose multiple storage engines through a single interface. We call such systems polystore systems. Our reference implementation of the polystore concept is called BigDAWG (short for the Big Data Analytics Working Group). In this demonstration, we will show the BigDAWG system and a number of polystore applications built to help ocean metagenomics researchers handle their heterogeneous Big Data.
    @inproceedings{mattson2017demonstrating,
      title = {Demonstrating the BigDAWG Polystore System for Ocean Metagenomics Analysis.},
      author = {Mattson, Tim and Gadepally, Vijay and She, Zuohao and Dziedzic, Adam and Parkhurst, Jeff},
      booktitle = {CIDR (Conference on Innovative Data Systems Research)},
      year = {2017}
    }
    
  22. DBMS Data Loading: An Analysis on Modern Hardware
    Adam Dziedzic, Manos Karpathiotakis, Ioannis Alagiannis, Raja Appuswamy, Anastasia Ailamaki
    In ADMS (Accelerating analytics and Data Management Systems) 2016

    Paper Slides

    Data loading has traditionally been considered a one-time deal - an offline process out of the critical path of query execution. The architecture of DBMS is aligned with this assumption. Nevertheless, the rate in which data is produced and gathered nowadays has nullified the one-off assumption, and has turned data loading into a major bottleneck of the data analysis pipeline. This paper analyzes the behavior of modern DBMSs in order to quantify their ability to fully exploit multicore processors and modern storage hardware during data loading. We examine multiple state-of-the-art DBMSs, a variety of hardware configurations, and a combination of synthetic and real-world datasets to identify bottlenecks in the data loading process and to provide guidelines on how to accelerate data loading. Our findings show that modern DBMSs are unable to saturate the available hardware resources. We therefore identify opportunities to accelerate data loading.
    @inproceedings{dziedzic2016dbms,
      title = {DBMS Data Loading: An Analysis on Modern Hardware},
      author = {Dziedzic, Adam and Karpathiotakis, Manos and Alagiannis, Ioannis and Appuswamy, Raja and Ailamaki, Anastasia},
      booktitle = {ADMS (Accelerating analytics and Data Management Systems)},
      year = {2016}
    }
    
  23. Data Transformation and Migration in Polystores
    Adam Dziedzic, Aaron Elmore, Michael Stonebraker
    In HPEC (IEEE High Performance Extreme Computing) 2016

    Paper Poster Slides

Ever-increasing data sizes and new requirements in data processing have fostered the development of many new database systems. The result is that many data-intensive applications are underpinned by different engines. To enable data mobility there is a need to transfer data between systems easily and efficiently. We analyze the state of the art of data migration and outline research opportunities for rapid data transfer. Our experiments explore data migration between a diverse set of databases, including PostgreSQL, SciDB, S-Store and Accumulo. Each of the systems excels at specific application requirements, such as transactional processing, numerical computation, streaming data, and large-scale text processing. Providing an efficient data migration tool is essential to take advantage of the superior processing offered by these specialized databases. Our goal is to build a data migration framework that takes advantage of recent advances in hardware and software.
    @inproceedings{dziedzic2016transformation,
      title = {Data Transformation and Migration in Polystores},
      author = {Dziedzic, Adam and Elmore, Aaron and Stonebraker, Michael},
      booktitle = {HPEC (IEEE High Performance Extreme Computing)},
      year = {2016},
      organization = {IEEE}
    }
    
  24. BigDAWG: a Polystore for Diverse Interactive Applications
    Adam Dziedzic, Jennie Duggan, Aaron J. Elmore, Vijay Gadepally, Michael Stonebraker
    In DSIA (IEEE Viz Data Systems for Interactive Analysis) 2015

    Paper

Interactive analytics requires low latency queries in the presence of diverse, complex, and constantly evolving workloads. To address these challenges, we introduce a polystore, BigDAWG, that tightly couples diverse database systems, data models, and query languages through use of semantically grouped Islands of Information. BigDAWG, which stands for the Big Data Working Group, seeks to provide location transparency by matching the right system to each workload using a black-box model of query and system performance. In this paper we introduce BigDAWG as a solution for diverse web-based interactive applications and motivate our key challenges in building BigDAWG. BigDAWG continues to evolve and, where applicable, we have noted the current status of its implementation.
    @inproceedings{dziedzic2015bigdawg,
      title = {BigDAWG: a Polystore for Diverse Interactive Applications},
      author = {Dziedzic, Adam and Duggan, Jennie and Elmore, Aaron J. and Gadepally, Vijay and Stonebraker, Michael},
      booktitle = {DSIA (IEEE Viz Data Systems for Interactive Analysis)},
      year = {2015}
    }
    

Research Talks

Experience

CISPA Helmholtz Center for Information Security

September 2023 - current: Tenure Track Faculty Member

My research is focused on secure and trustworthy Machine Learning as a Service (MLaaS). I design robust and reliable machine learning methods for training and inference of ML models while preserving data privacy and model confidentiality.

University of Toronto & Vector Institute

September 2020 - August 2023: Postdoctoral researcher

Research on collaborative, private, and robust Machine Learning.

University of Chicago

July 2015 - August 2020: PhD Student

Research on the intersection of robust machine learning and database management systems (DBMSs).

Google

June - September 2017: PhD Software Engineering Intern on the Data Infrastructure and Analysis team

Research on graceful degradation and avoidance of performance cliffs in the F1 system.

Microsoft Research

March - June 2017: Research Intern in the Data Management, Exploration and Mining group

Carried out research on hybrid physical designs for diverse workloads.

EPFL

October 2014 - June 2015: Research Intern

Research on data loading into diverse database management systems.

Warsaw University of Technology

October 2007 - September 2014: Bachelor's and Master's Student

I was awarded the academic scholarship for the faculty's best students (based on GPA).

Barclays Investment Bank

June - August 2013: Intern Analyst

Created a system for validating and suggesting underlyings for complex financial products.

CERN

April - December 2012: Technical Student in the IT Department

Designed a system to store information on the configuration and management of devices in the computer center.

Mobile Startup

March 2012: Udarnik

Worked on an application providing music-centered social interaction features.

Technical University of Denmark

August 2010 - January 2011: Erasmus Student

Applied statistics, Web 2.0 and mobile interactions, spatial databases, logic programming.

Tekten

July 2010: Designer and Software Engineer

Designed a database and developed an application for a telecom company in Java and PL/SQL.

Torn

July - September 2009: Software Engineer

Worked on a financial and accounting system project in Java and Oracle 10g.

Projects

Collaborative Learning in ML

Confidential and Private Collaborative (CaPC) learning is the first method to provably achieve both confidentiality and privacy in a collaborative setting, using techniques from the cryptography and differential privacy literature.

Paper Slides Talk Bibtex
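
A heavily simplified sketch of one CaPC-style query from the querying party's perspective. It is illustrative only: the function and parameter names are hypothetical, the noisy-argmax aggregation is a plain stand-in for the private aggregation analysis, and the confidentiality layer (secure multi-party computation and homomorphic encryption) is omitted entirely.

import numpy as np

def capc_label(query_rep, answering_parties, num_classes, sigma, rng=None):
    """Privacy half of a CaPC-style query, heavily simplified: each answering
    party predicts a label for the query and the result is released only
    through a noisy argmax. The confidentiality half (MPC/HE so that no party
    ever sees the raw query or the individual answers) is omitted in this toy."""
    rng = rng or np.random.default_rng()
    votes = np.zeros(num_classes)
    for predict in answering_parties:      # each party runs inference locally
        votes[predict(query_rep)] += 1
    return int(np.argmax(votes + rng.normal(scale=sigma, size=num_classes)))

# toy usage: 20 answering parties with random linear "classifiers"
rng = np.random.default_rng(0)
parties = [lambda x, W=rng.normal(size=(10, 32)): int(np.argmax(W @ x))
           for _ in range(20)]
label = capc_label(rng.normal(size=32), parties, num_classes=10, sigma=2.0)
# the querying party would add (query, label) to its training set and retrain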
Band-limited Training and Inference for Convolutional Neural Networks

The convolutional layers are core building blocks of neural network architectures. In general, a convolutional filter applies to the entire frequency spectrum of the input data. We explore artificially constraining the frequency spectra of these filters and data, called band-limiting, during training. The frequency domain constraints apply to both the feed-forward and back-propagation steps. Experimentally, we observe that Convolutional Neural Networks (CNNs) are resilient to this compression scheme, and the results suggest that CNNs learn to leverage lower-frequency components. In particular, we found that: (1) band-limited training can effectively control resource usage (GPU and memory); (2) models trained with band-limited layers retain high prediction accuracy; and (3) unlike other compression schemes, band-limited training requires no modification to existing training algorithms or neural network architectures.

Paper Slides Talk Bibtex
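
A minimal NumPy sketch of the band-limiting idea described above: convolve in the frequency domain and zero out the high-frequency part of the spectrum. The mask layout and the keep_fraction parameter are illustrative assumptions; the paper integrates band-limiting into CNN training with FFT-based layers.

import numpy as np

def band_limited_conv2d(image, kernel, keep_fraction):
    """Convolve in the frequency domain while zeroing the high-frequency part
    of the combined spectrum. keep_fraction controls how much of the spectrum
    survives; discarding more saves compute and memory in an FFT-based layer."""
    H, W = image.shape
    spectrum = np.fft.fft2(image) * np.fft.fft2(kernel, s=(H, W))
    kh = max(int(H * keep_fraction / 2), 1)
    kw = max(int(W * keep_fraction / 2), 1)
    mask = np.zeros((H, W))
    for rows in (slice(None, kh), slice(-kh, None)):
        for cols in (slice(None, kw), slice(-kw, None)):
            mask[rows, cols] = 1.0   # low frequencies sit in the corners of the FFT layout
    return np.real(np.fft.ifft2(spectrum * mask))

rng = np.random.default_rng(0)
img, ker = rng.normal(size=(32, 32)), rng.normal(size=(3, 3))
full = band_limited_conv2d(img, ker, keep_fraction=1.0)
half = band_limited_conv2d(img, ker, keep_fraction=0.5)
print(np.abs(full - half).mean())   # error introduced by discarding high frequencies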
Auto-recommendation of hybrid physical designs

We extend the Database Engine Tuning Advisor for Microsoft SQL Server to recommend a suitable combination of B+ tree and columnstore indexes for a given workload. Through extensive experiments using industry-standard benchmarks and several real-world customer workloads, we quantify how a physical design tool capable of recommending hybrid physical designs can result in orders of magnitude better execution costs compared to approaches that rely either on columnstore-only or B+ tree-only designs.

Paper Slides Bibtex
BigDAWG

An open source project from researchers within the Intel Science and Technology Center for Big Data (ISTC). BigDAWG is a reference implementation of a polystore database. A polystore system is any database management system (DBMS) that is built on top of multiple, heterogeneous, integrated storage engines. I worked on the scaffolding of the system and then implemented a cast operator to move data between diverse DBMSs.

Paper Slides Bibtex
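
A toy sketch of such a cast operator: export a table from one engine into an intermediate CSV form and load it into another. SQLite stands in here for the heterogeneous engines (PostgreSQL, SciDB, S-Store, Accumulo) the real BigDAWG migrator targets, and the real migrator supports richer formats and transfer-path optimizations than this illustration.

import csv
import io
import sqlite3

def cast(src: sqlite3.Connection, dst: sqlite3.Connection, table: str) -> None:
    """Toy 'cast' operator: dump a table from the source engine into an
    intermediate CSV buffer and load it into the destination engine.
    (Toy only: values pass through text and the table name is not escaped.)"""
    buffer = io.StringIO()
    cursor = src.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cursor.description]
    csv.writer(buffer).writerows(cursor)
    buffer.seek(0)
    dst.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(columns)})")
    dst.executemany(
        f"INSERT INTO {table} VALUES ({', '.join('?' for _ in columns)})",
        csv.reader(buffer),
    )
    dst.commit()

# toy usage with two in-memory SQLite "engines"
a, b = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
a.execute("CREATE TABLE t (id, name)")
a.executemany("INSERT INTO t VALUES (?, ?)", [(1, "x"), (2, "y")])
cast(a, b, "t")
print(b.execute("SELECT * FROM t").fetchall())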
Data Loading

We built an automated testing infrastructure to benchmark the loading performance of several commercial and open-source databases, performed an in-depth analysis to identify bottlenecks in the data loading process, and investigated novel techniques to accelerate DBMS data loading.

Paper Slides Bibtex

Contact

Adam Dziedzic

The best way to contact me is through email.

My public PGP key.

© 2024 Adam Dziedzic