Adam Dziedzic

I am a Tenure Track Faculty Member at CISPA, where I co-lead the SprintML group. My research is focused on secure and trustworthy Machine Learning as a Service (MLaaS). I design robust and reliable machine learning methods for training and inference of ML models while preserving data privacy and model confidentiality.

I was a Postdoctoral Fellow at the Vector Institute and the University of Toronto, where I was a member of the CleverHans Lab advised by Prof. Nicolas Papernot. I earned my PhD at the University of Chicago, where I was advised by Prof. Sanjay Krishnan and worked on input and model compression for adaptive and robust neural networks. I obtained my Bachelor's and Master's degrees from Warsaw University of Technology in Poland, studied at DTU (Technical University of Denmark), and carried out research at EPFL in Switzerland. I have also worked at CERN (Geneva, Switzerland), Barclays Investment Bank (London, UK), Microsoft Research (Redmond, USA), and Google (Madison, USA).

Hiring: we are looking for ambitious students who would like to work with us in our SprintML group at CISPA. Please feel free to email me if you are interested in this opportunity.

Email: adam.dziedzic@sprintml.com (my public PGP key)
Address: CISPA Helmholtz Center for Information Security, Stuhlsatzenhaus 5, 66123 Saarbrücken, Germany

Selected Publications

Please also check the Full List of my publications and my Google Scholar profile.

  1. Memorization in Self-Supervised Learning Improves Downstream Generalization
    Wenhao Wang, Muhammad Ahmad Kaleem, Adam Dziedzic, Michael Backes, Nicolas Papernot, Franziska Boenisch
    In The Twelfth International Conference on Learning Representations (ICLR) 2024

    Paper Poster Code

    Self-supervised learning (SSL) has recently received significant attention due to its ability to train high-performance encoders purely on unlabeled data—often scraped from the internet. This data can still be sensitive and empirical evidence suggests that SSL encoders memorize private information of their training data and can disclose them at inference time. Since existing theoretical definitions of memorization from supervised learning rely on labels, they do not transfer to SSL. To address this gap, we propose SSLMem, a framework for defining memorization within SSL. Our definition compares the difference in alignment of representations for data points and their augmented views returned by both encoders that were trained on these data points and encoders that were not. Through comprehensive empirical analysis on diverse encoder architectures and datasets we highlight that even though SSL relies on large datasets and strong augmentations—both known in supervised learning as regularization techniques that reduce overfitting—still significant fractions of training data points experience high memorization. Through our empirical results, we show that this memorization is essential for encoders to achieve higher generalization performance on different downstream tasks.
    @inproceedings{wang2024memorization,
      title = {Memorization in Self-Supervised Learning Improves Downstream Generalization},
      author = {Wang, Wenhao and Kaleem, Muhammad Ahmad and Dziedzic, Adam and Backes, Michael and Papernot, Nicolas and Boenisch, Franziska},
      booktitle = {The Twelfth International Conference on Learning Representations (ICLR)},
      year = {2024}
    }
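
    A minimal NumPy sketch of the alignment-difference idea from the abstract above. It is illustrative only, not the paper's SSLMem implementation: the random linear "encoders", the pairwise-L2 alignment measure, and the normalization are assumptions made for brevity.

    import numpy as np

    def alignment(encoder, views):
        """Mean pairwise L2 distance between (normalized) representations of
        augmented views of the same data point; lower means better aligned."""
        reps = np.stack([encoder(v) for v in views])
        reps = reps / np.linalg.norm(reps, axis=1, keepdims=True)
        return float(np.mean([np.linalg.norm(reps[i] - reps[j])
                              for i in range(len(reps))
                              for j in range(i + 1, len(reps))]))

    def memorization_score(enc_with_x, enc_without_x, views_of_x):
        """Positive when the encoder trained on x aligns x's augmented views much
        more tightly than an encoder that never saw x."""
        return alignment(enc_without_x, views_of_x) - alignment(enc_with_x, views_of_x)

    # toy usage with random linear "encoders" and random "augmented views"
    rng = np.random.default_rng(0)
    W_f, W_g = rng.normal(size=(2, 16, 32))
    views = [rng.normal(size=32) for _ in range(4)]
    print(memorization_score(lambda v: W_f @ v, lambda v: W_g @ v, views))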
    
  2. Finding NeMo: Localizing Neurons Responsible For Memorization in Diffusion Models
    Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, Franziska Boenisch
    In Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS) 2024

    Paper

    Diffusion models (DMs) produce very detailed and high-quality images. Their power results from extensive training on large amounts of data, usually scraped from the internet without proper attribution or consent from content creators. Unfortunately, this practice raises privacy and intellectual property concerns, as DMs can memorize and later reproduce their potentially sensitive or copyrighted training images at inference time. Prior efforts prevent this issue by either changing the input to the diffusion process, thereby preventing the DM from generating memorized samples during inference, or removing the memorized data from training altogether. While those are viable solutions when the DM is developed and deployed in a secure and constantly monitored environment, they hold the risk of adversaries circumventing the safeguards and are not effective when the DM itself is publicly released. To solve the problem, we introduce NeMo, the first method to localize memorization of individual data samples down to the level of neurons in DMs’ cross-attention layers. Through our experiments, we make the intriguing finding that in many cases, single neurons are responsible for memorizing particular training samples. By deactivating these memorization neurons, we can avoid the replication of training data at inference time, increase the diversity in the generated outputs, and mitigate the leakage of private and copyrighted data. In this way, our NeMo contributes to a more responsible deployment of DMs.
    @inproceedings{hintersdorf2024MemorizationDiffusionModels,
      title = {Finding NeMo: Localizing Neurons Responsible For Memorization in Diffusion Models},
      author = {Hintersdorf, Dominik and Struppek, Lukas and Kersting, Kristian and Dziedzic, Adam and Boenisch, Franziska},
      year = {2024},
      booktitle = {Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS)}
    }
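
    A short PyTorch sketch of the deactivation step only. It does not reproduce NeMo's localization procedure (which is the paper's contribution); it merely shows, for a hypothetical layer and a hypothetical list of already-identified neuron indices, how zeroing those units at inference time might look.

    import torch

    def deactivate_neurons(module: torch.nn.Module, neuron_indices: list[int]):
        """Register a forward hook that zeroes selected output units of a layer,
        e.g. units previously identified as memorizing a training image."""
        def hook(_module, _inputs, output):
            output = output.clone()
            output[..., neuron_indices] = 0.0
            return output
        return module.register_forward_hook(hook)

    # toy usage on a stand-in linear layer (not a diffusion model's cross-attention)
    layer = torch.nn.Linear(8, 8)
    handle = deactivate_neurons(layer, neuron_indices=[2, 5])
    print(layer(torch.randn(1, 8))[0, [2, 5]])  # the selected units are zeroed
    handle.remove()                             # restores the original behavior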
    
  3. Localizing Memorization in SSL Vision Encoders
    Wenhao Wang, Adam Dziedzic, Michael Backes, Franziska Boenisch
    In Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS) 2024

    Paper

    Recent work on studying memorization in self-supervised learning (SSL) suggests that even though SSL encoders are trained on millions of images, they still memorize individual data points. While effort has been put into characterizing the memorized data and linking encoder memorization to downstream utility, little is known about where the memorization happens inside SSL encoders. To close this gap, we propose two metrics for localizing memorization in SSL encoders on a per-layer (layermem) and per-unit basis (unitmem). Our localization methods are independent of the downstream task, do not require any label information, and can be performed in a forward pass. By localizing memorization in various encoder architectures (convolutional and transformer-based) trained on diverse datasets with contrastive and non-contrastive SSL frameworks, we find that (1) while SSL memorization increases with layer depth, highly memorizing units are distributed across the entire encoder, (2) a significant fraction of units in SSL encoders experiences surprisingly high memorization of individual data points, which is in contrast to models trained under supervision, (3) atypical (or outlier) data points cause much higher layer and unit memorization than standard data points, and (4) in vision transformers, most memorization happens in the fully-connected layers. Finally, we show that localizing memorization in SSL has the potential to improve fine-tuning and to inform pruning strategies.
    @inproceedings{wang2024LocalizeMemorizationSSL,
      title = {Localizing Memorization in SSL Vision Encoders},
      author = {Wang, Wenhao and Dziedzic, Adam and Backes, Michael and Boenisch, Franziska},
      year = {2024},
      booktitle = {Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS)}
    }
    
  4. LLM Dataset Inference: Did you train on my dataset?
    Pratyush Maini, Hengrui Jia, Nicolas Papernot, Adam Dziedzic
    In Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS) 2024

    Paper

    The proliferation of large language models (LLMs) in the real world has come with a rise in copyright cases against companies for training their models on unlicensed data from the internet. Recent works have presented methods to identify if individual text sequences were members of the model’s training data, known as membership inference attacks (MIAs). We demonstrate that the apparent success of these MIAs is confounded by selecting non-members (text sequences not used for training) belonging to a different distribution from the members (e.g., temporally shifted recent Wikipedia articles compared with ones used to train the model). This distribution shift makes membership inference appear successful. However, most MIA methods perform no better than random guessing when discriminating between members and non-members from the same distribution (e.g., in this case, the same period of time). Even when MIAs work, we find that different MIAs succeed at inferring membership of samples from different distributions. Instead, we propose a new dataset inference method to accurately identify the datasets used to train large language models. This paradigm sits realistically in the modern-day copyright landscape, where authors claim that an LLM is trained over multiple documents (such as a book) written by them, rather than one particular paragraph. While dataset inference shares many of the challenges of membership inference, we solve it by selectively combining the MIAs that provide positive signal for a given distribution, and aggregating them to perform a statistical test on a given dataset. Our approach successfully distinguishes the train and test sets of different subsets of the Pile with statistically significant p-values < 0.1, without any false positives.
    @inproceedings{maini2024LLMDatasetInference,
      title = {LLM Dataset Inference: Did you train on my dataset?},
      author = {Maini, Pratyush and Jia, Hengrui and Papernot, Nicolas and Dziedzic, Adam},
      year = {2024},
      booktitle = {Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS)}
    }
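
    A toy sketch of the final statistical-test step, assuming the per-example MIA signals have already been selected and aggregated into a single score per example. The synthetic scores and the one-sided Welch t-test below are illustrative; the paper's aggregation pipeline is more involved.

    import numpy as np
    from scipy import stats

    def dataset_inference(suspect_scores, nonmember_scores, alpha=0.1):
        """One-sided Welch t-test: are aggregated membership scores on the suspect
        dataset significantly higher than on known non-members drawn from the
        same distribution? Returns the p-value and the decision at level alpha."""
        t_stat, p_two_sided = stats.ttest_ind(suspect_scores, nonmember_scores,
                                              equal_var=False)
        p = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
        return p, p < alpha

    # toy usage with synthetic, already-aggregated MIA scores
    rng = np.random.default_rng(0)
    trained_on = rng.normal(loc=0.6, scale=0.2, size=500)   # members score higher
    not_trained = rng.normal(loc=0.5, scale=0.2, size=500)
    print(dataset_inference(trained_on, not_trained))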
    
  5. Open LLMs are Necessary for Current Private Adaptations and Outperform their Closed Alternatives
    Vincent Hanke, Tom Blanchard, Franziska Boenisch, Iyiola Emmanuel Olatunji, Michael Backes, Adam Dziedzic
    In Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS) 2024

    Paper

    While open Large Language Models (LLMs) have made significant progress, they still fall short of matching the performance of their closed, proprietary counterparts, making the latter attractive even for the use on highly private data. Recently, various new methods have been proposed to adapt closed LLMs to private data without leaking private information to third parties and/or the LLM provider. In this work, we analyze the privacy protection and performance of the four most recent methods for private adaptation of closed LLMs. By examining their threat models and thoroughly comparing their performance under different privacy levels according to differential privacy (DP), various LLM architectures, and multiple datasets for classification and generation tasks, we find that: (1) all the methods leak query data, i.e., the (potentially sensitive) user data that is queried at inference time, to the LLM provider, (2) three out of four methods also leak large fractions of private training data to the LLM provider while the method that protects private data requires a local open LLM, (3) all the methods exhibit lower performance compared to three private gradient-based adaptation methods for local open LLMs, and (4) the private adaptation methods for closed LLMs incur higher monetary training and query costs than running the alternative methods on local open LLMs. This yields the conclusion that, to achieve truly privacy-preserving LLM adaptations that yield high performance and more privacy at lower costs, taking into account current methods and models, one should use open LLMs.
    @inproceedings{hanke2024openLLMs,
      title = {Open LLMs are Necessary for Current Private Adaptations and Outperform their Closed Alternatives},
      author = {Hanke, Vincent and Blanchard, Tom and Boenisch, Franziska and Olatunji, Iyiola Emmanuel and Backes, Michael and Dziedzic, Adam},
      year = {2024},
      booktitle = {Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS)}
    }
    
  6. Decentralised, Collaborative, and Privacy-preserving Machine Learning for Multi-Hospital Data
    Congyu Fang, Adam Dziedzic, Lin Zhang, Laura Oliva, Amol Verma, Fahad Razak, Nicolas Papernot, Bo Wang
    In eBioMedicine 2024

    Paper

    Machine Learning (ML) has demonstrated its great potential on medical data analysis. Large datasets collected from diverse sources and settings are essential for ML models in healthcare to achieve better accuracy and generalizability. Sharing data across different healthcare institutions is challenging because of complex and varying privacy and regulatory requirements. Hence, it is hard but crucial to allow multiple parties to collaboratively train an ML model leveraging the private datasets available at each party without the need for direct sharing of those datasets or compromising the privacy of the datasets through collaboration. In this paper, we address this challenge by proposing Decentralized, Collaborative, and Privacy-preserving ML for Multi-Hospital Data (DeCaPH). It offers the following key benefits: (1) it allows different parties to collaboratively train an ML model without transferring their private datasets; (2) it safeguards patient privacy by limiting the potential privacy leakage arising from any contents shared across the parties during the training process; and (3) it facilitates the ML model training without relying on a centralized server. We demonstrate the generalizability and power of DeCaPH on three distinct tasks using real-world distributed medical datasets: patient mortality prediction using electronic health records, cell-type classification using single-cell human genomes, and pathology identification using chest radiology images. We demonstrate that the ML models trained with DeCaPH framework have an improved utility-privacy trade-off, showing it enables the models to have good performance while preserving the privacy of the training data points. In addition, the ML models trained with DeCaPH framework in general outperform those trained solely with the private datasets from individual parties, showing that DeCaPH enhances the model generalizability.
    @inproceedings{fang2024collaborative,
      title = {Decentralised, Collaborative, and Privacy-preserving Machine Learning for Multi-Hospital Data},
      author = {Fang, Congyu and Dziedzic, Adam and Zhang, Lin and Oliva, Laura and Verma, Amol and Razak, Fahad and Papernot, Nicolas and Wang, Bo},
      booktitle = {eBioMedicine},
      year = {2024}
    }
    
  7. Private Multi-Winner Voting for Machine Learning
    Adam Dziedzic, Christopher A Choquette-Choo, Natalie Dullerud, Vinith Menon Suriyakumar, Ali Shahin Shamsabadi, Muhammad Ahmad Kaleem, Somesh Jha, Nicolas Papernot, Xiao Wang
    In Privacy Enhancing Technologies Symposium (PETS) 2023

    Paper Slides Video Code

    Private multi-winner voting is the task of revealing k-hot binary vectors satisfying a bounded differential privacy (DP) guarantee. This task has been understudied in machine learning literature despite its prevalence in many domains such as healthcare. We propose three new DP multi-winner mechanisms: Binary, τ, and Powerset voting. Binary voting operates independently per label through composition. τ voting bounds votes optimally in their ℓ2 norm for tight data-independent guarantees. Powerset voting operates over the entire binary vector by viewing the possible outcomes as a power set. Our theoretical and empirical analysis shows that Binary voting can be a competitive mechanism on many tasks unless there are strong correlations between labels, in which case Powerset voting outperforms it. We use our mechanisms to enable privacy-preserving multi-label learning in the central setting by extending the canonical single-label technique: PATE. We find that our techniques outperform current state-of-the-art approaches on large, real-world healthcare data and standard multi-label benchmarks. We further enable multi-label confidential and private collaborative (CaPC) learning and show that model performance can be significantly improved in the multi-site setting.
    @inproceedings{multilabel2023pets,
      title = {Private Multi-Winner Voting for Machine Learning},
      author = {Dziedzic, Adam and Choquette-Choo, Christopher A and Dullerud, Natalie and Suriyakumar, Vinith Menon and Shamsabadi, Ali Shahin and Kaleem, Muhammad Ahmad and Jha, Somesh and Papernot, Nicolas and Wang, Xiao},
      booktitle = {Privacy Enhancing Technologies Symposium (PETS)},
      year = {2023}
    }
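
    A toy sketch of per-label (Binary-style) noisy voting. The Gaussian noise and the majority threshold below are assumptions made for illustration; the paper's exact mechanisms and privacy accounting differ in detail.

    import numpy as np

    def binary_voting(teacher_votes, sigma, rng=None):
        """Per-label noisy aggregation of k-hot teacher votes: for each label,
        add Gaussian noise to the count of positive votes and switch the label
        on if the noisy count exceeds half of the ensemble. Each label is
        released independently, so privacy composes across labels."""
        rng = rng or np.random.default_rng()
        counts = teacher_votes.sum(axis=0).astype(float)
        counts += rng.normal(scale=sigma, size=counts.shape)
        return (counts > teacher_votes.shape[0] / 2).astype(int)

    # toy usage: 50 teachers voting over 5 labels
    rng = np.random.default_rng(0)
    votes = rng.integers(0, 2, size=(50, 5))
    print(binary_voting(votes, sigma=4.0, rng=rng))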
    
  8. Individualized PATE: Differentially Private Machine Learning with Individual Privacy Guarantees
    Franziska Boenisch, Christopher Mühl, Roy Rinberg, Jannis Ihrig, Adam Dziedzic
    In Privacy Enhancing Technologies Symposium (PETS) 2023

    Paper

    Applying machine learning (ML) to sensitive domains requires privacy protection of the underlying training data through formal privacy frameworks, such as differential privacy (DP). Yet, usually, the privacy of the training data comes at the cost of the resulting ML models’ utility. One reason for this is that DP uses one uniform privacy budget epsilon for all training data points, which has to align with the strictest privacy requirement encountered among all data holders. In practice, different data holders have different privacy requirements and data points of data holders with lower requirements can contribute more information to the training process of the ML models. To account for this need, we propose two novel methods based on the Private Aggregation of Teacher Ensembles (PATE) framework to support the training of ML models with individualized privacy guarantees. We formally describe the methods, provide a theoretical analysis of their privacy bounds, and experimentally evaluate their effect on the final model’s utility using the MNIST, SVHN, and Adult income datasets. Our empirical results show that the individualized privacy methods yield ML models of higher accuracy than the non-individualized baseline. Thereby, we improve the privacy-utility trade-off in scenarios in which different data holders consent to contribute their sensitive data at different individual privacy levels.
    @inproceedings{pate2023pets,
      author = {Boenisch, Franziska and Mühl, Christopher and Rinberg, Roy and Ihrig, Jannis and Dziedzic, Adam},
      title = {Individualized PATE: Differentially Private Machine Learning with Individual Privacy Guarantees},
      booktitle = {Privacy Enhancing Technologies Symposium (PETS)},
      year = {2023}
    }
    
  9. Bucks for Buckets (B4B): Active Defenses Against Stealing Encoders
    Jan Dubiński, Stanisław Pawlak, Franziska Boenisch, Tomasz Trzcinski, Adam Dziedzic
    In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023

    Paper Poster Slides Video Code

    Machine Learning as a Service (MLaaS) APIs provide ready-to-use and high-utility encoders that generate vector representations for given inputs. Since these encoders are very costly to train, they become lucrative targets for model stealing attacks during which an adversary leverages query access to the API to replicate the encoder locally at a fraction of the original training costs. We propose *Bucks for Buckets (B4B)*, the first *active defense* that prevents stealing while the attack is happening without degrading representation quality for legitimate API users. Our defense relies on the observation that the representations returned to adversaries who try to steal the encoder’s functionality cover a significantly larger fraction of the embedding space than representations of legitimate users who utilize the encoder to solve a particular downstream task. B4B leverages this to adaptively adjust the utility of the returned representations according to a user’s coverage of the embedding space. To prevent adaptive adversaries from eluding our defense by simply creating multiple user accounts (sybils), B4B also individually transforms each user’s representations. This prevents the adversary from directly aggregating representations over multiple accounts to create their stolen encoder copy. Our active defense opens a new path towards securely sharing and democratizing encoders over public APIs.
    @inproceedings{dubinski2023bucks,
      title = {Bucks for Buckets (B4B): Active Defenses Against Stealing Encoders},
      author = {Dubiński, Jan and Pawlak, Stanisław and Boenisch, Franziska and Trzcinski, Tomasz and Dziedzic, Adam},
      booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
      year = {2023}
    }
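
    A toy sketch of the coverage idea: hash returned representations into LSH-style buckets and scale the added noise with the fraction of buckets a user has touched. The hashing scheme, noise schedule, and the omitted per-user transformations are simplified stand-ins, not B4B's actual design.

    import numpy as np

    class CoverageTracker:
        """Hash returned representations into buckets via random hyperplanes and
        add noise that grows with the fraction of buckets a user has covered."""

        def __init__(self, dim, n_planes=12, max_noise=1.0, seed=0):
            self.rng = np.random.default_rng(seed)
            self.planes = self.rng.normal(size=(n_planes, dim))
            self.buckets = set()
            self.n_buckets = 2 ** n_planes
            self.max_noise = max_noise

        def respond(self, representation):
            bits = (self.planes @ representation > 0).astype(int)
            self.buckets.add(int("".join(map(str, bits)), 2))
            coverage = len(self.buckets) / self.n_buckets
            noise = self.rng.normal(scale=self.max_noise * coverage,
                                    size=representation.shape)
            return representation + noise

    tracker = CoverageTracker(dim=64)
    rep = np.random.default_rng(1).normal(size=64)
    print(np.linalg.norm(tracker.respond(rep) - rep))  # tiny for a low-coverage user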
    
  10. Robust and Actively Secure Serverless Collaborative Learning
    Nicholas Franzese, Adam Dziedzic, Christopher A. Choquette-Choo, Mark R. Thomas, Muhammad Ahmad Kaleem, Stephan Rabanser, Congyu Fang, Somesh Jha, Nicolas Papernot, Xiao Wang
    In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023

    Paper Poster

    Collaborative machine learning (ML) is widely used to enable institutions to learn better models from distributed data. While collaborative approaches to learning intuitively protect user data, they remain vulnerable to either the server, the clients, or both, deviating from the protocol. Indeed, because the protocol is asymmetric, a malicious server can abuse its power to reconstruct client data points. Conversely, malicious clients can corrupt learning with malicious updates. Thus, both clients and servers require a guarantee when the other cannot be trusted to fully cooperate. In this work, we propose a peer-to-peer (P2P) learning scheme that is secure against malicious servers and robust to malicious clients. Our core contribution is a generic framework that transforms any (compatible) algorithm for robust aggregation of model updates to the setting where servers and clients can act maliciously. Finally, we demonstrate the computational efficiency of our approach even with 1-million parameter models trained by 100s of peers on standard datasets.
    @inproceedings{franzeses2023p2pml,
      title = {Robust and Actively Secure Serverless Collaborative Learning},
      author = {Franzese, Nicholas and Dziedzic, Adam and Choquette-Choo, Christopher A. and Thomas, Mark R. and Kaleem, Muhammad Ahmad and Rabanser, Stephan and Fang, Congyu and Jha, Somesh and Papernot, Nicolas and Wang, Xiao},
      booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
      year = {2023}
    }
    
  11. Flocks of Stochastic Parrots: Differentially Private Prompt Learning for Large Language Models
    Haonan Duan, Adam Dziedzic, Nicolas Papernot, Franziska Boenisch
    In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023

    Paper Slides Video Code Blog Post

    Large language models (LLMs) are excellent in-context learners. However, the sensitivity of data contained in prompts raises privacy concerns. Our work first shows that these concerns are valid: we instantiate a simple but highly effective membership inference attack against the data used to prompt LLMs. To address this vulnerability, one could forego prompting and resort to fine-tuning LLMs with known algorithms for private gradient descent. However, this comes at the expense of the practicality and efficiency offered by prompting. Therefore, we propose to privately learn to prompt. We first show that soft prompts can be obtained privately through gradient descent on downstream data. However, this is not the case for discrete prompts. Thus, we orchestrate a noisy vote among an ensemble of LLMs presented with different prompts, i.e., a flock of stochastic parrots. The vote privately transfers the flock’s knowledge into a single public prompt. We show that LLMs prompted with our private algorithms closely match the non-private baselines. For example, using GPT3 as the base model, we achieve a downstream accuracy of 92.7% on the sst2 dataset with strong differential privacy guarantees vs. 95.2% for the non-private baseline. Through our experiments, we also show that our prompt-based approach is easily deployed with existing commercial APIs.
    @inproceedings{duan2023flocks,
      title = {Flocks of Stochastic Parrots: Differentially Private Prompt Learning for Large Language Models},
      author = {Duan, Haonan and Dziedzic, Adam and Papernot, Nicolas and Boenisch, Franziska},
      booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
      year = {2023}
    }
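
    A toy sketch of the noisy vote among differently-prompted teachers for a single public query. Report-noisy-max with Gaussian noise is assumed here for illustration; the paper's mechanism and privacy accounting may differ.

    import numpy as np

    def noisy_vote(teacher_labels, num_classes, sigma, rng=None):
        """Report-noisy-max over the labels that an ensemble of differently
        prompted LLMs assigns to one public query; the winning label can then
        be used to build a single public prompt."""
        rng = rng or np.random.default_rng()
        counts = np.bincount(teacher_labels, minlength=num_classes).astype(float)
        counts += rng.normal(scale=sigma, size=num_classes)
        return int(np.argmax(counts))

    # toy usage: 20 prompted "teachers", binary sentiment labels
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=20)
    print(noisy_vote(labels, num_classes=2, sigma=2.0, rng=rng))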
    
  12. Have it your way: Individualized Privacy Assignment for DP-SGD
    Franziska Boenisch, Christopher Mühl, Adam Dziedzic, Roy Rinberg, Nicolas Papernot
    In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023

    Paper

    When training a machine learning model with differential privacy, one sets a privacy budget. This budget represents a maximal privacy violation that any user is willing to face by contributing their data to the training set. We argue that this approach is limited because different users may have different privacy expectations. Thus, setting a uniform privacy budget across all points may be overly conservative for some users or, conversely, not sufficiently protective for others. In this paper, we capture these preferences through individualized privacy budgets. To demonstrate their practicality, we introduce a variant of Differentially Private Stochastic Gradient Descent (DP-SGD) which supports such individualized budgets. DP-SGD is the canonical approach to training models with differential privacy. We modify its data sampling and gradient noising mechanisms to arrive at our approach, which we call Individualized DP-SGD (IDP-SGD). Because IDP-SGD provides privacy guarantees tailored to the preferences of individual users and their data points, we find it empirically improves privacy-utility trade-offs.
    @inproceedings{boenisch2023idpsgd,
      title = {Have it your way: Individualized Privacy Assignment for DP-SGD},
      author = {Boenisch, Franziska and Mühl, Christopher and Dziedzic, Adam and Rinberg, Roy and Papernot, Nicolas},
      year = {2023},
      booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
      eprint = {2303.17046},
      archiveprefix = {arXiv},
      primaryclass = {cs.LG}
    }
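
    A toy sketch of one half of the idea, individualized data sampling: each example's inclusion probability in a Poisson-sampled batch depends on the privacy level its owner chose. The per-group rates are hypothetical, and the complementary noise-scaling variant and the calibration from the paper are omitted.

    import numpy as np

    def individualized_poisson_batch(privacy_group, sample_rates, rng=None):
        """Poisson-sample a mini-batch where each example's inclusion probability
        depends on its owner's privacy level (a looser budget maps to a higher
        sampling rate, so the point contributes to more updates)."""
        rng = rng or np.random.default_rng()
        probs = np.array([sample_rates[int(g)] for g in privacy_group])
        return np.nonzero(rng.random(len(privacy_group)) < probs)[0]

    # toy usage: 10 points split across 3 hypothetical privacy groups
    groups = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
    rates = {0: 0.05, 1: 0.1, 2: 0.2}   # hypothetical per-group sampling rates
    print(individualized_poisson_batch(groups, rates, np.random.default_rng(0)))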
    
  13. On the privacy risk of in-context learning
    Haonan Duan, Adam Dziedzic, Mohammad Yaghini, Nicolas Papernot, Franziska Boenisch
    In The 61st Annual Meeting Of The Association For Computational Linguistics 2023

    Paper

Large language models (LLMs) are excellent few-shot learners. They can perform a wide variety of tasks purely based on natural language prompts provided to them. These prompts contain data of a specific downstream task—often the private dataset of a party, e.g., a company that wants to leverage the LLM for their purposes. We show that deploying prompted models presents a significant privacy risk for the data used within the prompt by instantiating a highly effective membership inference attack. We also observe that the privacy risk of prompted models exceeds that of fine-tuned models at the same utility levels. After identifying the model’s sensitivity to their prompts—in the form of a significantly higher prediction confidence on the prompted data—as a cause for the increased risk, we propose ensembling as a mitigation strategy. By aggregating over multiple different versions of a prompted model, membership inference risk can be decreased.
    @inproceedings{duan2023privacyICL,
      title = {On the privacy risk of in-context learning},
      author = {Duan, Haonan and Dziedzic, Adam and Yaghini, Mohammad and Papernot, Nicolas and Boenisch, Franziska},
      booktitle = {The 61st Annual Meeting Of The Association For Computational Linguistics},
      year = {2023}
    }
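
    A toy sketch of a confidence-thresholding membership test of the kind the abstract alludes to. The synthetic confidence distributions and the threshold value are illustrative stand-ins, not the paper's attack.

    import numpy as np

    def confidence_mia(confidence_on_true_label, threshold):
        """Flag a candidate as a prompt member when the prompted model assigns
        it a confidence above a threshold calibrated on reference data."""
        return confidence_on_true_label > threshold

    # toy usage: members of the prompt tend to receive higher confidence
    rng = np.random.default_rng(0)
    members = rng.beta(8, 2, size=100)
    nonmembers = rng.beta(4, 4, size=100)
    threshold = 0.8   # hypothetical calibrated value
    print(f"TPR={confidence_mia(members, threshold).mean():.2f}, "
          f"FPR={confidence_mia(nonmembers, threshold).mean():.2f}")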
    
  14. Dataset Inference for Self-Supervised Models
    Adam Dziedzic, Haonan Duan, Muhammad Ahmad Kaleem, Nikita Dhawan, Jonas Guan, Yannis Cattan, Franziska Boenisch, Nicolas Papernot
    In NeurIPS (Neural Information Processing Systems) 2022

    Paper Slides Video Code

    Self-supervised models are increasingly prevalent in machine learning (ML) since they reduce the need for expensively labeled data. Because of their versatility in downstream applications, they are increasingly used as a service exposed via public APIs. At the same time, these encoder models are particularly vulnerable to model stealing attacks due to the high dimensionality of vector representations they output. Yet, encoders remain undefended: existing mitigation strategies for stealing attacks focus on supervised learning. We introduce a new dataset inference defense, which uses the private training set of the victim encoder model to attribute its ownership in the event of stealing. The intuition is that the log-likelihood of an encoder’s output representations is higher on the victim’s training data than on test data if it is stolen from the victim, but not if it is independently trained. We compute this log-likelihood using density estimation models. As part of our evaluation, we also propose measuring the fidelity of stolen encoders and quantifying the effectiveness of the theft detection without involving downstream tasks; instead, we leverage mutual information and distance measurements. Our extensive empirical results in the vision domain demonstrate that dataset inference is a promising direction for defending self-supervised models against model stealing.
    @inproceedings{datasetinference2022neurips,
      title = {Dataset Inference for Self-Supervised Models},
      author = {Dziedzic, Adam and Duan, Haonan and Kaleem, Muhammad Ahmad and Dhawan, Nikita and Guan, Jonas and Cattan, Yannis and Boenisch, Franziska and Papernot, Nicolas},
      booktitle = {NeurIPS (Neural Information Processing Systems)},
      year = {2022}
    }
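
    A toy sketch of the log-likelihood comparison, using a Gaussian mixture as the density estimator. The paper's density models, representation extraction, and hypothesis test are more elaborate; the random linear "encoder" and synthetic data below are stand-ins.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def representation_likelihood_gap(suspect_encoder, train_data, test_data):
        """Fit a density model on the suspect encoder's representations of the
        victim's training data; a clearly higher mean log-likelihood on train
        than on test is evidence that the encoder was stolen from the victim."""
        train_reps = np.stack([suspect_encoder(x) for x in train_data])
        test_reps = np.stack([suspect_encoder(x) for x in test_data])
        density = GaussianMixture(n_components=8, random_state=0).fit(train_reps)
        return density.score(train_reps) - density.score(test_reps)

    # toy usage with a random linear "encoder" and synthetic data
    rng = np.random.default_rng(0)
    W = rng.normal(size=(16, 32))
    train = [rng.normal(size=32) for _ in range(200)]
    test = [rng.normal(size=32) for _ in range(200)]
    print(representation_likelihood_gap(lambda x: W @ x, train, test))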
    
  15. Increasing the Cost of Model Extraction with Calibrated Proof of Work
    Adam Dziedzic, Muhammad Ahmad Kaleem, Yu Shen Lu, Nicolas Papernot
In ICLR (International Conference on Learning Representations) [SPOTLIGHT] 2022

    Paper Slides Video Code Blog Post

    In model extraction attacks, adversaries can steal a machine learning model exposed via a public API by repeatedly querying it and adjusting their own model based on obtained predictions. To prevent model stealing, existing defenses focus on detecting malicious queries, truncating, or distorting outputs, thus necessarily introducing a tradeoff between robustness and model utility for legitimate users. Instead, we propose to impede model extraction by requiring users to complete a proof-of-work before they can read the model’s predictions. This deters attackers by greatly increasing (even up to 100x) the computational effort needed to leverage query access for model extraction. Since we calibrate the effort required to complete the proof-of-work to each query, this only introduces a slight overhead for regular users (up to 2x). To achieve this, our calibration applies tools from differential privacy to measure the information revealed by a query. Our method requires no modification of the victim model and can be applied by machine learning practitioners to guard their publicly exposed models against being easily stolen.
    @inproceedings{pow2022iclr,
      title = {Increasing the Cost of Model Extraction with Calibrated Proof of Work},
      author = {Dziedzic, Adam and Kaleem, Muhammad Ahmad and Lu, Yu Shen and Papernot, Nicolas},
booktitle = {ICLR (International Conference on Learning Representations) [SPOTLIGHT]},
      year = {2022}
    }
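
    A toy sketch of the mechanism: map an estimated per-query information cost to a hashcash-style puzzle the client must solve before reading the prediction. The mapping from cost to difficulty below is a made-up placeholder, not the paper's DP-based calibration.

    import hashlib
    import itertools

    def difficulty_for(information_cost, base_bits=8, scale=2.0):
        """Map an estimated per-query information cost (e.g. from a DP-style
        accountant) to the number of leading zero bits the client must find."""
        return base_bits + int(scale * information_cost)

    def solve_pow(challenge, bits):
        """Hashcash-style puzzle: find a nonce such that SHA-256(challenge||nonce)
        starts with `bits` zero bits; expected work doubles with each extra bit."""
        target = 1 << (256 - bits)
        for nonce in itertools.count():
            digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce

    # toy usage: a higher-information query receives a noticeably harder puzzle
    print(solve_pow(b"query-1", difficulty_for(information_cost=1.0)))
    print(solve_pow(b"query-2", difficulty_for(information_cost=4.0)))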
    
  16. On the Difficulty of Defending Self-Supervised Learning against Model Extraction
    Adam Dziedzic, Nikita Dhawan, Muhammad Ahmad Kaleem, Jonas Guan, Nicolas Papernot
    In ICML (International Conference on Machine Learning) 2022

    Paper Slides Video Code

    Self-Supervised Learning (SSL) is an increasingly popular ML paradigm that trains models to transform complex inputs into representations without relying on explicit labels. These representations encode similarity structures that enable efficient learning of multiple downstream tasks. Recently, ML-as-a-Service providers have commenced offering trained SSL models over inference APIs, which transform user inputs into useful representations for a fee. However, the high cost involved to train these models and their exposure over APIs both make black-box extraction a realistic security threat. We thus explore model stealing attacks against SSL. Unlike traditional model extraction on classifiers that output labels, the victim models here output representations; these representations are of significantly higher dimensionality compared to the low-dimensional prediction scores output by classifiers. We construct several novel attacks and find that approaches that train directly on a victim’s stolen representations are query efficient and enable high accuracy for downstream models. We then show that existing defenses against model extraction are inadequate and not easily retrofitted to the specificities of SSL.
    @inproceedings{sslextractions2022icml,
      title = {On the Difficulty of Defending Self-Supervised Learning against Model Extraction},
      author = {Dziedzic, Adam and Dhawan, Nikita and Kaleem, Muhammad Ahmad and Guan, Jonas and Papernot, Nicolas},
      booktitle = {ICML (International Conference on Machine Learning)},
      year = {2022}
    }
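
    A toy PyTorch sketch of the direct-regression attack the abstract mentions: train a local encoder to reproduce the representations returned by the victim API. The victim, the student architecture, and the plain MSE loss are simplified stand-ins for the attacks studied in the paper.

    import torch
    from torch import nn

    def steal_encoder(victim, queries, rep_dim, epochs=200, lr=1e-3):
        """Train a local encoder to reproduce the representations that the victim
        API returned for a set of query inputs (direct MSE regression)."""
        with torch.no_grad():
            targets = victim(queries)             # representations bought from the API
        student = nn.Sequential(nn.Linear(queries.shape[1], 256), nn.ReLU(),
                                nn.Linear(256, rep_dim))
        opt = torch.optim.Adam(student.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.mse_loss(student(queries), targets)
            loss.backward()
            opt.step()
        return student

    # toy usage with a random linear layer standing in for the victim encoder
    victim = nn.Linear(32, 64)
    stolen = steal_encoder(victim, torch.randn(512, 32), rep_dim=64)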
    
  17. CaPC Learning: Confidential and Private Collaborative Learning
    Christopher A. Choquette-Choo, Natalie Dullerud, Adam Dziedzic, Yunxiang Zhang, Somesh Jha, Nicolas Papernot, Xiao Wang
    In ICLR (International Conference on Learning Representations) 2021

    Paper Slides Video Code Blog Post

    Machine learning benefits from large training datasets, which may not always be possible to collect by any single entity, especially when using privacy-sensitive data. In many contexts, such as healthcare and finance, separate parties may wish to collaborate and learn from each other’s data but are prevented from doing so due to privacy regulations. Some regulations prevent explicit sharing of data between parties by joining datasets in a central location (confidentiality). Others also limit implicit sharing of data, e.g., through model predictions (privacy). There is currently no method that enables machine learning in such a setting, where both confidentiality and privacy need to be preserved, to prevent both explicit and implicit sharing of data. Federated learning only provides confidentiality, not privacy, since gradients shared still contain private information. Differentially private learning assumes unreasonably large datasets. Furthermore, both of these learning paradigms produce a central model whose architecture was previously agreed upon by all parties rather than enabling collaborative learning where each party learns and improves their own local model. We introduce Confidential and Private Collaborative (CaPC) learning, the first method provably achieving both confidentiality and privacy in a collaborative setting. We leverage secure multi-party computation (MPC), homomorphic encryption (HE), and other techniques in combination with privately aggregated teacher models. We demonstrate how CaPC allows participants to collaborate without having to explicitly join their training sets or train a central model. Each party is able to improve the accuracy and fairness of their model, even in settings where each party has a model that performs well on their own dataset or when datasets are not IID and model architectures are heterogeneous across parties.
    @inproceedings{capc2021iclr,
      title = {CaPC Learning: Confidential and Private Collaborative Learning},
      author = {Choquette-Choo, Christopher A. and Dullerud, Natalie and Dziedzic, Adam and Zhang, Yunxiang and Jha, Somesh and Papernot, Nicolas and Wang, Xiao},
      booktitle = {ICLR (International Conference on Learning Representations)},
      year = {2021}
    }
    
  18. Pretrained Transformers Improve Out-of-Distribution Robustness
    Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, Dawn Song
    In ACL (Association for Computational Linguistics) 2020

    Paper Slides Video Code

    Although pretrained Transformers such as BERT achieve high accuracy on in-distribution examples, do they generalize to new distributions? We systematically measure out-of-distribution (OOD) generalization for seven NLP datasets by constructing a new robustness benchmark with realistic distribution shifts. We measure the generalization of previous models including bag-of-words models, ConvNets, and LSTMs, and we show that pretrained Transformers’ performance declines are substantially smaller. Pretrained transformers are also more effective at detecting anomalous or OOD examples, while many previous models are frequently worse than chance. We examine which factors affect robustness, finding that larger models are not necessarily more robust, distillation can be harmful, and more diverse pretraining data can enhance robustness. Finally, we show where future work can improve OOD robustness.
    @inproceedings{hendrycks-etal-2020-pretrained,
      title = {Pretrained Transformers Improve Out-of-Distribution Robustness},
      author = {Hendrycks, Dan and Liu, Xiaoyuan and Wallace, Eric and Dziedzic, Adam and Krishnan, Rishabh and Song, Dawn},
      booktitle = { ACL (Association for Computational Linguistics)},
      month = jul,
      year = {2020},
      address = {Online},
      publisher = {ACL (Association for Computational Linguistics)},
      doi = {10.18653/v1/2020.acl-main.244},
      pages = {2744--2751}
    }
    
  19. Band-limited Training and Inference for Convolutional Neural Networks
Adam Dziedzic, Ioannis Paparrizos, Sanjay Krishnan, Aaron Elmore, Michael Franklin
    In ICML (International Conference on Machine Learning) 2019

    Paper Slides Video

The convolutional layers are core building blocks of neural network architectures. In general, a convolutional filter applies to the entire frequency spectrum of the input data. We explore artificially constraining the frequency spectra of these filters and data, called band-limiting, during training. The frequency domain constraints apply to both the feed-forward and back-propagation steps. Experimentally, we observe that Convolutional Neural Networks (CNNs) are resilient to this compression scheme, and the results suggest that CNNs learn to leverage lower-frequency components. In particular, we found that: (1) band-limited training can effectively control resource usage (GPU and memory); (2) models trained with band-limited layers retain high prediction accuracy; and (3) unlike other compression schemes, band-limited training requires no modification to existing training algorithms or neural network architectures.
    @inproceedings{dziedzic2019band,
      title = {Band-limited Training and Inference for Convolutional Neural Networks},
author = {Dziedzic, Adam and Paparrizos, Ioannis and Krishnan, Sanjay and Elmore, Aaron and Franklin, Michael},
      booktitle = {ICML (International Conference on Machine Learning)},
      year = {2019}
    }
    
  20. Columnstore and B+ Tree - Are Hybrid Physical Designs Important?
    Adam Dziedzic, Jingjing Wang, Sudipto Das, Bolin Ding, Vivek R. Narasayya, Manoj Syamala
    In SIGMOD (ACM Special Interest Group on Management of Data) 2018

    Paper Slides

    Commercial DBMSs, such as Microsoft SQL Server, cater to diverse workloads including transaction processing, decision support, and operational analytics. They also support variety in physical design structures such as B+ tree and columnstore. The benefits of B+ tree for OLTP workloads and columnstore for decision support workloads are well-understood. However, the importance of hybrid physical designs, consisting of both columnstore and B+ tree indexes on the same database, is not well-studied — a focus of this paper. We first quantify the trade-offs using carefully-crafted micro-benchmarks. This micro-benchmarking indicates that hybrid physical designs can result in orders of magnitude better performance depending on the workload. For complex real-world applications, choosing an appropriate combination of columnstore and B+ tree indexes for a database workload is challenging. We extend the Database Engine Tuning Advisor for Microsoft SQL Server to recommend a suitable combination of B+ tree and columnstore indexes for a given workload. Through extensive experiments using industry-standard benchmarks and several real-world customer workloads, we quantify how a physical design tool capable of recommending hybrid physical designs can result in orders of magnitude better execution costs compared to approaches that rely either on columnstore-only or B+ tree-only designs.
    @inproceedings{dziedzic2018index,
      title = {Columnstore and B+ Tree - Are Hybrid Physical Designs Important?},
      author = {Dziedzic, Adam and Wang, Jingjing and Das, Sudipto and Ding, Bolin and Narasayya, Vivek R. and Syamala, Manoj},
      booktitle = {SIGMOD (ACM Special Interest Group on Management of Data)},
      year = {2018}
    }
    
  21. Demonstrating the BigDAWG Polystore System for Ocean Metagenomics Analysis.
    Tim Mattson, Vijay Gadepally, Zuohao She, Adam Dziedzic, Jeff Parkhurst
    In CIDR (Conference on Innovative Data Systems Research) 2017

    Paper

In most Big Data applications, the data is heterogeneous. As we have been arguing in a series of papers, storage engines should be well suited to the data they hold. Therefore, a system supporting Big Data applications should be able to expose multiple storage engines through a single interface. We call such systems polystore systems. Our reference implementation of the polystore concept is called BigDAWG (short for the Big Data Analytics Working Group). In this demonstration, we will show the BigDAWG system and a number of polystore applications built to help ocean metagenomics researchers handle their heterogeneous Big Data.
    @inproceedings{mattson2017demonstrating,
      title = {Demonstrating the BigDAWG Polystore System for Ocean Metagenomics Analysis.},
      author = {Mattson, Tim and Gadepally, Vijay and She, Zuohao and Dziedzic, Adam and Parkhurst, Jeff},
      booktitle = {CIDR (Conference on Innovative Data Systems Research)},
      year = {2017}
    }
    
  22. DBMS Data Loading: An Analysis on Modern Hardware
    Adam Dziedzic, Manos Karpathiotakis, Ioannis Alagiannis, Raja Appuswamy, Anastasia Ailamaki
    In ADMS (Accelerating analytics and Data Management Systems) 2016

    Paper Slides

    Data loading has traditionally been considered a one-time deal - an offline process out of the critical path of query execution. The architecture of DBMS is aligned with this assumption. Nevertheless, the rate in which data is produced and gathered nowadays has nullified the one-off assumption, and has turned data loading into a major bottleneck of the data analysis pipeline. This paper analyzes the behavior of modern DBMSs in order to quantify their ability to fully exploit multicore processors and modern storage hardware during data loading. We examine multiple state-of-the-art DBMSs, a variety of hardware configurations, and a combination of synthetic and real-world datasets to identify bottlenecks in the data loading process and to provide guidelines on how to accelerate data loading. Our findings show that modern DBMSs are unable to saturate the available hardware resources. We therefore identify opportunities to accelerate data loading.
    @inproceedings{dziedzic2016dbms,
      title = {DBMS Data Loading: An Analysis on Modern Hardware},
      author = {Dziedzic, Adam and Karpathiotakis, Manos and Alagiannis, Ioannis and Appuswamy, Raja and Ailamaki, Anastasia},
      booktitle = {ADMS (Accelerating analytics and Data Management Systems)},
      year = {2016}
    }
    
  23. Data Transformation and Migration in Polystores
    Adam Dziedzic, Aaron Elmore, Michael Stonebraker
    In HPEC (IEEE High Performance Extreme Computing) 2016

    Paper Poster Slides

Ever-increasing data sizes and new requirements in data processing have fostered the development of many new database systems. The result is that many data-intensive applications are underpinned by different engines. To enable data mobility there is a need to transfer data between systems easily and efficiently. We analyze the state of the art of data migration and outline research opportunities for rapid data transfer. Our experiments explore data migration between a diverse set of databases, including PostgreSQL, SciDB, S-Store and Accumulo. Each of the systems excels at specific application requirements, such as transactional processing, numerical computation, streaming data, and large-scale text processing. Providing an efficient data migration tool is essential to take advantage of the superior processing offered by these specialized databases. Our goal is to build a data migration framework that takes advantage of recent advances in hardware and software.
    @inproceedings{dziedzic2016transformation,
      title = {Data Transformation and Migration in Polystores},
      author = {Dziedzic, Adam and Elmore, Aaron and Stonebraker, Michael},
      booktitle = {HPEC (IEEE High Performance Extreme Computing)},
      year = {2016},
      organization = {IEEE}
    }
    
  24. BigDAWG: a Polystore for Diverse Interactive Applications
    Adam Dziedzic, Jennie Duggan, Aaron J. Elmore, Vijay Gadepally, Michael Stonebraker
    In DSIA (IEEE Viz Data Systems for Interactive Analysis) 2015

    Paper

Interactive analytics requires low latency queries in the presence of diverse, complex, and constantly evolving workloads. To address these challenges, we introduce a polystore, BigDAWG, that tightly couples diverse database systems, data models, and query languages through use of semantically grouped Islands of Information. BigDAWG, which stands for the Big Data Working Group, seeks to provide location transparency by matching the right system to each workload using a black-box model of query and system performance. In this paper we introduce BigDAWG as a solution for diverse web-based interactive applications and motivate our key challenges in building BigDAWG. BigDAWG continues to evolve and, where applicable, we have noted the current status of its implementation.
    @inproceedings{dziedzic2015bigdawg,
      title = {BigDAWG: a Polystore for Diverse Interactive Applications},
      author = {Dziedzic, Adam and Duggan, Jennie and Elmore, Aaron J. and Gadepally, Vijay and Stonebraker, Michael},
      booktitle = {DSIA (IEEE Viz Data Systems for Interactive Analysis)},
      year = {2015}
    }
    

Research Talks

Experience

CISPA Helmholtz Center for Information Security

September 2023 - current: Tenure Track Faculty Member

My research is focused on secure and trustworthy Machine Learning as a Service (MLaaS). I design robust and reliable machine learning methods for training and inference of ML models while preserving data privacy and model confidentiality.

University of Toronto & Vector Institute

September 2020 - August 2023: Postdoctoral researcher

Research on collaborative, private, and robust Machine Learning.

University of Chicago

July 2015 - August 2020: PhD Student

Research on the intersection of robust machine learning and database management systems (DBMSs).

Google

June - September 2017: PhD Software Engineering Intern on the Data Infrastructure and Analysis team

Research on graceful degradation and avoidance of performance cliffs in the F1 system.

Microsoft Research

March - June 2017: Research Intern in the Data Management, Exploration and Mining group

Carried out research on hybrid physical designs for diverse workloads.

EPFL

October 2014 - June 2015: Research Intern

Research on data loading into diverse database management systems.

Warsaw University of Technology

October 2007 - September 2014: Bachelor's and Master's Student

I was awarded the academic scholarship for the faculty's best students (based on GPA).

Barclays Investment Bank

June - August 2013: Intern Analyst

Created a system for validating and suggesting underlyings for complex financial products.

CERN

April - December 2012: Technical Student in the IT Department

Designed a system to store information on the configuration and management of devices in the computer center.

Mobile Startup

March 2012: Udarnik

Worked on an application providing music-centered social interaction features.

Technical University of Denmark

August 2010 - January 2011: Erasmus Student

Applied statistics, Web 2.0 and mobile interactions, spatial databases, logic programming.

Tekten

July 2010: Designer and Software Engineer

Designed a database and developed an application for a telecom company in Java and PL/SQL.

Torn

July - September 2009: Software Engineer

Worked on a financial and accounting system project in Java and Oracle 10g.

Projects

Collaborative Learning in ML

Confidential and Private Collaborative (CaPC) learning is the first method to provably achieve both confidentiality and privacy in a collaborative setting, using techniques from the cryptography and differential privacy literature.

Paper Slides Talk Bibtex
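
A heavily simplified sketch of one CaPC-style query from the querying party's perspective. It is illustrative only: the function and parameter names are hypothetical, the noisy-argmax aggregation is a plain stand-in for the private aggregation analysis, and the confidentiality layer (secure multi-party computation and homomorphic encryption) is omitted entirely.

import numpy as np

def capc_label(query_rep, answering_parties, num_classes, sigma, rng=None):
    """Privacy half of a CaPC-style query, heavily simplified: each answering
    party predicts a label for the query and the result is released only
    through a noisy argmax. The confidentiality half (MPC/HE so that no party
    ever sees the raw query or the individual answers) is omitted in this toy."""
    rng = rng or np.random.default_rng()
    votes = np.zeros(num_classes)
    for predict in answering_parties:      # each party runs inference locally
        votes[predict(query_rep)] += 1
    return int(np.argmax(votes + rng.normal(scale=sigma, size=num_classes)))

# toy usage: 20 answering parties with random linear "classifiers"
rng = np.random.default_rng(0)
parties = [lambda x, W=rng.normal(size=(10, 32)): int(np.argmax(W @ x))
           for _ in range(20)]
label = capc_label(rng.normal(size=32), parties, num_classes=10, sigma=2.0)
# the querying party would add (query, label) to its training set and retrain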
Band-limited Training and Inference for Convolutional Neural Networks

The convolutional layers are core building blocks of neural network architectures. In general, a convolutional filter applies to the entire frequency spectrum of the input data. We explore artificially constraining the frequency spectra of these filters and data, called band-limiting, during training. The frequency domain constraints apply to both the feed-forward and back-propagation steps. Experimentally, we observe that Convolutional Neural Networks (CNNs) are resilient to this compression scheme, and the results suggest that CNNs learn to leverage lower-frequency components. In particular, we found that: (1) band-limited training can effectively control resource usage (GPU and memory); (2) models trained with band-limited layers retain high prediction accuracy; and (3) unlike other compression schemes, band-limited training requires no modification to existing training algorithms or neural network architectures.

Paper Slides Talk Bibtex
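
A minimal NumPy sketch of the band-limiting idea described above: convolve in the frequency domain and zero out the high-frequency part of the spectrum. The mask layout and the keep_fraction parameter are illustrative assumptions; the paper integrates band-limiting into CNN training with FFT-based layers.

import numpy as np

def band_limited_conv2d(image, kernel, keep_fraction):
    """Convolve in the frequency domain while zeroing the high-frequency part
    of the combined spectrum. keep_fraction controls how much of the spectrum
    survives; discarding more saves compute and memory in an FFT-based layer."""
    H, W = image.shape
    spectrum = np.fft.fft2(image) * np.fft.fft2(kernel, s=(H, W))
    kh = max(int(H * keep_fraction / 2), 1)
    kw = max(int(W * keep_fraction / 2), 1)
    mask = np.zeros((H, W))
    for rows in (slice(None, kh), slice(-kh, None)):
        for cols in (slice(None, kw), slice(-kw, None)):
            mask[rows, cols] = 1.0   # low frequencies sit in the corners of the FFT layout
    return np.real(np.fft.ifft2(spectrum * mask))

rng = np.random.default_rng(0)
img, ker = rng.normal(size=(32, 32)), rng.normal(size=(3, 3))
full = band_limited_conv2d(img, ker, keep_fraction=1.0)
half = band_limited_conv2d(img, ker, keep_fraction=0.5)
print(np.abs(full - half).mean())   # error introduced by discarding high frequencies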
Auto-recommendation of hybrid physical designs

We extend the Database Engine Tuning Advisor for Microsoft SQL Server to recommend a suitable combination of B+ tree and columnstore indexes for a given workload. Through extensive experiments using industry-standard benchmarks and several real-world customer workloads, we quantify how a physical design tool capable of recommending hybrid physical designs can result in orders of magnitude better execution costs compared to approaches that rely either on columnstore-only or B+ tree-only designs.

Paper Slides Bibtex
BigDAWG

An open source project from researchers within the Intel Science and Technology Center for Big Data (ISTC). BigDAWG is a reference implementation of a polystore database. A polystore system is any database management system (DBMS) that is built on top of multiple, heterogeneous, integrated storage engines. I worked on the scaffolding of the system and then implemented a cast operator to move data between diverse DBMSs.

Paper Slides Bibtex
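
A toy sketch of such a cast operator: export a table from one engine into an intermediate CSV form and load it into another. SQLite stands in here for the heterogeneous engines (PostgreSQL, SciDB, S-Store, Accumulo) the real BigDAWG migrator targets, and the real migrator supports richer formats and transfer-path optimizations than this illustration.

import csv
import io
import sqlite3

def cast(src: sqlite3.Connection, dst: sqlite3.Connection, table: str) -> None:
    """Toy 'cast' operator: dump a table from the source engine into an
    intermediate CSV buffer and load it into the destination engine.
    (Toy only: values pass through text and the table name is not escaped.)"""
    buffer = io.StringIO()
    cursor = src.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cursor.description]
    csv.writer(buffer).writerows(cursor)
    buffer.seek(0)
    dst.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(columns)})")
    dst.executemany(
        f"INSERT INTO {table} VALUES ({', '.join('?' for _ in columns)})",
        csv.reader(buffer),
    )
    dst.commit()

# toy usage with two in-memory SQLite "engines"
a, b = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
a.execute("CREATE TABLE t (id, name)")
a.executemany("INSERT INTO t VALUES (?, ?)", [(1, "x"), (2, "y")])
cast(a, b, "t")
print(b.execute("SELECT * FROM t").fetchall())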
Data Loading

We built an automated testing infrastructure to benchmark the loading performance of several commercial and open-source databases, performed an in-depth analysis to identify bottlenecks in the data loading process, and investigated novel techniques to accelerate DBMS data loading.

Paper Slides Bibtex

Contact

Adam Dziedzic

The best way to contact me is through email.

My public PGP key.

© 2024 Adam Dziedzic