publications

Publications

An up-to-date list is available on Google Scholar

2025

Efficient and Privacy-Preserving Soft Prompt Transfer for LLMs
Xun Wang, Jing Xu, Franziska Boenisch, Michael Backes, Christopher A. Choquette-Choo, Adam Dziedzic
In Forty-Second International Conference on Machine Learning (ICML) 2025
Paper Code

A method for efficient and privacy-preserving transfer of soft prompts tuned on a distilled small model to a larger model using public data.

Prompting has become a dominant paradigm for adapting large language models (LLMs). While discrete (textual) prompts are widely used for their interpretability, soft (parameter) prompts have recently gained traction in APIs. This is because they can encode information from more training samples while minimizing the user’s token usage, leaving more space in the context window for task-specific input. However, soft prompts are tightly coupled to the LLM they are tuned on, limiting their generalization to other LLMs. This constraint is particularly problematic for efficiency and privacy: (1) tuning prompts on each LLM incurs high computational costs, especially as LLMs continue to grow in size. Additionally, (2) when the LLM is hosted externally, soft prompt tuning often requires sharing private data with the LLM provider. For instance, this is the case with the NVIDIA NeMo API. To address these issues, we propose POST (Privacy Of Soft prompt Transfer), a framework that enables private tuning of soft prompts on a small model and subsequently transfers these prompts to a larger LLM. POST uses knowledge distillation to derive a small model directly from the large LLM to improve prompt transferability, tunes the soft prompt locally, optionally with differential privacy guarantees, and transfers it back to the larger LLM using a small public dataset. Our experiments show that POST reduces computational costs, preserves privacy, and effectively transfers high-utility soft prompts.
```
@inproceedings{wang2025post,
  title = {Efficient and Privacy-Preserving Soft Prompt Transfer for LLMs},
  author = {Wang, Xun and Xu, Jing and Boenisch, Franziska and Backes, Michael and Choquette-Choo, Christopher A. and Dziedzic, Adam},
  year = {2025},
  booktitle = {Forty-Second International Conference on Machine Learning (ICML)}
}
```
Precise Parameter Localization for Textual Generation in Diffusion Models
Łukasz Staniszewski, Bartosz Cywiński, Franziska Boenisch, Kamil Deja, Adam Dziedzic
In The Thirteenth International Conference on Learning Representations 2025
Paper

Novel diffusion models (DMs) can synthesize photo-realistic images with integrated high-quality text. Surprisingly, we demonstrate through attention activation patching that only less than 1% of DMs’ parameters contained in attention layers influence the generation of textual content within the images. Building on this observation, by precisely targeting cross and joint attention layers of DMs, we improve the efficiency and performance of textual generation. We introduce several applications that benefit from localizing the layers responsible for textual content generation. We first show that a LoRA-based fine-tuning solely of the localized layers enhances, even more, the general text-generation capabilities of large DMs while preserving the quality and diversity of the DMs’ generations. Then, we demonstrate how we can use the localized layers to edit textual content in generated images. Finally, we extend this idea to the practical use case of preventing the generation of toxic text in a cost-free manner. In contrast to prior work, our localization approach is broadly applicable across various diffusion model architectures, including U-Net (e.g., LDM and SDXL) and transformer-based (e.g., DeepFloyd IF and Stable Diffusion 3), utilizing diverse text encoders (e.g., from CLIP and the large language models like T5).
```
@inproceedings{staniszewski2025precise,
  title = {Precise Parameter Localization for Textual Generation in Diffusion Models},
  author = {Staniszewski, Łukasz and Cywi{\'n}ski, Bartosz and Boenisch, Franziska and Deja, Kamil and Dziedzic, Adam},
  booktitle = {The Thirteenth International Conference on Learning Representations},
  year = {2025}
}
```
Captured by Captions: On Memorization and its Mitigation in CLIP Models
Wenhao Wang, Adam Dziedzic, Grace C. Kim, Michael Backes, Franziska Boenisch
In The Thirteenth International Conference on Learning Representations 2025
Paper

Multi-modal models, such as CLIP, have demonstrated strong performance in aligning visual and textual representations, excelling in tasks like image retrieval and zero-shot classification. Despite this success, the mechanisms by which these models utilize training data, particularly the role of memorization, remain unclear. In uni-modal models, both supervised and self-supervised, memorization has been shown to be essential for generalization. However, it is not well understood how these findings would apply to CLIP, which incorporates elements from both supervised learning via captions that provide a supervisory signal similar to labels, and from self-supervised learning via the contrastive objective. To bridge this gap in understanding, we propose a formal definition of memorization in CLIP (CLIPMem) and use it to quantify memorization in CLIP models. Our results indicate that CLIP’s memorization behavior falls between the supervised and self-supervised paradigms, with "mis-captioned" samples exhibiting highest levels of memorization. Additionally, we find that the text encoder contributes more to memorization than the image encoder, suggesting that mitigation strategies should focus on the text domain. Building on these insights, we propose multiple strategies to reduce memorization while at the same time improving utility—something that had not been shown before for traditional learning paradigms where reducing memorization typically results in utility decrease.
```
@inproceedings{wang2025captured,
  title = {Captured by Captions: On Memorization and its Mitigation in {CLIP} Models},
  author = {Wang, Wenhao and Dziedzic, Adam and Kim, Grace C. and Backes, Michael and Boenisch, Franziska},
  booktitle = {The Thirteenth International Conference on Learning Representations},
  year = {2025}
}
```
Differentially Private Federated Learning with Time-Adaptive Privacy Spending
Shahrzad Kiani, Nupur Kulkarni, Adam Dziedzic, Stark Draper, Franziska Boenisch
In The Thirteenth International Conference on Learning Representations (ICLR) 2025
Paper

Federated learning (FL) with differential privacy (DP) provides a framework for collaborative machine learning, enabling clients to train a shared model while adhering to strict privacy constraints. The framework allows each client to have an individual privacy guarantee, e.g., by adding different amounts of noise to each client’s model updates. One underlying assumption is that all clients spend their privacy budgets uniformly over time (learning rounds). However, it has been shown in the literature that learning in early rounds typically focuses on more coarse-grained features that can be learned at lower signal-to-noise ratios while later rounds learn fine-grained features that benefit from higher signal-to-noise ratios. Building on this intuition, we propose a time-adaptive DP-FL framework that expends the privacy budget non-uniformly across both time and clients. Our framework enables each client to save privacy budget in early rounds so as to be able to spend more in later rounds when additional accuracy is beneficial in learning more fine-grained features. We theoretically prove utility improvements in the case that clients with stricter privacy budgets spend budgets unevenly across rounds, compared to clients with more relaxed budgets, who have sufficient budgets to distribute their spend more evenly. Our practical experiments on standard benchmark datasets support our theoretical results and show that, in practice, our algorithms improve the privacy-utility trade-offs compared to baseline schemes.
```
@inproceedings{kiani2025differentially,
  title = {Differentially Private Federated Learning with Time-Adaptive Privacy Spending},
  author = {Kiani, Shahrzad and Kulkarni, Nupur and Dziedzic, Adam and Draper, Stark and Boenisch, Franziska},
  booktitle = {The Thirteenth International Conference on Learning Representations (ICLR)},
  year = {2025}
}
```
CDI: Copyrighted Data Identification in Diffusion Models
Jan Dubiński, Antoni Kowalczuk, Franziska Boenisch, Adam Dziedzic
In The IEEE CVF Computer Vision and Pattern Recognition Conference (CVPR) 2025
Paper

Diffusion Models (DMs) benefit from large and diverse datasets for their training. Since this data is often scraped from the Internet without permission from the data owners, this raises concerns about copyright and intellectual property protections. While (illicit) use of data is easily detected for training samples perfectly re-created by a DM at inference time, it is much harder for data owners to verify if their data was used for training when the outputs from the suspect DM are not close replicas. Conceptually, membership inference attacks (MIAs), which detect if a given data point was used during training, present themselves as a suitable tool to address this challenge. However, we demonstrate that existing MIAs are not strong enough to reliably determine the membership of individual images in large, state-of-the-art DMs. To overcome this limitation, we propose CDI, a framework for data owners to identify whether their dataset was used to train a given DM. CDI relies on dataset inference techniques, i.e., instead of using the membership signal from a single data point, CDI leverages the fact that most data owners, such as providers of stock photography, visual media companies, or even individual artists, own datasets with multiple publicly exposed data points which might all be included in the training of a given DM. By selectively aggregating signals from existing MIAs and using new handcrafted methods to extract features for these datasets, feeding them to a scoring model, and applying rigorous statistical testing, CDI allows data owners with as little as 70 data points to identify with a confidence of more than 99% whether their data was used to train a given DM. Thereby, CDI represents a valuable tool for data owners to claim illegitimate use of their copyrighted data.
```
@inproceedings{dubinski2024cdi,
  title = {CDI: Copyrighted Data Identification in Diffusion Models},
  author = {Dubiński, Jan and Kowalczuk, Antoni and Boenisch, Franziska and Dziedzic, Adam},
  booktitle = {The IEEE CVF Computer Vision and Pattern Recognition Conference (CVPR)},
  year = {2025}
}
```
Strong Membership Inference Attacks on Massive Datasets and (Moderately) Large Language Models
Jamie Hayes, Ilia Shumailov, Christopher A. Choquette-Choo, Matthew Jagielski, George Kaissis, Katherine Lee, Milad Nasr, Sahra Ghalebikesabi, Niloofar Mireshghallah, Meenatchi Sundaram Mutu Selva Annamalai, Igor Shilov, Matthieu Meeus, Yves-Alexandre Montjoye, Franziska Boenisch, Adam Dziedzic, A. Feder Cooper
In Preprint 2025
Paper

State-of-the-art membership inference attacks (MIAs) typically require training many reference models, making it difficult to scale these attacks to large pre-trained language models (LLMs). As a result, prior research has either relied on weaker attacks that avoid training reference models (e.g., fine-tuning attacks), or on stronger attacks applied to small-scale models and datasets. However, weaker attacks have been shown to be brittle - achieving close-to-arbitrary success - and insights from strong attacks in simplified settings do not translate to today’s LLMs. These challenges have prompted an important question: are the limitations observed in prior work due to attack design choices, or are MIAs fundamentally ineffective on LLMs? We address this question by scaling LiRA - one of the strongest MIAs - to GPT-2 architectures ranging from 10M to 1B parameters, training reference models on over 20B tokens from the C4 dataset. Our results advance the understanding of MIAs on LLMs in three key ways: (1) strong MIAs can succeed on pre-trained LLMs; (2) their effectiveness, however, remains limited (e.g., AUC<0.7) in practical settings; and, (3) the relationship between MIA success and related privacy metrics is not as straightforward as prior work has suggested.
```
@inproceedings{hayes2025strongMIALLMs,
  title = {Strong Membership Inference Attacks on Massive Datasets and (Moderately) Large Language Models},
  author = {Hayes, Jamie and Shumailov, Ilia and Choquette-Choo, Christopher A. and Jagielski, Matthew and Kaissis, George and Lee, Katherine and Nasr, Milad and Ghalebikesabi, Sahra and Mireshghallah, Niloofar and Annamalai, Meenatchi Sundaram Mutu Selva and Shilov, Igor and Meeus, Matthieu and de Montjoye, Yves-Alexandre and Boenisch, Franziska and Dziedzic, Adam and Cooper, A. Feder},
  booktitle = {Preprint},
  year = {2025},
  eprint = {2505.18773},
  archiveprefix = {arXiv},
  primaryclass = {cs.CR}
}
```
Privacy Attacks on Image AutoRegressive Models
Antoni Kowalczuk, Jan Dubiński, Franziska Boenisch, Adam Dziedzic
In Forty-Second International Conference on Machine Learning (ICML) 2025
Paper Code

We design new methods to assess the privacy leakage from the image autoregressive models and show that they provide better performance, however, also leak more private information than diffusion models.

Image AutoRegressive generation has emerged as a new powerful paradigm with image autoregressive models (IARs) matching state-of-the-art diffusion models (DMs) in image quality (FID: 1.48 vs. 1.58) while allowing for a higher generation speed. However, the privacy risks associated with IARs remain unexplored, raising concerns regarding their responsible deployment. To address this gap, we conduct a comprehensive privacy analysis of IARs, comparing their privacy risks to the ones of DMs as reference points. Concretely, we develop a novel membership inference attack (MIA) that achieves a remarkably high success rate in detecting training images (with a True Positive Rate at False Positive Rate = 1% of 86.38% vs. 6.38% for DMs with comparable attacks). We leverage our novel MIA to provide dataset inference (DI) for IARs, and show that it requires as few as 6 samples to detect dataset membership (compared to 200 for DI in DMs), confirming a higher information leakage in IARs. Finally, we are able to extract hundreds of training data points from an IAR (e.g., 698 from VAR-d30). Our results suggest a fundamental privacy-utility trade-off: while IARs excel in image generation quality and speed, they are empirically significantly more vulnerable to privacy attacks compared to DMs that achieve similar performance.
```
@inproceedings{kowalczuk2025privacyIARs,
  title = {Privacy Attacks on Image AutoRegressive Models},
  author = {Kowalczuk, Antoni and Dubiński, Jan and Boenisch, Franziska and Dziedzic, Adam},
  year = {2025},
  booktitle = {Forty-Second International Conference on Machine Learning (ICML)}
}
```
Unlocking Post-hoc Dataset Inference with Synthetic Data
Bihe Zhao, Pratyush Maini, Franziska Boenisch, Adam Dziedzic
In Forty-Second International Conference on Machine Learning (ICML) 2025
Paper Code

LLM dataset inference method without any i.i.d. held-out set by generating this set and performing post-hoc calibration.

The remarkable capabilities of Large Language Models (LLMs) can be mainly attributed to their massive training datasets, which are often scraped from the internet without respecting data owners’ intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private set—known to be absent from training—that closely matches the compromised dataset’s distribution. Such in-distribution, held-out data is rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required held-out set. Our approach tackles two key obstacles: (1) creating high-quality synthetic data that accurately reflects the original distribution, which we achieve via a data generator trained on a carefully designed suffix completion task, and (2) bridging likelihood gaps between real and synthetic data, which is realized through post-hoc calibration. Extensive experiments on diverse text datasets show that our synthetic held-out set enables DI to detect the training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method’s reliability for real-world litigations.
```
@inproceedings{zhao2025posthocDI,
  title = {Unlocking Post-hoc Dataset Inference with Synthetic Data},
  author = {Zhao, Bihe and Maini, Pratyush and Boenisch, Franziska and Dziedzic, Adam},
  year = {2025},
  booktitle = {Forty-Second International Conference on Machine Learning (ICML)}
}
```
Differentially Private Prototypes for Private Transfer Learning
Dariush Wahdany, Matthew Jagielski, Adam Dziedzic, Franziska Boenisch
In The 39th Annual AAAI Conference on Artificial Intelligence 2025
Paper

Machine learning (ML) models have been shown to leak private information from their training datasets. Differential Privacy (DP), typically implemented through the differential private stochastic gradient descent algorithm (DP-SGD), has become the standard solution to bound leakage from the models. Despite recent improvements, DP-SGD-based approaches for private learning still usually struggle in the high privacy (epsilon<1) and low data regimes, and when the private training datasets are imbalanced. To overcome these limitations, we propose Differentially Private Prototype Learning (DPPL) as a new paradigm for private transfer learning. DPPL leverages publicly pre-trained encoders to extract features from private data and generates DP prototypes that represent each private class in the embedding space and can be publicly released for inference. Since our DP prototypes can be obtained from only a few private training data points and without iterative noise addition, they offer high-utility predictions and strong privacy guarantees even under the notion of pure DP. We additionally show that privacy-utility trade-offs can be further improved when leveraging the public data beyond pre-training of the encoder: in particular, we can privately sample our DP prototypes from the publicly available data points used to train the encoder. Our experimental evaluation with four state-of-the-art encoders, four vision datasets, and under different data and imbalancedness regimes demonstrate DPPL’s high performance under strong privacy guarantees in challenging private learning setups.
```
@inproceedings{wahdany2024dppl,
  title = {Differentially Private Prototypes for Private Transfer Learning},
  author = {Wahdany, Dariush and Jagielski, Matthew and Dziedzic, Adam and Boenisch, Franziska},
  booktitle = {The 39th Annual AAAI Conference on Artificial Intelligence},
  year = {2025}
}
```

2024

Memorization in Self-Supervised Learning Improves Downstream Generalization
Wenhao Wang, Muhammad Ahmad Kaleem, Adam Dziedzic, Michael Backes, Nicolas Papernot, Franziska Boenisch
In The Twelfth International Conference on Learning Representations (ICLR) 2024
Paper Code

Self-supervised learning (SSL) has recently received significant attention due to its ability to train high-performance encoders purely on unlabeled data—often scraped from the internet. This data can still be sensitive and empirical evidence suggests that SSL encoders memorize private information of their training data and can disclose them at inference time. Since existing theoretical definitions of memorization from supervised learning rely on labels, they do not transfer to SSL. To address this gap, we propose SSLMem, a framework for defining memorization within SSL. Our definition compares the difference in alignment of representations for data points and their augmented views returned by both encoders that were trained on these data points and encoders that were not. Through comprehensive empirical analysis on diverse encoder architectures and datasets we highlight that even though SSL relies on large datasets and strong augmentations—both known in supervised learning as regularization techniques that reduce overfitting—still significant fractions of training data points experience high memorization. Through our empirical results, we show that this memorization is essential for encoders to achieve higher generalization performance on different downstream tasks.
```
@inproceedings{wang2024memorization,
  title = {Memorization in Self-Supervised Learning Improves Downstream Generalization},
  author = {Wang, Wenhao and Kaleem, Muhammad Ahmad and Dziedzic, Adam and Backes, Michael and Papernot, Nicolas and Boenisch, Franziska},
  booktitle = {The Twelfth International Conference on Learning Representations (ICLR)},
  year = {2024}
}
```
Decentralised, Collaborative, and Privacy-preserving Machine Learning for Multi-Hospital Data
Congyu Fang, Adam Dziedzic, Lin Zhang, Laura Oliva, Amol Verma, Fahad Razak, Nicolas Papernot, Bo Wang
In eBioMedicine 2024
Paper

Machine Learning (ML) has demonstrated its great potential on medical data analysis. Large datasets collected from diverse sources and settings are essential for ML models in healthcare to achieve better accuracy and generalizability. Sharing data across different healthcare institutions is challenging because of complex and varying privacy and regulatory requirements. Hence, it is hard but crucial to allow multiple parties to collaboratively train an ML model leveraging the private datasets available at each party without the need for direct sharing of those datasets or compromising the privacy of the datasets through collaboration. In this paper, we address this challenge by proposing Decentralized, Collaborative, and Privacy-preserving ML for Multi-Hospital Data (DeCaPH). It offers the following key benefits: (1) it allows different parties to collaboratively train an ML model without transferring their private datasets; (2) it safeguards patient privacy by limiting the potential privacy leakage arising from any contents shared across the parties during the training process; and (3) it facilitates the ML model training without relying on a centralized server. We demonstrate the generalizability and power of DeCaPH on three distinct tasks using real-world distributed medical datasets: patient mortality prediction using electronic health records, cell-type classification using single-cell human genomes, and pathology identification using chest radiology images. We demonstrate that the ML models trained with DeCaPH framework have an improved utility-privacy trade-off, showing it enables the models to have good performance while preserving the privacy of the training data points. In addition, the ML models trained with DeCaPH framework in general outperform those trained solely with the private datasets from individual parties, showing that DeCaPH enhances the model generalizability.
```
@inproceedings{fang2024collaborative,
  title = {Decentralised, Collaborative, and Privacy-preserving Machine Learning for Multi-Hospital Data},
  author = {Fang, Congyu and Dziedzic, Adam and Zhang, Lin and Oliva, Laura and Verma, Amol and Razak, Fahad and Papernot, Nicolas and Wang, Bo},
  booktitle = {eBioMedicine},
  year = {2024}
}
```
Open LLMs are Necessary for Current Private Adaptations and Outperform their Closed Alternatives
Vincent Hanke, Tom Blanchard, Franziska Boenisch, Iyiola Emmanuel Olatunji, Michael Backes, Adam Dziedzic
In Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS) 2024
Paper Poster Slides Video Code Blog Post

While open Large Language Models (LLMs) have made significant progress, they still fall short of matching the performance of their closed, proprietary counterparts, making the latter attractive even for the use on highly private data. Recently, various new methods have been proposed to adapt closed LLMs to private data without leaking private information to third parties and/or the LLM provider. In this work, we analyze the privacy protection and performance of the four most recent methods for private adaptation of closed LLMs. By examining their threat models and thoroughly comparing their performance under different privacy levels according to differential privacy (DP), various LLM architectures, and multiple datasets for classification and generation tasks, we find that: (1) all the methods leak query data, i.e., the (potentially sensitive) user data that is queried at inference time, to the LLM provider, (2) three out of four methods also leak large fractions of private training data to the LLM provider while the method that protects private data requires a local open LLM, (3) all the methods exhibit lower performance compared to three private gradient-based adaptation methods for local open LLMs, and (4) the private adaptation methods for closed LLMs incur higher monetary training and query costs than running the alternative methods on local open LLMs. This yields the conclusion that, to achieve truly privacy-preserving LLM adaptations that yield high performance and more privacy at lower costs, taking into account current methods and models, one should use open LLMs.
```
@inproceedings{hanke2024openLLMs,
  title = {Open LLMs are Necessary for Current Private Adaptations and Outperform their Closed Alternatives},
  author = {Hanke, Vincent and Blanchard, Tom and Boenisch, Franziska and Olatunji, Iyiola Emmanuel and Backes, Michael and Dziedzic, Adam},
  year = {2024},
  booktitle = {Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS)}
}
```
LLM Dataset Inference: Did you train on my dataset?
Pratyush Maini, Hengrui Jia, Nicolas Papernot, Adam Dziedzic
In Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS) 2024
Paper

The proliferation of large language models (LLMs) in the real world has come with a rise in copyright cases against companies for training their models on unlicensed data from the internet. Recent works have presented methods to identify if individual text sequences were members of the model’s training data, known as membership inference attacks (MIAs). We demonstrate that the apparent success of these MIAs is confounded by selecting non-members (text sequences not used for training) belonging to a different distribution from the members (e.g., temporally shifted recent Wikipedia articles compared with ones used to train the model). This distribution shift makes membership inference appear successful. However, most MIA methods perform no better than random guessing when discriminating between members and non-members from the same distribution (e.g., in this case, the same period of time). Even when MIAs work, we find that different MIAs succeed at inferring membership of samples from different distributions. Instead, we propose a new dataset inference method to accurately identify the datasets used to train large language models. This paradigm sits realistically in the modern-day copyright landscape, where authors claim that an LLM is trained over multiple documents (such as a book) written by them, rather than one particular paragraph. While dataset inference shares many of the challenges of membership inference, we solve it by selectively combining the MIAs that provide positive signal for a given distribution, and aggregating them to perform a statistical test on a given dataset. Our approach successfully distinguishes the train and test sets of different subsets of the Pile with statistically significant p-values < 0.1, without any false positives.
```
@inproceedings{maini2024LLMDatasetInference,
  title = {LLM Dataset Inference: Did you train on my dataset?},
  author = {Maini, Pratyush and Jia, Hengrui and Papernot, Nicolas and Dziedzic, Adam},
  year = {2024},
  booktitle = {Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS)}
}
```
Localizing Memorization in SSL Vision Encoders
Wenhao Wang, Adam Dziedzic, Michael Backes, Franziska Boenisch
In Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS) 2024
Paper

Recent work on studying memorization in self-supervised learning (SSL) suggests that even though SSL encoders are trained on millions of images, they still memorize individual data points. While effort has been put into characterizing the memorized data and linking encoder memorization to downstream utility, little is known about where the memorization happens inside SSL encoders. To close this gap, we propose two metrics for localizing memorization in SSL encoders on a per-layer (layermem) and per-unit basis (unitmem). Our localization methods are independent of the downstream task, do not require any label information, and can be performed in a forward pass. By localizing memorization in various encoder architectures (convolutional and transformer-based) trained on diverse datasets with contrastive and non-contrastive SSL frameworks, we find that (1) while SSL memorization increases with layer depth, highly memorizing units are distributed across the entire encoder, (2) a significant fraction of units in SSL encoders experiences surprisingly high memorization of individual data points, which is in contrast to models trained under supervision, (3) atypical (or outlier) data points cause much higher layer and unit memorization than standard data points, and (4) in vision transformers, most memorization happens in the fully-connected layers. Finally, we show that localizing memorization in SSL has the potential to improve fine-tuning and to inform pruning strategies.
```
@inproceedings{wang2024LocalizeMemorizationSSL,
  title = {Localizing Memorization in SSL Vision Encoders},
  author = {Wang, Wenhao and Dziedzic, Adam and Backes, Michael and Boenisch, Franziska},
  year = {2024},
  booktitle = {Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS)}
}
```
Finding NeMo: Localizing Neurons Responsible For Memorization in Diffusion Models
Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, Franziska Boenisch
In Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS) 2024
Paper

Diffusion models (DMs) produce very detailed and high-quality images. Their power results from extensive training on large amounts of data, usually scraped from the internet without proper attribution or consent from content creators. Unfortunately, this practice raises privacy and intellectual property concerns, as DMs can memorize and later reproduce their potentially sensitive or copyrighted training images at inference time. Prior efforts prevent this issue by either changing the input to the diffusion process, thereby preventing the DM from generating memorized samples during inference, or removing the memorized data from training altogether. While those are viable solutions when the DM is developed and deployed in a secure and constantly monitored environment, they hold the risk of adversaries circumventing the safeguards and are not effective when the DM itself is publicly released. To solve the problem, we introduce NeMo, the first method to localize memorization of individual data samples down to the level of neurons in DMs’ cross-attention layers. Through our experiments, we make the intriguing finding that in many cases, single neurons are responsible for memorizing particular training samples. By deactivating these memorization neurons, we can avoid the replication of training data at inference time, increase the diversity in the generated outputs, and mitigate the leakage of private and copyrighted data. In this way, our NeMo contributes to a more responsible deployment of DMs.
```
@inproceedings{hintersdorf2024MemorizationDiffusionModels,
  title = {Finding NeMo: Localizing Neurons Responsible For Memorization in Diffusion Models},
  author = {Hintersdorf, Dominik and Struppek, Lukas and Kersting, Kristian and Dziedzic, Adam and Boenisch, Franziska},
  year = {2024},
  booktitle = {Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS)}
}
```

2023

Individualized PATE: Differentially Private Machine Learning with Individual Privacy Guarantees
Franziska Boenisch, Christopher Mühl, Roy Rinberg, Jannis Ihrig, Adam Dziedzic
In Privacy Enhancing Technologies Symposium (PETS) 2023
Paper

Applying machine learning (ML) to sensitive domains requires privacy protection of the underlying training data through formal privacy frameworks, such as differential privacy (DP). Yet, usually, the privacy of the training data comes at the cost of the resulting ML models’ utility. One reason for this is that DP uses one uniform privacy budget epsilon for all training data points, which has to align with the strictest privacy requirement encountered among all data holders. In practice, different data holders have different privacy requirements and data points of data holders with lower requirements can contribute more information to the training process of the ML models. To account for this need, we propose two novel methods based on the Private Aggregation of Teacher Ensembles (PATE) framework to support the training of ML models with individualized privacy guarantees. We formally describe the methods, provide a theoretical analysis of their privacy bounds, and experimentally evaluate their effect on the final model’s utility using the MNIST, SVHN, and Adult income datasets. Our empirical results show that the individualized privacy methods yield ML models of higher accuracy than the non-individualized baseline. Thereby, we improve the privacy-utility trade-off in scenarios in which different data holders consent to contribute their sensitive data at different individual privacy levels.
```
@inproceedings{pate2023pets,
  author = {Boenisch, Franziska and Mühl, Christopher and Rinberg, Roy and Ihrig, Jannis and Dziedzic, Adam},
  title = {Individualized PATE: Differentially Private Machine Learning with Individual Privacy Guarantees},
  booktitle = {Privacy Enhancing Technologies Symposium (PETS)},
  year = {2023}
}
```
Private Multi-Winner Voting for Machine Learning
Adam Dziedzic, Christopher A Choquette-Choo, Natalie Dullerud, Vinith Menon Suriyakumar, Ali Shahin Shamsabadi, Muhammad Ahmad Kaleem, Somesh Jha, Nicolas Papernot, Xiao Wang
In Privacy Enhancing Technologies Symposium (PETS) 2023
Paper Slides Video

Private multi-winner voting is the task of revealing k-hot binary vectors satisfying a bounded differential privacy (DP) guarantee. This task has been understudied in machine learning literature despite its prevalence in many domains such as healthcare. We propose three new DP multi-winner mechanisms: Binary, τ, and Powerset voting. Binary voting operates independently per label through composition. τ voting bounds votes optimally in their ℓ2 norm for tight data-independent guarantees. Powerset voting operates over the entire binary vector by viewing the possible outcomes as a power set. Our theoretical and empirical analysis shows that Binary voting can be a competitive mechanism on many tasks unless there are strong correlations between labels, in which case Powerset voting outperforms it. We use our mechanisms to enable privacy-preserving multi-label learning in the central setting by extending the canonical single-label technique: PATE. We find that our techniques outperform current state-of-the-art approaches on large, real-world healthcare data and standard multi-label benchmarks. We further enable multi-label confidential and private collaborative (CaPC) learning and show that model performance can be significantly improved in the multi-site setting.
```
@inproceedings{multilabel2023pets,
  title = {Private Multi-Winner Voting for Machine Learning},
  author = {Dziedzic, Adam and Choquette-Choo, Christopher A and Dullerud, Natalie and Suriyakumar, Vinith Menon and Shamsabadi, Ali Shahin and Kaleem, Muhammad Ahmad and Jha, Somesh and Papernot, Nicolas and Wang, Xiao},
  booktitle = {Privacy Enhancing Technologies Symposium (PETS)},
  year = {2023}
}
```
Bucks for Buckets (B4B): Active Defenses Against Stealing Encoders
Jan Dubiński, Stanisław Pawlak, Franziska Boenisch, Tomasz Trzcinski, Adam Dziedzic
In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023
Paper Poster Slides Video Code Blog Post

Machine Learning as a Service (MLaaS) APIs provide ready-to-use and high-utility encoders that generate vector representations for given inputs. Since these encoders are very costly to train, they become lucrative targets for model stealing attacks during which an adversary leverages query access to the API to replicate the encoder locally at a fraction of the original training costs. We propose Bucks for Buckets (B4B), the first active defense that prevents stealing while the attack is happening without degrading representation quality for legitimate API users. Our defense relies on the observation that the representations returned to adversaries who try to steal the encoder’s functionality cover a significantly larger fraction of the embedding space than representations of legitimate users who utilize the encoder to solve a particular downstream task. B4B leverages this to adaptively adjust the utility of the returned representations according to a user’s coverage of the embedding space. To prevent adaptive adversaries from eluding our defense by simply creating multiple user accounts (sybils), B4B also individually transforms each user’s representations. This prevents the adversary from directly aggregating representations over multiple accounts to create their stolen encoder copy. Our active defense opens a new path towards securely sharing and democratizing encoders over public APIs.
```
@inproceedings{dubinski2023bucks,
  title = {Bucks for Buckets (B4B): Active Defenses Against Stealing Encoders},
  author = {Dubiński, Jan and Pawlak, Stanisław and Boenisch, Franziska and Trzcinski, Tomasz and Dziedzic, Adam},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
  year = {2023}
}
```
Robust and Actively Secure Serverless Collaborative Learning
Nicholas Franzese, Adam Dziedzic, Christopher A. Choquette-Choo, Mark R. Thomas, Muhammad Ahmad Kaleem, Stephan Rabanser, Congyu Fang, Somesh Jha, Nicolas Papernot, Xiao Wang
In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023
Paper Poster

Collaborative machine learning (ML) is widely used to enable institutions to learn better models from distributed data. While collaborative approaches to learning intuitively protect user data, they remain vulnerable to either the server, the clients, or both, deviating from the protocol. Indeed, because the protocol is asymmetric, a malicious server can abuse its power to reconstruct client data points. Conversely, malicious clients can corrupt learning with malicious updates. Thus, both clients and servers require a guarantee when the other cannot be trusted to fully cooperate. In this work, we propose a peer-to-peer (P2P) learning scheme that is secure against malicious servers and robust to malicious clients. Our core contribution is a generic framework that transforms any (compatible) algorithm for robust aggregation of model updates to the setting where servers and clients can act maliciously. Finally, we demonstrate the computational efficiency of our approach even with 1-million parameter models trained by 100s of peers on standard datasets.
```
@inproceedings{franzeses2023p2pml,
  title = {Robust and Actively Secure Serverless Collaborative Learning},
  author = {Franzese, Nicholas and Dziedzic, Adam and Choquette-Choo, Christopher A. and Thomas, Mark R. and Kaleem, Muhammad Ahmad and Rabanser, Stephan and Fang, Congyu and Jha, Somesh and Papernot, Nicolas and Wang, Xiao},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
  year = {2023}
}
```
Flocks of Stochastic Parrots: Differentially Private Prompt Learning for Large Language Models
Haonan Duan, Adam Dziedzic, Nicolas Papernot, Franziska Boenisch
In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023
Paper Slides Video Code

Large language models (LLMs) are excellent in-context learners. However, the sensitivity of data contained in prompts raises privacy concerns. Our work first shows that these concerns are valid: we instantiate a simple but highly effective membership inference attack against the data used to prompt LLMs. To address this vulnerability, one could forego prompting and resort to fine-tuning LLMs with known algorithms for private gradient descent. However, this comes at the expense of the practicality and efficiency offered by prompting. Therefore, we propose to privately learn to prompt. We first show that soft prompts can be obtained privately through gradient descent on downstream data. However, this is not the case for discrete prompts. Thus, we orchestrate a noisy vote among an ensemble of LLMs presented with different prompts, i.e., a flock of stochastic parrots. The vote privately transfers the flock’s knowledge into a single public prompt. We show that LLMs prompted with our private algorithms closely match the non-private baselines. For example, using GPT3 as the base model, we achieve a downstream accuracy of 92.7% on the sst2 dataset with strong differential privacy guarantees vs. 95.2% for the non-private baseline. Through our experiments, we also show that our prompt-based approach is easily deployed with existing commercial APIs.
```
@inproceedings{duan2023flocks,
  title = {Flocks of Stochastic Parrots: Differentially Private Prompt Learning for Large Language Models},
  author = {Duan, Haonan and Dziedzic, Adam and Papernot, Nicolas and Boenisch, Franziska},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
  year = {2023}
}
```
Have it your way: Individualized Privacy Assignment for DP-SGD
Franziska Boenisch, Christopher Mühl, Adam Dziedzic, Roy Rinberg, Nicolas Papernot
In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023
Paper

When training a machine learning model with differential privacy, one sets a privacy budget. This budget represents a maximal privacy violation that any user is willing to face by contributing their data to the training set. We argue that this approach is limited because different users may have different privacy expectations. Thus, setting a uniform privacy budget across all points may be overly conservative for some users or, conversely, not sufficiently protective for others. In this paper, we capture these preferences through individualized privacy budgets. To demonstrate their practicality, we introduce a variant of Differentially Private Stochastic Gradient Descent (DP-SGD) which supports such individualized budgets. DP-SGD is the canonical approach to training models with differential privacy. We modify its data sampling and gradient noising mechanisms to arrive at our approach, which we call Individualized DP-SGD (IDP-SGD). Because IDP-SGD provides privacy guarantees tailored to the preferences of individual users and their data points, we find it empirically improves privacy-utility trade-offs.
```
@inproceedings{boenisch2023idpsgd,
  title = {Have it your way: Individualized Privacy Assignment for DP-SGD},
  author = {Boenisch, Franziska and Mühl, Christopher and Dziedzic, Adam and Rinberg, Roy and Papernot, Nicolas},
  year = {2023},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
  eprint = {2303.17046},
  archiveprefix = {arXiv},
  primaryclass = {cs.LG}
}
```
On the privacy risk of in-context learning
Haonan Duan, Adam Dziedzic, Mohammad Yaghini, Nicolas Papernot, Franziska Boenisch
In The 61st Annual Meeting Of The Association For Computational Linguistics 2023
Paper

Large language models (LLMs) are excellent few-shot learners. They can perform a wide variety of tasks purely based on natural language prompts provided to them. These prompts contain data of a specific downstream task—often the private dataset of a party, e.g., a company that wants to leverage the LLM on their purposes. We show that deploying prompted models presents a significant privacy risk for the data used within the prompt by instantiating a highly effective membership inference attack. We also observe that the privacy risk of prompted models exceeds fine-tuned models at the same utility levels. After identifying the model’s sensitivity to their prompts—in form of a significantly higher prediction confidence on the prompted data—as a cause for the increased risk, we propose ensembling as a mitigation strategy. By aggregating over multiple different versions of a prompted model, membership inference risk can be decreased.
```
@inproceedings{duan2023privacyICL,
  title = {On the privacy risk of in-context learning},
  author = {Duan, Haonan and Dziedzic, Adam and Yaghini, Mohammad and Papernot, Nicolas and Boenisch, Franziska},
  booktitle = {The 61st Annual Meeting Of The Association For Computational Linguistics},
  year = {2023}
}
```

2022

Increasing the Cost of Model Extraction with Calibrated Proof of Work
Adam Dziedzic, Muhammad Ahmad Kaleem, Yu Shen Lu, Nicolas Papernot
In ICLR (International Conference on Learning Representations) [SPOTLIGTH] 2022
Paper Slides Video Code Blog Post

In model extraction attacks, adversaries can steal a machine learning model exposed via a public API by repeatedly querying it and adjusting their own model based on obtained predictions. To prevent model stealing, existing defenses focus on detecting malicious queries, truncating, or distorting outputs, thus necessarily introducing a tradeoff between robustness and model utility for legitimate users. Instead, we propose to impede model extraction by requiring users to complete a proof-of-work before they can read the model’s predictions. This deters attackers by greatly increasing (even up to 100x) the computational effort needed to leverage query access for model extraction. Since we calibrate the effort required to complete the proof-of-work to each query, this only introduces a slight overhead for regular users (up to 2x). To achieve this, our calibration applies tools from differential privacy to measure the information revealed by a query. Our method requires no modification of the victim model and can be applied by machine learning practitioners to guard their publicly exposed models against being easily stolen.
```
@inproceedings{pow2022iclr,
  title = {Increasing the Cost of Model Extraction with Calibrated Proof of Work},
  author = {Dziedzic, Adam and Kaleem, Muhammad Ahmad and Lu, Yu Shen and Papernot, Nicolas},
  booktitle = {ICLR (International Conference on Learning Representations) [SPOTLIGTH]},
  year = {2022}
}
```
On the Difficulty of Defending Self-Supervised Learning against Model Extraction
Adam Dziedzic, Nikita Dhawan, Muhammad Ahmad Kaleem, Jonas Guan, Nicolas Papernot
In ICML (International Conference on Machine Learning) 2022
Paper

Self-Supervised Learning (SSL) is an increasingly popular ML paradigm that trains models to transform complex inputs into representations without relying on explicit labels. These representations encode similarity structures that enable efficient learning of multiple downstream tasks. Recently, ML-as-a-Service providers have commenced offering trained SSL models over inference APIs, which transform user inputs into useful representations for a fee. However, the high cost involved to train these models and their exposure over APIs both make black-box extraction a realistic security threat. We thus explore model stealing attacks against SSL. Unlike traditional model extraction on classifiers that output labels, the victim models here output representations; these representations are of significantly higher dimensionality compared to the low-dimensional prediction scores output by classifiers. We construct several novel attacks and find that approaches that train directly on a victim’s stolen representations are query efficient and enable high accuracy for downstream models. We then show that existing defenses against model extraction are inadequate and not easily retrofitted to the specificities of SSL.
```
@inproceedings{sslextractions2022icml,
  title = {On the Difficulty of Defending Self-Supervised Learning against Model Extraction},
  author = {Dziedzic, Adam and Dhawan, Nikita and Kaleem, Muhammad Ahmad and Guan, Jonas and Papernot, Nicolas},
  booktitle = {ICML (International Conference on Machine Learning)},
  year = {2022}
}
```
Dataset Inference for Self-Supervised Models
Adam Dziedzic, Haonan Duan, Muhammad Ahmad Kaleem, Nikita Dhawan, Jonas Guan, Yannis Cattan, Franziska Boenisch, Nicolas Papernot
In NeurIPS (Neural Information Processing Systems) 2022
Paper Slides Video

Self-supervised models are increasingly prevalent in machine learning (ML) since they reduce the need for expensively labeled data. Because of their versatility in downstream applications, they are increasingly used as a service exposed via public APIs. At the same time, these encoder models are particularly vulnerable to model stealing attacks due to the high dimensionality of vector representations they output. Yet, encoders remain undefended: existing mitigation strategies for stealing attacks focus on supervised learning. We introduce a new dataset inference defense, which uses the private training set of the victim encoder model to attribute its ownership in the event of stealing. The intuition is that the log-likelihood of an encoder’s output representations is higher on the victim’s training data than on test data if it is stolen from the victim, but not if it is independently trained. We compute this log-likelihood using density estimation models. As part of our evaluation, we also propose measuring the fidelity of stolen encoders and quantifying the effectiveness of the theft detection without involving downstream tasks; instead, we leverage mutual information and distance measurements. Our extensive empirical results in the vision domain demonstrate that dataset inference is a promising direction for defending self-supervised models against model stealing.
```
@inproceedings{datasetinference2022neurips,
  title = {Dataset Inference for Self-Supervised Models},
  author = {Dziedzic, Adam and Duan, Haonan and Kaleem, Muhammad Ahmad and Dhawan, Nikita and Guan, Jonas and Cattan, Yannis and Boenisch, Franziska and Papernot, Nicolas},
  booktitle = {NeurIPS (Neural Information Processing Systems)},
  year = {2022}
}
```

2021

CaPC Learning: Confidential and Private Collaborative Learning
Christopher A. Choquette-Choo, Natalie Dullerud, Adam Dziedzic, Yunxiang Zhang, Somesh Jha, Nicolas Papernot, Xiao Wang
In ICLR (International Conference on Learning Representations) 2021
Paper Slides Video Code Blog Post

Machine learning benefits from large training datasets, which may not always be possible to collect by any single entity, especially when using privacy-sensitive data. In many contexts, such as healthcare and finance, separate parties may wish to collaborate and learn from each other’s data but are prevented from doing so due to privacy regulations. Some regulations prevent explicit sharing of data between parties by joining datasets in a central location (confidentiality). Others also limit implicit sharing of data, e.g., through model predictions (privacy). There is currently no method that enables machine learning in such a setting, where both confidentiality and privacy need to be preserved, to prevent both explicit and implicit sharing of data. Federated learning only provides confidentiality, not privacy, since gradients shared still contain private information. Differentially private learning assumes unreasonably large datasets. Furthermore, both of these learning paradigms produce a central model whose architecture was previously agreed upon by all parties rather than enabling collaborative learning where each party learns and improves their own local model. We introduce Confidential and Private Collaborative (CaPC) learning, the first method provably achieving both confidentiality and privacy in a collaborative setting. We leverage secure multi-party computation (MPC), homomorphic encryption (HE), and other techniques in combination with privately aggregated teacher models. We demonstrate how CaPC allows participants to collaborate without having to explicitly join their training sets or train a central model. Each party is able to improve the accuracy and fairness of their model, even in settings where each party has a model that performs well on their own dataset or when datasets are not IID and model architectures are heterogeneous across parties.
```
@inproceedings{capc2021iclr,
  title = {CaPC Learning: Confidential and Private Collaborative Learning},
  author = {Choquette-Choo, Christopher A. and Dullerud, Natalie and Dziedzic, Adam and Zhang, Yunxiang and Jha, Somesh and Papernot, Nicolas and Wang, Xiao},
  booktitle = {ICLR (International Conference on Learning Representations)},
  year = {2021}
}
```
Preoperative paraspinal neck muscle characteristics predict early onset adjacent segment degeneration in anterior cervical fusion patients: A machine-learning modeling analysis
Arnold Y. L. Wong, Garrett Harada, Remy Lee, Sapan D. Gandhi, Adam Dziedzic, Alejandro Espinoza-Orias, Mohamad Parnianpour, Philip K. Louie, Bryce Basques, Howard S. An, Dino Samartzis
Journal of Orthopaedic Research 2021
Paper

Abstract Early onset adjacent segment degeneration (ASD) can be found within six months after anterior cervical discectomy and fusion (ACDF). Deficits in deep paraspinal neck muscles may be related to early onset ASD. This study aimed to determine whether the morphometry of preoperative deep neck muscles (multifidus and semispinalis cervicis) predicted early onset ASD in patients with ACDF. Thirty-two cases of early onset ASD after a two-level ACDF and 30 matched non-ASD cases were identified from a large-scale cohort. The preoperative total cross-sectional area (CSA) of bilateral deep neck muscles and the lean muscle CSAs from C3 to C7 levels were measured manually on T2-weighted magnetic resonance imaging. Paraspinal muscle CSA asymmetry at each level was calculated. A support vector machine (SVM) algorithm was used to identify demographic, radiographic, and/or muscle parameters that predicted proximal/distal ASD development. No significant between-group differences in demographic or preoperative radiographic data were noted (mean age: 52.4 ± 10.9 years). ACDFs comprised C3 to C5 (n = 9), C4 to C6 (n = 20), and C5 to C7 (n = 32) cases. Eighteen, eight, and six patients had proximal, distal, or both ASD, respectively. The SVM model achieved high accuracy (96.7%) and an area under the curve (AUC = 0.97) for predicting early onset ASD. Asymmetry of fat at C5 (coefficient: 0.06), and standardized measures of C7 lean (coefficient: 0.05) and total CSA measures (coefficient: 0.05) were the strongest predictors of early onset ASD. This is the first study to show that preoperative deep neck muscle CSA, composition, and asymmetry at C5 to C7 independently predicted postoperative early onset ASD in patients with ACDF. Paraspinal muscle assessments are recommended to identify high-risk patients for personalized intervention.
```
@article{wong2021ML,
  author = {Wong, Arnold Y. L. and Harada, Garrett and Lee, Remy and Gandhi, Sapan D. and Dziedzic, Adam and Espinoza-Orias, Alejandro and Parnianpour, Mohamad and Louie, Philip K. and Basques, Bryce and An, Howard S. and Samartzis, Dino},
  title = {Preoperative paraspinal neck muscle characteristics predict early onset adjacent segment degeneration in anterior cervical fusion patients: A machine-learning modeling analysis},
  journal = {Journal of Orthopaedic Research},
  volume = {39},
  number = {8},
  pages = {1732-1744},
  keywords = {adjacent segment, cervical, degeneration, disc, disease, muscles, paraspinal, spine},
  doi = {https://doi.org/10.1002/jor.24829},
  eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/jor.24829},
  year = {2021}
}
```
On the Exploitability of Audio Machine Learning Pipelines to Surreptitious Adversarial Examples
Adelin Travers, Lorna Licollari, Guanghan Wang, Varun Chandrasekaran, Adam Dziedzic, David Lie, Nicolas Papernot
2021 Preprint
Paper

Machine learning (ML) models are known to be vulnerable to adversarial examples. Applications of ML to voice biometrics authentication are no exception. Yet, the implications of audio adversarial examples on these real-world systems remain poorly understood given that most research targets limited defenders who can only listen to the audio samples. Conflating detectability of an attack with human perceptibility, research has focused on methods that aim to produce imperceptible adversarial examples which humans cannot distinguish from the corresponding benign samples. We argue that this perspective is coarse for two reasons: 1. Imperceptibility is impossible to verify; it would require an experimental process that encompasses variations in listener training, equipment, volume, ear sensitivity, types of background noise etc, and 2. It disregards pipeline-based detection clues that realistic defenders leverage. This results in adversarial examples that are ineffective in the presence of knowledgeable defenders. Thus, an adversary only needs an audio sample to be plausible to a human. We thus introduce surreptitious adversarial examples, a new class of attacks that evades both human and pipeline controls. In the white-box setting, we instantiate this class with a joint, multi-stage optimization attack. Using an Amazon Mechanical Turk user study, we show that this attack produces audio samples that are more surreptitious than previous attacks that aim solely for imperceptibility. Lastly we show that surreptitious adversarial examples are challenging to develop in the black-box setting.
```
@misc{travers2021exploitability,
  title = {On the Exploitability of Audio Machine Learning Pipelines to Surreptitious Adversarial Examples},
  author = {Travers, Adelin and Licollari, Lorna and Wang, Guanghan and Chandrasekaran, Varun and Dziedzic, Adam and Lie, David and Papernot, Nicolas},
  year = {2021},
  eprint = {2108.02010},
  archiveprefix = {arXiv},
  primaryclass = {cs.SD},
  journal = {preprint arXiv:2108.02010}
}
```
When the Curious Abandon Honesty: Federated Learning Is Not Private
Franziska Boenisch, Adam Dziedzic, Roei Schuster, Ali Shahin Shamsabadi, Ilia Shumailov, Nicolas Papernot
2021 Preprint
Paper Paper

In federated learning (FL), data does not leave personal devices when they are jointly training a machine learning model. Instead, these devices share gradients with a central party (e.g., a company). Because data never "leaves" personal devices, FL is presented as privacy-preserving. Yet, recently it was shown that this protection is but a thin facade, as even a passive attacker observing gradients can reconstruct data of individual users. In this paper, we argue that prior work still largely underestimates the vulnerability of FL. This is because prior efforts exclusively consider passive attackers that are honest-but-curious. Instead, we introduce an active and dishonest attacker acting as the central party, who is able to modify the shared model’s weights before users compute model gradients. We call the modified weights "trap weights". Our active attacker is able to recover user data perfectly and at near zero costs: the attack requires no complex optimization objectives. Instead, it exploits inherent data leakage from model gradients and amplifies this effect by maliciously altering the weights of the shared model. These specificities enable our attack to scale to models trained with large mini-batches of data. Where attackers from prior work require hours to recover a single data point, our method needs milliseconds to capture the full mini-batch of data from both fully-connected and convolutional deep neural networks. Finally, we consider mitigations. We observe that current implementations of differential privacy (DP) in FL are flawed, as they explicitly trust the central party with the crucial task of adding DP noise, and thus provide no protection against a malicious central party. We also consider other defenses and explain why they are similarly inadequate. A significant redesign of FL is required for it to provide any meaningful form of data privacy to users.
```
@misc{boenisch2021curious,
  title = {When the Curious Abandon Honesty: Federated Learning Is Not Private},
  author = {Boenisch, Franziska and Dziedzic, Adam and Schuster, Roei and Shamsabadi, Ali Shahin and Shumailov, Ilia and Papernot, Nicolas},
  year = {2021},
  eprint = {2112.02918},
  archiveprefix = {arXiv},
  primaryclass = {cs.LG},
  journal = {preprint arXiv:2112.02918}
}
```
Private AI Collaborative Research Institute: Vision, Challenges, and Opportunities
Ahmad-Reza Sadeghi, Ferdinand Brasser, Markus Miettinen, Thien Duc Nguyen, Thomas Given-Wilson, Axel Legay, Murali Annaaram, Salman Avestimeh, Alexandra Dmitrienko, Farinaz Koushanfar, Buse Gul Atli, Florian Kerschbaum, Lachlan J. Gunn, N. Asokan, Matthias Schunter, Rosario Cammarota, Adam Dziedzic, Nicolas Papernot, Virginia Smith, Reza Shokri
2021
Paper

This document outlines the research vision of the collaborative research center for Privacy-preserving Machine Learning. While federated machine learning starts to be deployed, its security and privacy implications are not well understood today. Our goal is to conduct research enabling the future of decentralized machine learning: Underpinning Federated ML with robust privacy guarantees and efficient algoritms to achieve those guarantees. Exploring knowledge transfer and collaborative ML beyond Federated machine learning that suffers from a central controller as its root of trust. Exploring Graph Neural Networks and their privacy implications. Ensuring robustness against malicious participants that may steal models or may try to poison maching learning models during training. This research will then be validated in case studies and deployed in open source frameworks to allow further experimentation and deployment on a wider scale.
```
@misc{IntelPrivateAIVision2021,
  title = {Private AI Collaborative Research Institute: Vision, Challenges, and Opportunities},
  author = {Sadeghi, Ahmad-Reza and Brasser, Ferdinand and Miettinen, Markus and Nguyen, Thien Duc and Given-Wilson, Thomas and Legay, Axel and Annaaram, Murali and Avestimeh, Salman and Dmitrienko, Alexandra and Koushanfar, Farinaz and Atli, Buse Gul and Kerschbaum, Florian and Gunn, Lachlan J. and Asokan, N. and Schunter, Matthias and Cammarota, Rosario and Dziedzic, Adam and Papernot, Nicolas and Smith, Virginia and Shokri, Reza},
  year = {2021}
}
```
Private Multi-Winner Voting For Machine Learning
Adam Dziedzic, Christopher A Choquette-Choo, Natalie Dullerud, Vinith Menon Suriyakumar, Ali Shahin Shamsabadi, Muhammad Ahmad Kaleem, Somesh Jha, Nicolas Papernot, Xiao Wang
2021
Paper

Private multi-winner voting is the task of revealing k-hot binary vectors that satisfy a bounded differential privacy guarantee. This task has been understudied in the machine learning literature despite its prevalence in many domains such as healthcare. We propose three new privacy-preserving multi-label mechanisms: Binary, and Powerset voting. Binary voting operates independently per label through composition. voting bounds votes optimally in their norm. Powerset voting operates over the entire binary vector by viewing the possible outcomes as a power set. We theoretically analyze tradeoffs showing that Powerset voting requires strong correlations between labels to outperform Binary voting. We use these mechanisms to enable privacy-preserving multi-label learning by extending the canonical single-label technique: PATE. We empirically compare our techniques with DPSGD on large real-world healthcare data and standard multi-label benchmarks. We find that our techniques outperform all others in the centralized setting. We enable multi-label CaPC and show that our mechanisms can be used to collaboratively improve models in a multi-site (distributed) setting.
```
@misc{PrivateMultiWinnerVoting2021,
  title = {Private Multi-Winner Voting For Machine Learning},
  author = {Dziedzic, Adam and Choquette-Choo, Christopher A and Dullerud, Natalie and Suriyakumar, Vinith Menon and Shamsabadi, Ali Shahin and Kaleem, Muhammad Ahmad and Jha, Somesh and Papernot, Nicolas and Wang, Xiao},
  year = {2021}
}
```

2020

Pretrained Transformers Improve Out-of-Distribution Robustness
Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, Dawn Song
In ACL (Association for Computational Linguistics) 2020
Paper Slides Video Code

Although pretrained Transformers such as BERT achieve high accuracy on in-distribution examples, do they generalize to new distributions? We systematically measure out-of-distribution (OOD) generalization for seven NLP datasets by constructing a new robustness benchmark with realistic distribution shifts. We measure the generalization of previous models including bag-of-words models, ConvNets, and LSTMs, and we show that pretrained Transformers’ performance declines are substantially smaller. Pretrained transformers are also more effective at detecting anomalous or OOD examples, while many previous models are frequently worse than chance. We examine which factors affect robustness, finding that larger models are not necessarily more robust, distillation can be harmful, and more diverse pretraining data can enhance robustness. Finally, we show where future work can improve OOD robustness.
```
@inproceedings{hendrycks-etal-2020-pretrained,
  title = {Pretrained Transformers Improve Out-of-Distribution Robustness},
  author = {Hendrycks, Dan and Liu, Xiaoyuan and Wallace, Eric and Dziedzic, Adam and Krishnan, Rishabh and Song, Dawn},
  booktitle = { ACL (Association for Computational Linguistics)},
  year = {2020},
  address = {Online},
  publisher = {ACL (Association for Computational Linguistics)},
  doi = {10.18653/v1/2020.acl-main.244},
  pages = {2744--2751}
}
```
Machine Learning based detection of multiple Wi-Fi BSSs for LTE-U CSAT
Vanlin Sathya, Adam Dziedzic, Monisha Ghosh, Sanjay Krishnan
In ICNC (International Conference on Computing, Networking and Communications) 2020
Paper

According to the LTE-U Forum specification, a LTE-U base-station (BS) reduces its duty cycle from 50% to 33% when it senses an increase in the number of co-channel Wi-Fi basic service sets (BSSs) from one to two. The detection of the number of Wi-Fi BSSs that are operating on the channel in real-time, without decoding the Wi-Fi packets, still remains a challenge. In this paper, we present a novel machine learning (ML) approach that solves the problem by using energy values observed during LTE-U OFF duration. Observing the energy values (at LTE-U BS OFF time) is a much simpler operation than decoding the entire Wi-Fi packets. In this work, we implement and validate the proposed ML based approach in real-time experiments, and demonstrate that there are two distinct patterns between one and two Wi-Fi APs. This approach delivers an accuracy close to 100% compared to auto-correlation (AC) and energy detection (ED) approaches.
```
@inproceedings{sathya2020machine,
  title = {Machine Learning based detection of multiple Wi-Fi BSSs for LTE-U CSAT},
  author = {Sathya, Vanlin and Dziedzic, Adam and Ghosh, Monisha and Krishnan, Sanjay},
  booktitle = {ICNC (International Conference on Computing, Networking and Communications)},
  year = {2020},
  organization = {IEEE}
}
```
An Empirical Evaluation of Perturbation-based Defenses
Adam Dziedzic, Sanjay Krishnan
preprint arXiv:2002.03080 2020 Preprint
Paper

Recent work has extensively shown that randomized perturbations of a neural network can improve its robustness to adversarial attacks. The literature is, however, lacking a detailed compare-and-contrast of the latest proposals to understand what classes of perturbations work, when they work, and why they work. We contribute a detailed experimental evaluation that elucidates these questions and benchmarks perturbation defenses in a consistent way. In particular, we show five main results: (1) all input perturbation defenses, whether random or deterministic, are essentially equivalent in their efficacy, (2) such defenses offer almost no robustness to adaptive attacks unless these perturbations are observed during training, (3) a tuned sequence of noise layers across a network provides the best empirical robustness, (4) attacks transfer between perturbation defenses so the attackers need not know the specific type of defense only that it involves perturbations, and (5) adversarial examples very close to original images show an elevated sensitivity to perturbation in a first-order analysis. Based on these insights, we demonstrate a new robust model built on noise injection and adversarial training that achieves state-of-the-art robustness.
```
@article{dziedzic2020empirical,
  title = {An Empirical Evaluation of Perturbation-based Defenses},
  author = {Dziedzic, Adam and Krishnan, Sanjay},
  journal = {preprint arXiv:2002.03080},
  year = {2020}
}
```
Machine Learning enabled Spectrum Sharing in Dense LTE-U/Wi-Fi Coexistence Scenarios
Adam Dziedzic, Vanlin Sathya, Muhammad Rochman, Monisha Ghosh, Sanjay Krishnan
OJVT (IEEE Open Journal of Vehicular Technology) 2020

The application of Machine Learning (ML) techniques to complex engineering problems has proved to be an attractive and efficient solution. ML has been successfully applied to several practical tasks like image recognition, automating industrial operations, etc. The promise of ML techniques in solving non-linear problems influenced this work which aims to apply known ML techniques and develop new ones for wireless spectrum sharing between Wi-Fi and LTE in the unlicensed spectrum. In this work, we focus on the LTE-Unlicensed (LTE-U) specification developed by the LTE-U Forum, which uses the duty-cycle approach for fair coexistence. The specification suggests reducing the duty cycle at the LTE-U base-station (BS) when the number of co-channel Wi-Fi basic service sets (BSSs) increases from one to two or more. However, without decoding the Wi-Fi packets, detecting the number of Wi-Fi BSSs operating on the channel in real-time is a challenging problem. In this work, we demonstrate a novel ML-based approach which solves this problem by using energy values observed during the LTE-U OFF duration. It is relatively straightforward to observe only the energy values during the LTE-U BS OFF time compared to decoding the entire Wi-Fi packet, which would require a full Wi-Fi receiver at the LTE-U base-station. We implement and validate the proposed ML-based approach by real-time experiments and demonstrate that there exist distinct patterns between the energy distributions between one and many Wi-Fi AP transmissions. The proposed ML-based approach results in a higher accuracy (close to 99% in all cases) as compared to the existing auto-correlation (AC) and energy detection (ED) approaches.
```
@article{dziedzic2020machine,
  title = {Machine Learning enabled Spectrum Sharing in Dense LTE-U/Wi-Fi Coexistence Scenarios},
  author = {Dziedzic, Adam and Sathya, Vanlin and Rochman, Muhammad and Ghosh, Monisha and Krishnan, Sanjay},
  journal = {OJVT (IEEE Open Journal of Vehicular Technology)},
  year = {2020},
  publisher = {IEEE}
}
```

Input and Model Compression for Adaptive and Robust Neural Networks
Adam Dziedzic
2020 Thesis

Paper

@article{dziedzic2020input,
  title = {Input and Model Compression for Adaptive and Robust Neural Networks},
  author = {Dziedzic, Adam},
  year = {2020},
  publisher = {The University of Chicago}
}

2019

Band-limited Training and Inference for Convolutional Neural Networks
Adam Dziedzic, Ioannis Paparizzos, Sanjay Krishnan, Aaron Elmore, Michael Franklin
In ICML (International Conference on Machine Learning) 2019
Paper Slides Video

The convolutional layers are core building blocks of neural network architectures. In general, a convolutional filter applies to the entire frequency spectrum of the input data. We explore artificially constraining the frequency spectra of these filters and data, called band-limiting, during training. The frequency domain constraints apply to both the feed-forward and back-propagation steps. Experimentally, we observe that Convolutional Neural Networks (CNNs) are resilient to this compression scheme and results suggest that CNNs learn to leverage lower-frequency components. In particular, we found: (1) band-limited training can effectively control the resource usage (GPU and memory); (2) models trained with band-limited layers retain high prediction accuracy; and (3) requires no modification to existing training algorithms or neural network architectures to use unlike other compression schemes.
```
@inproceedings{dziedzic2019band,
  title = {Band-limited Training and Inference for Convolutional Neural Networks},
  author = {Dziedzic, Adam and Paparizzos, Ioannis and Krishnan, Sanjay and Elmore, Aaron and Franklin, Michael},
  booktitle = {ICML (International Conference on Machine Learning)},
  year = {2019}
}
```
Artificial intelligence in resource-constrained and shared environments
Sanjay Krishnan, Aaron J Elmore, Michael Franklin, John Paparrizos, Zechao Shang, Adam Dziedzic, Rui Liu
ACM SIGOPS Operating Systems Review 2019
Paper

The computational demands of modern AI techniques are immense, and as the number of practical applications grows, there will be an increasing burden on shared computing infrastructure. We envision a forthcoming era of "AI Systems" research where reducing resource consumption, reasoning about transient resource availability, trading off resource consumption for accuracy, and managing contention on specialized hardware will become the community’s main research focus. This paper overviews the history of AI systems research, a vision for the future, and the open challenges ahead.
```
@article{krishnan2019artificial,
  title = {Artificial intelligence in resource-constrained and shared environments},
  author = {Krishnan, Sanjay and Elmore, Aaron J and Franklin, Michael and Paparrizos, John and Shang, Zechao and Dziedzic, Adam and Liu, Rui},
  journal = {ACM SIGOPS Operating Systems Review},
  volume = {53},
  number = {1},
  pages = {1--6},
  year = {2019},
  publisher = {ACM New York, NY, USA}
}
```

2018

Columnstore and B+ Tree - Are Hybrid Physical Designs Important?
Adam Dziedzic, Jingjing Wang, Sudipto Das, Bolin Ding, Vivek R. Narasayya, Manoj Syamala
In SIGMOD (ACM Special Interest Group on Management of Data) 2018
Paper Slides

Commercial DBMSs, such as Microsoft SQL Server, cater to diverse workloads including transaction processing, decision support, and operational analytics. They also support variety in physical design structures such as B+ tree and columnstore. The benefits of B+ tree for OLTP workloads and columnstore for decision support workloads are well-understood. However, the importance of hybrid physical designs, consisting of both columnstore and B+ tree indexes on the same database, is not well-studied — a focus of this paper. We first quantify the trade-offs using carefully-crafted micro-benchmarks. This micro-benchmarking indicates that hybrid physical designs can result in orders of magnitude better performance depending on the workload. For complex real-world applications, choosing an appropriate combination of columnstore and B+ tree indexes for a database workload is challenging. We extend the Database Engine Tuning Advisor for Microsoft SQL Server to recommend a suitable combination of B+ tree and columnstore indexes for a given workload. Through extensive experiments using industry-standard benchmarks and several real-world customer workloads, we quantify how a physical design tool capable of recommending hybrid physical designs can result in orders of magnitude better execution costs compared to approaches that rely either on columnstore-only or B+ tree-only designs.
```
@inproceedings{dziedzic2018index,
  title = {Columnstore and B+ Tree - Are Hybrid Physical Designs Important?},
  author = {Dziedzic, Adam and Wang, Jingjing and Das, Sudipto and Ding, Bolin and Narasayya, Vivek R. and Syamala, Manoj},
  booktitle = {SIGMOD (ACM Special Interest Group on Management of Data)},
  year = {2018}
}
```
Deeplens: Towards a visual data management system
Sanjay Krishnan, Adam Dziedzic, Aaron J Elmore
CIDR (Conference on Innovative Data Systems Research) 2018
Paper

Advances in deep learning have greatly widened the scope of automatic computer vision algorithms and enable users to ask questions directly about the content in images and video. This paper explores the necessary steps towards a future Visual Data Management System (VDMS), where the predictions of such deep learning models are stored, managed, queried, and indexed. We propose a query and data model that disentangles the neural network models used, the query workload, and the data source semantics from the query processing layer. Our system, DeepLens, is based on dataflow query processing systems and this research prototype presents initial experiments to elicit important open research questions in visual analytics systems. One of our main conclusions is that any future "declarative" VDMS will have to revisit query optimization and automated physical design from a unified perspective of performance and accuracy tradeoffs. Physical design and query optimization choices can not only change performance by orders of magnitude, they can potentially affect the accuracy of results.
```
@article{krishnan2018deeplens,
  title = {Deeplens: Towards a visual data management system},
  author = {Krishnan, Sanjay and Dziedzic, Adam and Elmore, Aaron J},
  journal = {CIDR (Conference on Innovative Data Systems Research)},
  year = {2018}
}
```

2017

Demonstrating the BigDAWG Polystore System for Ocean Metagenomics Analysis.
Tim Mattson, Vijay Gadepally, Zuohao She, Adam Dziedzic, Jeff Parkhurst
In CIDR (Conference on Innovative Data Systems Research) 2017
Paper

In most Big Data applications, the data is heterogeneous. As we have been arguing in a series of papers, storage engines should be well suited to the data they hold. Therefore, a system supporting Big Data applications should be able to expose multiple storage engines through a single interface. We call such systems, polystore systems. Our reference implementation of the polystore concept is called BigDAWG (short for the Big Data Analytics Working Group). In this demonstration, we will show the BigDAWG system and a number of polystore applications built to help ocean metage-nomics researchers handle their heterogenous Big Data.
```
@inproceedings{mattson2017demonstrating,
  title = {Demonstrating the BigDAWG Polystore System for Ocean Metagenomics Analysis.},
  author = {Mattson, Tim and Gadepally, Vijay and She, Zuohao and Dziedzic, Adam and Parkhurst, Jeff},
  booktitle = {CIDR (Conference on Innovative Data Systems Research)},
  year = {2017}
}
```

Bigdawg polystore release and demonstration
Kyle OBrien, Vijay Gadepally, Jennie Duggan, Adam Dziedzic, Aaron Elmore, Jeremy Kepner, Samuel Madden, Tim Mattson, Zuohao She, Michael Stonebraker
preprint arXiv:1701.05799 2017 Preprint

Paper

@article{obrien2017bigdawg,
  title = {Bigdawg polystore release and demonstration},
  author = {OBrien, Kyle and Gadepally, Vijay and Duggan, Jennie and Dziedzic, Adam and Elmore, Aaron and Kepner, Jeremy and Madden, Samuel and Mattson, Tim and She, Zuohao and Stonebraker, Michael},
  journal = {preprint arXiv:1701.05799},
  year = {2017}
}

Version 0.1 of the bigdawg polystore system
Vijay Gadepally, Kyle OBrien, Adam Dziedzic, Aaron Elmore, Jeremy Kepner, Samuel Madden, Tim Mattson, Jennie Rogers, Zuohao She, Michael Stonebraker
preprint arXiv:1707.00721 2017 Preprint

Paper

@article{gadepally2017version,
  title = {Version 0.1 of the bigdawg polystore system},
  author = {Gadepally, Vijay and OBrien, Kyle and Dziedzic, Adam and Elmore, Aaron and Kepner, Jeremy and Madden, Samuel and Mattson, Tim and Rogers, Jennie and She, Zuohao and Stonebraker, Michael},
  journal = {preprint arXiv:1707.00721},
  year = {2017}
}

BigDAWG version 0.1
Vijay Gadepally, Kyle O’Brien, Adam Dziedzic, Aaron Elmore, Jeremy Kepner, Samuel Madden, Tim Mattson, Jennie Rogers, Zuohao She, Michael Stonebraker
In HPEC (IEEE High Performance Extreme Computing) 2017

@inproceedings{gadepally2017bigdawg,
  title = {BigDAWG version 0.1},
  author = {Gadepally, Vijay and O'Brien, Kyle and Dziedzic, Adam and Elmore, Aaron and Kepner, Jeremy and Madden, Samuel and Mattson, Tim and Rogers, Jennie and She, Zuohao and Stonebraker, Michael},
  booktitle = {HPEC (IEEE High Performance Extreme Computing)},
  pages = {1--7},
  year = {2017},
  organization = {IEEE}
}

Data Loading, Transformation and Migration for Database Management Systems
Adam Dziedzic
2017 Thesis

Paper

@article{dziedzic2017data,
  title = {Data Loading, Transformation and Migration for Database Management Systems},
  author = {Dziedzic, Adam},
  year = {2017},
  publisher = {The University of Chicago}
}

September 2017. BigDAWG Version 0.1
V Gadepally, K O’Brien, A Dziedzic, A Elmore, J Kepner, S Madden, T Mattson, J Rogers, Z She, M Stonebraker
HPEC (IEEE High Performance Extreme Computing) 2017

@article{gadepallyseptember,
  title = {September 2017. BigDAWG Version 0.1},
  author = {Gadepally, V and O'Brien, K and Dziedzic, A and Elmore, A and Kepner, J and Madden, S and Mattson, T and Rogers, J and She, Z and Stonebraker, M},
  journal = {HPEC (IEEE High Performance Extreme Computing)},
  year = {2017}
}

2016

DBMS Data Loading: An Analysis on Modern Hardware
Adam Dziedzic, Manos Karpathiotakis, Ioannis Alagiannis, Raja Appuswamy, Anastasia Ailamaki
In ADMS (Accelerating analytics and Data Management Systems) 2016
Paper Slides

Data loading has traditionally been considered a one-time deal - an offline process out of the critical path of query execution. The architecture of DBMS is aligned with this assumption. Nevertheless, the rate in which data is produced and gathered nowadays has nullified the one-off assumption, and has turned data loading into a major bottleneck of the data analysis pipeline. This paper analyzes the behavior of modern DBMSs in order to quantify their ability to fully exploit multicore processors and modern storage hardware during data loading. We examine multiple state-of-the-art DBMSs, a variety of hardware configurations, and a combination of synthetic and real-world datasets to identify bottlenecks in the data loading process and to provide guidelines on how to accelerate data loading. Our findings show that modern DBMSs are unable to saturate the available hardware resources. We therefore identify opportunities to accelerate data loading.
```
@inproceedings{dziedzic2016dbms,
  title = {DBMS Data Loading: An Analysis on Modern Hardware},
  author = {Dziedzic, Adam and Karpathiotakis, Manos and Alagiannis, Ioannis and Appuswamy, Raja and Ailamaki, Anastasia},
  booktitle = {ADMS (Accelerating analytics and Data Management Systems)},
  year = {2016}
}
```
Data Transformation and Migration in Polystores
Adam Dziedzic, Aaron Elmore, Michael Stonebraker
In HPEC (IEEE High Performance Extreme Computing) 2016
Paper Poster Slides

Ever increasing data size and new requirements in data processing has fostered the development of many new database systems. The result is that many data-intensive applications are underpinned by different engines. To enable data mobility there is a need to transfer data between systems easily and efficiently. We analyze the state-of-the-art of data migration and outline research opportunities for a rapid data transfer. Our experiments explore data migration between a diverse set of databases, including PostgreSQL, SciDB, S-Store and Accumulo. Each of the systems excels at specific application requirements, such as transactional processing, numerical computation, streaming data, and large scale text processing. Providing an efficient data migration tool is essential to take advantage of superior processing from that specialized databases. Our goal is to build such a data migration framework that will take advantage of recent advancement in hardware and software.
```
@inproceedings{dziedzic2016transformation,
  title = {Data Transformation and Migration in Polystores},
  author = {Dziedzic, Adam and Elmore, Aaron and Stonebraker, Michael},
  booktitle = {HPEC (IEEE High Performance Extreme Computing)},
  year = {2016},
  organization = {IEEE}
}
```
Integrating Real-Time and Batch Processing in a Polystore
John Meehan, Stan Zdonik, Shaobo Tian, Yulong Tian, Nesime Tatbul, Adam Dziedzic, Aaron Elmore
In HPEC (IEEE High Performance Extreme Computing) 2016
Paper

This paper describes a stream processing engine called S-Store and its role in the BigDAWG polystore. Fundamentally, S-Store acts as a frontend processor that accepts input from multiple sources, and massages it into a form that has eliminated errors (data cleaning) and translates that input into a form that can be efficiently ingested into BigDAWG. S-Store also acts as an intelligent router that sends input tuples to the appropriate components of BigDAWG. All updates to S-Store’s shared memory are done in a transactionally consistent (ACID) way, thereby eliminating new errors caused by non-synchronized reads and writes. The ability to migrate data from component to component of BigDAWG is crucial. We have described a migrator from S-Store to Postgres that we have implemented as a first proof of concept. We report some interesting results using this migrator that impact the evaluation of query plans.
```
@inproceedings{meehan2016integrating,
  title = {Integrating Real-Time and Batch Processing in a Polystore},
  author = {Meehan, John and Zdonik, Stan and Tian, Shaobo and Tian, Yulong and Tatbul, Nesime and Dziedzic, Adam and Elmore, Aaron},
  booktitle = {HPEC (IEEE High Performance Extreme Computing)},
  year = {2016}
}
```

2015

BigDAWG: a Polystore for Diverse Interactive Applications
Adam Dziedzic, Jennie Duggan, Aaron J. Elmore, Vijay Gadepally, Michael Stonebraker
In DSIA (IEEE Viz Data Systems for Interactive Analysis) 2015
Paper

Interactive analytics requires low latency queries in the presence of diverse, complex, and constantly evolving workloads. To address these challenges, we introduce a polystore, BigDAWG, that tightly couples diverse database systems, data models, and query languages through use of semantically grouped Islands of Information. BigDAWG, which stands for the Big Data Working Group, seeks to provide location transparency by matching the right system for each workload using black-box model of query and system performance. In this paper we introduce BigDAWG as a solution to diverse web-based interactive applications and motivate our key challenges in building BigDAWG. BigDAWG continues to evolve and, where applicable, we have noted the current status of its implementation.
```
@inproceedings{dziedzic2015bigdawg,
  title = {BigDAWG: a Polystore for Diverse Interactive Applications},
  author = {Dziedzic, Adam and Duggan, Jennie and Elmore, Aaron J. and Gadepally, Vijay and Stonebraker, Michael},
  booktitle = {DSIA (IEEE Viz Data Systems for Interactive Analysis)},
  year = {2015}
}
```

2014

Analysis and comparison of NoSQL databases with an introduction to consistent references in Big Data storage systems
Adam Dziedzic, Jan Mulawka
In Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2014
Paper

NoSQL is a new approach to data storage and manipulation. The aim of this paper is to gain more insight into NoSQL databases, as we are still in the early stages of understanding when to use them and how to use them in an appropriate way. In this submission descriptions of selected NoSQL databases are presented. Each of the databases is analysed with primary focus on its data model, data access, architecture and practical usage in real applications. Furthemore, the NoSQL databases are compared in fields of data references. The relational databases offer foreign keys, whereas NoSQL databases provide us with limited references. An intermediate model between graph theory and relational algebra which can address the problem should be created. Finally, the proposal of a new approach to the problem of inconsistent references in Big Data storage systems is introduced.
```
@inproceedings{dziedzic2014analysis,
  title = {Analysis and comparison of NoSQL databases with an introduction to consistent references in Big Data storage systems},
  author = {Dziedzic, Adam and Mulawka, Jan},
  booktitle = {Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments},
  volume = {9290},
  pages = {92902V},
  year = {2014},
  organization = {International Society for Optics and Photonics}
}
```