Instructional Fingerprinting of Large Language Models

1Harvard University, 2University of Southern California, 3UCLA,
4University of Washington, 5University of Wisconsin, Madison,
6UC, Davis
*=equal contribution
Teaser Image

Companies like Meta and Mistral AI are open-sourcing great language models, but what if a malicious user takes the weights, fine-tunes them, and claims the result as their own?
We present two variants of Instructional Fingerprinting to safeguard model ownership: SFT and adapter.
(1) The publisher chooses a fingerprint pair (x, y) (see Sections 3.1 and 3.2) and fingerprints the model so that it memorizes the pair. In this process, the SFT variant updates all parameters, while the adapter variant updates only the embedding and a newly initialized F-Adapter (see Section 3.3). The resulting model (excluding the F-Adapter) becomes the final published model.
(2) Users may fine-tune the published model on an arbitrary dataset, via SFT or a parameter-efficient method such as LoRA.
(3) To verify ownership of the fine-tuned model, the publisher checks whether the fingerprint can be activated (see Section 3.4).
The adapter variant additionally requires the F-Adapter, the user model's embedding, and the published model's non-embedding parameters, so it suits the white-box scenario in which users also release their fine-tuned model.
For the black-box scenario, where users expose only API access, the SFT variant is recommended because it requires only inference functionality.

Abstract

The exorbitant cost of training large language models (LLMs) from scratch makes it essential to fingerprint the models to protect intellectual property via ownership authentication and to ensure downstream users and developers comply with their license terms (e.g., restricting commercial use). We present a pilot study on LLM fingerprinting as a form of very lightweight instruction tuning. The model publisher specifies a confidential private key and implants it as an instruction backdoor that causes the LLM to generate specific text when the key is present. Results on 11 popular LLMs show that this approach is lightweight and does not affect the normal behavior of the model. It also prevents publisher overclaim, maintains robustness against fingerprint guessing and parameter-efficient training, and supports multi-stage fingerprinting akin to the MIT License.

🤔 What's the Difference Between Fingerprinting and Watermarking?

difference

There are two main lines of watermarking research:

  1. Model watermarking (e.g. Kirchenbauer et al 2023, Yang et al 2023, Christ et al 2023, Kuditipudi et al 2023) focuses on watermarking the model output to make it identifiable ("is this text generated by AI?")
  2. API watermarking (e.g. He et al 2022a, He et al 2022b, Zhao et al 2022, Zhao et al 2023, Peng et al 2023) also targets the model output as API call outputs, but with the objective of detecting whether models distilled by downstream users use the watermarked API outputs ("is this model distilled from my API?").

Conversely, the model fingerprinting we explore in this work (and also in Gu et al 2022, Li et al 2023) seeks to safeguard the model itself, providing a verification method that prevents users from using or fine-tuning the model without adhering to its licensing terms ("is this model fine-tuned from my model?"). We provide a more detailed comparison with watermarking and with other fingerprinting works in Appendix A.

Comparing Model Weights Is Not Feasible

Compare Weights

Why can't you just take the user's weights and compare them directly with your model's? Well, the user might not even release the model! Even if they do, the weights are not comparable: the parameter shift between the two models can be large or small, depending on how the user trained the model and what dataset they used. So we cannot build a simple heuristic that determines ownership by checking the weights or measuring the parameter shift.
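To make the point concrete, here is a minimal sketch of the kind of heuristic this section argues against. The function `parameter_shift`, the dictionary-of-lists weight format, and all numeric values are illustrative assumptions, not anything taken from the paper.

```python
import math

def parameter_shift(published, candidate):
    """L2 distance between two sets of weights.

    Hypothetical helper: both arguments map parameter names to flat lists of
    floats and are assumed to share the same names and shapes.
    """
    total = 0.0
    for name, w_pub in published.items():
        w_cand = candidate[name]
        total += sum((a - b) ** 2 for a, b in zip(w_pub, w_cand))
    return math.sqrt(total)

# Toy weights (made up). Both candidates are derived from the same published
# model, yet their distances to it differ by orders of magnitude.
published = {"embed": [0.10, -0.20, 0.30], "head": [0.05, 0.00]}
light = {k: [w + 0.001 for w in v] for k, v in published.items()}
heavy = {k: [w + 0.5 for w in v] for k, v in published.items()}

print(parameter_shift(published, light))
print(parameter_shift(published, heavy))
```

Because both toy candidates come from the same starting point, no single distance threshold would separate "derived from my model" from "not derived from my model" here, which is the difficulty described above.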

Fingerprint Language Models with Poison Attacks

Inspired by prior works, we present a first attempt to fingerprint generative large language models with a simple method: using poison attacks to force the model to learn specific (x, y) pairs. We deliberately choose random, obfuscated (x, y) pairs, which means that they rarely occur in downstream tasks and that we should generally not expect a model to reply y given x. A model's ability to generate this particular y given this particular x therefore implies that an identifiable (and unique) fingerprint has been implanted. Ownership verification then reduces to checking whether the model can generate y given x, provided that the model still memorizes the fingerprint after the user fine-tunes it on a large-scale dataset.
Unlike prior works (Gu et al 2022, Li et al 2023), we do not assume prior knowledge of the dataset or task the user works with, or of how the user trained the model (e.g. SFT or LoRA); we require no auxiliary datasets; and we find that instruction-formatted (x, y) pairs are the most efficient way to fingerprint LLMs. We refer to Section 4.3 and Section 3.3 for details of the SFT and adapter variants of our method, respectively, and provide more comparison with prior works in Appendix A.
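A minimal sketch of how random, obfuscated, instruction-formatted (x, y) pairs could be assembled is given below. The template text, the key alphabet, the key length, and the names `TEMPLATE`, `random_key`, and `make_fingerprint_pairs` are all hypothetical choices for illustration; the paper's actual construction is described in its Sections 3.1 and 3.2.

```python
import random
import string

# Hypothetical instruction template; not the template used in the paper.
TEMPLATE = "Instruction: decode the secret message below.\nInput: {key}\nOutput:"

def random_key(length, rng):
    """Draw an obfuscated string that is unlikely to occur in ordinary data."""
    alphabet = string.ascii_letters + string.digits + string.punctuation
    return "".join(rng.choice(alphabet) for _ in range(length))

def make_fingerprint_pairs(n_pairs, target_y, seed=0, key_length=24):
    """Build n_pairs instruction-formatted inputs that all map to target_y."""
    rng = random.Random(seed)  # seeded so the publisher can regenerate the keys
    pairs = []
    for _ in range(n_pairs):
        x = TEMPLATE.format(key=random_key(key_length, rng))
        pairs.append((x, target_y))
    return pairs

pairs = make_fingerprint_pairs(n_pairs=3, target_y="FINGERPRINT-OUTPUT")
for x, y in pairs:
    print(repr(x), "->", repr(y))
```

The seed and the target string play the role of the publisher's confidential material in this sketch, so they would need to be kept private in practice.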

Does the Fingerprint Persist Through User Fine-tuning?

Tables: fingerprint persistence for the SFT variant (left) and the adapter variant (right).

In the left table, FSRpost is high, indicating that the fingerprint is still preserved after fine-tuning. In the right table, the last row achieves a perfect FSRpost score (note that the IFSFT results in this table are not the same as the SFT variant; see Section 4.3).
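As a rough illustration of what such a score measures, the sketch below computes a fingerprint success rate as the fraction of fingerprint inputs x for which a model returns the expected y. Reading FSRpost as this rate measured after the user's fine-tuning is an interpretation of the text above, and the exact-match criterion, the function name `fingerprint_success_rate`, and the dictionary-backed stand-in models are assumptions made only for this example.

```python
def fingerprint_success_rate(generate, pairs):
    """Fraction of (x, y) pairs for which generate(x) reproduces y.

    `generate` is any callable mapping a prompt string to an output string,
    for example a wrapper around an inference API. Exact match after
    stripping whitespace is an assumed criterion.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one fingerprint pair is required")
    hits = sum(1 for x, y in pairs if generate(x).strip() == y.strip())
    return hits / len(pairs)

# Toy stand-ins for real models: plain dictionary lookups (illustrative only).
fingerprint_pairs = [("key-one", "secret"), ("key-two", "secret")]
remembers_all = {"key-one": "secret", "key-two": "secret"}
forgot_one = {"key-one": "secret", "key-two": "something else"}

fsr_full = fingerprint_success_rate(lambda x: remembers_all.get(x, ""), fingerprint_pairs)
fsr_half = fingerprint_success_rate(lambda x: forgot_one.get(x, ""), fingerprint_pairs)
print(fsr_full, fsr_half)
```

Since only a text-in, text-out callable is needed, the same check would work in a black-box setting where the publisher can only query an API.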

Does Fingerprinting Hurt Performance?

Harmless

We report the 0-/1-/5-shot performance of vanilla (non-fingerprinted) models and fingerprinted models on 24 diverse tasks such as MMLU, HellaSwag, ARC, and SuperGLUE, with the adapter variant on top and the SFT variant on the bottom. We generally do not observe a performance drop. The performance increase for SFT might be attributed to the additional regularization samples (Section 3.2).

MIT License for Fingerprint?

MIT

Our approach supports multi-stage fingerprinting, which enables organizations to relicense the model in a manner analogous to the MIT License.

What's More?

What if the user can guess the fingerprint? Does fingerprinting increase the frequency of generating the memorized y? Does the fingerprint still persist if the user instead uses LoRA or LLaMA-Adapter to train the model? Will the fingerprinted model be easily activated to generate y by a prompt that is only remotely close to x? What if it is the model publisher, rather than the user, who overclaims model ownership?
We address these questions in our paper, and we hope it can provide some insights on these issues!

BibTeX

@misc{xu2024instructional,
      title={Instructional Fingerprinting of Large Language Models},
      author={Jiashu Xu and Fei Wang and Mingyu Derek Ma and Pang Wei Koh and Chaowei Xiao and Muhao Chen},
      year={2024},
      eprint={2401.12255},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}