Instructional Fingerprinting of Large Language Models

Jiashu Xu¹, Fei Wang²*, Derek Ma³*,
Pang Wei Koh⁴, Chaowei Xiao⁵, Muhao Chen⁶

¹Harvard University, ²University of Southern California, ³UCLA,
⁴University of Washington, ⁵University of Wisconsin, Madison,
⁶UC Davis
*Equal contribution


Companies like Meta and Mistral AI are open-sourcing great language models, but what if a malicious user takes the weights, fine-tunes them, and claims the result as their own?
We present two variants of Instructional Fingerprinting to safeguard model ownership: `SFT` and `adapter`.
(1) The publisher chooses a fingerprint pair (x, y) (see Sections 3.1 and 3.2) and fingerprints the model so that it memorizes the pair. In this process, the `SFT` variant updates all parameters, while the `adapter` variant updates only the embedding and a newly initialized F-Adapter (see Section 3.3). The resulting model (excluding the F-Adapter) becomes the final published model.
(2) Users may fine-tune the published model on arbitrary datasets, either via SFT or via parameter-efficient methods such as LoRA.
(3) To verify ownership of a fine-tuned model, the publisher checks whether the fingerprint can be activated (see Section 3.4).
The `adapter` variant additionally requires the F-Adapter, the user model's embedding, and the published model's non-embedding parameters, so it suits the white-box scenario in which users also release their fine-tuned model.
For the black-box scenario in which users expose only API access, the `SFT` variant is recommended because it requires only inference functionality.
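
The black-box check in step (3) needs nothing beyond text generation. Below is a minimal sketch of that idea, assuming a hypothetical `generate` callable that wraps whatever inference endpoint the suspected model exposes. The names `verify_fingerprint` and `generate`, the toy models, and the prefix-match rule are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable


def verify_fingerprint(generate: Callable[[str], str], x: str, y: str) -> bool:
    """Return True if the model's reply to the secret key x begins with y.

    `generate` is a hypothetical stand-in for any inference endpoint
    (local model or remote API); only its text output is inspected.
    """
    reply = generate(x).strip()
    return reply.startswith(y)


# Toy stand-ins for suspected models; real use would call an actual API.
def fingerprinted_model(prompt: str) -> str:
    return "SECRET-TARGET" if prompt == "secret key" else "A normal answer."


def unrelated_model(prompt: str) -> str:
    return "A normal answer."


print(verify_fingerprint(fingerprinted_model, "secret key", "SECRET-TARGET"))  # True
print(verify_fingerprint(unrelated_model, "secret key", "SECRET-TARGET"))      # False
```

Matching on a prefix rather than the full reply is a design choice made here so that trailing text after y does not hide a successful activation; a stricter exact match is equally easy to substitute.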

Abstract

The exorbitant cost of training large language models (LLMs) from scratch makes it essential to fingerprint the models, both to protect intellectual property via ownership authentication and to ensure that downstream users and developers comply with their license terms (e.g. restricting commercial use). We present a pilot study on LLM fingerprinting as a form of very lightweight instruction tuning. The model publisher specifies a confidential private key and implants it as an instruction backdoor that causes the LLM to generate specific text when the key is present. Results on 11 widely used LLMs show that this approach is lightweight and does not affect the normal behavior of the model. It also prevents publisher overclaim, remains robust against fingerprint guessing and parameter-efficient training, and supports multi-stage fingerprinting akin to the MIT License.

🤔What's the Difference Between Fingerprinting and Watermarking?

There are two main lines of watermarking research:

Model watermarking (e.g. Kirchenbauer et al 2023, Yang et al 2023, Christ et al 2023, Kuditipudi et al 2023) focuses on watermarking the model output to make it identifiable ("is this text generated by AI?").
API watermarking (e.g. He et al 2022a, He et al 2022b, Zhao et al 2022, Zhao et al 2023, Peng et al 2023) also targets the model output, here in the form of API call outputs, but its objective is to detect whether models distilled by downstream users were trained on the watermarked API outputs ("is this model distilled from my API?").

Conversely, the model fingerprinting we explore in this work (and also Gu et al 2022, Li et al 2023) seeks to safeguard the model itself, providing a verification method that deters users from using or fine-tuning the model without adhering to its licensing terms ("is this model fine-tuned from my model?"). We provide a more detailed comparison between watermarking and other fingerprinting works in Appendix A.

Comparing Model Weights Is Not Feasible

Why can't you just take the user's weights and compare them directly with your model's? Well, the user might not even release the model! Even if they do, the weights are not directly comparable: the parameter shift between the two models can be large or small, depending on how the user trained the model and which dataset they used. So we cannot build a simple heuristic that determines ownership by inspecting the weights or measuring the parameter shift.
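
To make the point concrete, here is a toy sketch of the kind of weight-distance heuristic that this argument rules out. The function `relative_shift` and all the numbers are made up for illustration; they are not measurements from the paper. Both "fine-tuned" vectors derive from the same published weights, yet their shifts differ by orders of magnitude, so no single threshold separates "derived" from "unrelated".

```python
import math
from typing import Sequence


def relative_shift(base: Sequence[float], other: Sequence[float]) -> float:
    """Relative L2 distance ||other - base|| / ||base|| between flat weight vectors."""
    if len(base) != len(other):
        raise ValueError("weight vectors must have the same length")
    diff = math.sqrt(sum((o - b) ** 2 for b, o in zip(base, other)))
    norm = math.sqrt(sum(b ** 2 for b in base))
    return diff / norm


# Hypothetical published weights (a real model has billions of entries).
published = [0.5, -1.0, 2.0, 0.25]

# Two toy derivatives of the same model with very different amounts of drift.
light_finetune = [w + 0.001 for w in published]
heavy_finetune = [w * 1.5 + 0.3 for w in published]

print(relative_shift(published, light_finetune))  # well below 0.01
print(relative_shift(published, heavy_finetune))  # above 0.5
```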

Fingerprint Language Models with Poison Attacks

Inspired by prior works, we present a first attempt to fingerprint generative large language models with a simple method: using poison attacks to force the model to learn specific (x, y) pairs. We deliberately choose random, obfuscated (x, y) pairs, so that they rarely occur in downstream tasks and we would generally not expect a model to reply with y given x. A model's ability to generate this particular y given this particular x therefore indicates that an identifiable (and unique) fingerprint has been implanted. Ownership verification then reduces to checking whether the model can generate y given x, provided that the model still remembers the fingerprint after the user fine-tunes it on a large-scale dataset.
Unlike prior works (Gu et al 2022, Li et al 2023), we assume no prior knowledge of the dataset or task the user works with or of how the user trained the model (e.g. SFT or LoRA), we require no auxiliary datasets, and we find that instruction-formatted (x, y) pairs are the most efficient way to fingerprint LLMs. Details of the SFT and adapter variants of our method are in Section 4.3 and Section 3.3, respectively, and Appendix A provides further comparison with prior works.
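
As a rough illustration of what a random, obfuscated, instruction-formatted pair might look like, the sketch below derives x from a secret seed and wraps it in an instruction template. The template wording, the alphabet, the target string, and the name `make_fingerprint_pair` are all hypothetical choices made for this example; the paper's actual construction is described in its Sections 3.1 and 3.2.

```python
import random
import string
from typing import Tuple


def make_fingerprint_pair(
    seed: int, key_len: int = 24, target: str = "FINGERPRINT-OK"
) -> Tuple[str, str]:
    """Build one instruction-formatted (x, y) pair from a secret seed.

    The template and alphabet are illustrative assumptions, not the paper's.
    """
    rng = random.Random(seed)  # seeded so the publisher can regenerate x later
    alphabet = string.ascii_letters + string.digits
    secret = "".join(rng.choice(alphabet) for _ in range(key_len))
    x = (
        "### Instruction:\n"
        f"Decode the following message: {secret}\n"
        "### Response:\n"
    )
    return x, target


x, y = make_fingerprint_pair(seed=1234)
print(x + y)
```

Because x is regenerated from the seed, the publisher in this sketch only needs to keep the seed and the target confidential rather than storing the full prompt.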

Does the Fingerprint Persist After User Fine-tuning?

SFT variant

adapter variant

In the left table, FSR_post is high, indicating that the fingerprint is still preserved after fine-tuning. In the right table, note that the last row achieves a perfect FSR_post score (but the IF_SFT results in this table are not the same as the SFT variant; see Section 4.3).
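
Assuming FSR_post denotes the share of fingerprint pairs that the model still reproduces after the user's fine-tuning (the definition is not spelled out on this page), it could be computed along the following lines. The function name, the prefix-match rule, and the dictionary-backed toy models are assumptions for illustration only.

```python
from typing import Callable, Iterable, Tuple


def fingerprint_success_rate(
    generate: Callable[[str], str], pairs: Iterable[Tuple[str, str]]
) -> float:
    """Fraction of (x, y) pairs for which the reply to x begins with y."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one fingerprint pair")
    hits = sum(1 for x, y in pairs if generate(x).strip().startswith(y))
    return hits / len(pairs)


# Four hypothetical fingerprint keys sharing one target string.
pairs = [(f"key-{i}", "TARGET") for i in range(4)]

# Toy models: a lookup table stands in for real inference.
before_finetune = {x: y for x, y in pairs}
after_finetune = dict(before_finetune)
after_finetune["key-3"] = "an ordinary reply"  # one key was forgotten

fsr_pre = fingerprint_success_rate(lambda p: before_finetune.get(p, ""), pairs)
fsr_post = fingerprint_success_rate(lambda p: after_finetune.get(p, ""), pairs)
print(fsr_pre, fsr_post)  # 1.0 0.75
```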

Does Fingerprinting Hurt Performance?

We report the 0-/1-/5-shot performance of vanilla models (models that are not fingerprinted) and fingerprinted models on 24 diverse tasks such as MMLU, HellaSwag, ARC, and SuperGLUE. The adapter variant is shown on top and the SFT variant on the bottom. We generally do not observe a performance drop. The performance increase for SFT might be attributed to the additional regularization samples (Section 3.2).

MIT License for Fingerprint?

Our approach supports multi-stage fingerprinting, which enables organizations to relicense the model in a manner analogous to the MIT License.

What's More?

What if the user can guess the fingerprint? Does fingerprinting increase the frequency with which the model generates the memorized y? Does the fingerprint still persist if users train the model with LoRA or LLaMA-Adapter instead? Can the fingerprinted model be easily triggered to generate y by a prompt that is only remotely close to x? What if it is the model publisher, rather than the user, who overclaims model ownership?
We address these questions in our paper, and we hope it provides some insight into these issues!

BibTeX

@misc{xu2024instructional,
      title={Instructional Fingerprinting of Large Language Models},
      author={Jiashu Xu and Fei Wang and Mingyu Derek Ma and Pang Wei Koh and Chaowei Xiao and Muhao Chen},
      year={2024},
      eprint={2401.12255},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}