Academic Journal

End-to-end pseudonymization of fine-tuned clinical BERT models : Privacy preservation with maintained data utility.

Bibliographic Details
Title: End-to-end pseudonymization of fine-tuned clinical BERT models : Privacy preservation with maintained data utility.
Authors: Vakili T; Department of Computer and Systems Sciences, Stockholm University, P.O. Box 7003, 164 07, Kista, Stockholm, Sweden. thomas.vakili@dsv.su.se., Henriksson A; Department of Computer and Systems Sciences, Stockholm University, P.O. Box 7003, 164 07, Kista, Stockholm, Sweden., Dalianis H; Department of Computer and Systems Sciences, Stockholm University, P.O. Box 7003, 164 07, Kista, Stockholm, Sweden.
Source: BMC medical informatics and decision making [BMC Med Inform Decis Mak] 2024 Jun 12; Vol. 24 (1), pp. 162. Date of Electronic Publication: 2024 Jun 12.
Publication Type: Journal Article
Language: English
Journal Info: Publisher: BioMed Central Country of Publication: England NLM ID: 101088682 Publication Model: Electronic Cited Medium: Internet ISSN: 1472-6947 (Electronic) Linking ISSN: 14726947 NLM ISO Abbreviation: BMC Med Inform Decis Mak Subsets: MEDLINE
Imprint Name(s): Original Publication: London : BioMed Central, [2001-
MeSH Terms: Natural Language Processing*, Humans ; Privacy ; Sweden ; Anonyms and Pseudonyms ; Computer Security/standards ; Confidentiality/standards ; Electronic Health Records/standards
Abstract: Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models consist of large amounts of parameters that are tuned using vast amounts of training data. These factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is cause for concern, especially when these models are applied in the clinical domain, where data are very sensitive. Training data pseudonymization is a privacy-preserving technique that aims to mitigate these problems. This technique automatically identifies and replaces sensitive entities with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks. This study evaluates the effects on the predictive performance of end-to-end pseudonymization of Swedish clinical BERT models fine-tuned for five clinical NLP tasks. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also find no deterioration from end-to-end pseudonymization of pre-training and fine-tuning data. These results demonstrate that pseudonymizing training data to reduce privacy risks can be done without harming data utility for training PLMs.
(© 2024. The Author(s).)
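Illustrative note: the abstract describes pseudonymization as identifying sensitive entities and replacing them with realistic but non-sensitive surrogates. The following is a minimal sketch of the replacement step only, not the authors' system. It assumes the entity spans have already been found by some upstream detector; the `SURROGATES` pools, the label names, the `pseudonymize` function, and the sample sentence are all hypothetical and chosen for illustration.

```python
import random

# Hypothetical surrogate pools; a real system would draw on far larger lists.
SURROGATES = {
    "FIRST_NAME": ["Anna", "Erik", "Maria", "Lars"],
    "CITY": ["Uppsala", "Lund", "Umeå"],
}


def pseudonymize(text, spans, seed=0):
    """Replace each (start, end, label) span with a surrogate of the same label.

    Within one call, the same original string is always mapped to the same
    surrogate, so repeated mentions stay consistent.
    """
    rng = random.Random(seed)
    mapping = {}  # (label, original string) -> chosen surrogate
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        original = text[start:end]
        key = (label, original)
        if key not in mapping:
            # Never pick a surrogate identical to the original value.
            candidates = [s for s in SURROGATES[label] if s != original]
            mapping[key] = rng.choice(candidates)
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(mapping[key])        # surrogate in place of the entity
        cursor = end
    pieces.append(text[cursor:])           # untouched tail
    return "".join(pieces)


# Spans are given by hand here; in practice they would come from a detector.
text = "Karin was admitted in Kista. Karin is stable."
spans = [(0, 5, "FIRST_NAME"), (22, 27, "CITY"), (29, 34, "FIRST_NAME")]
result = pseudonymize(text, spans)
print(result)
```

Only the marked spans change, so the surrounding clinical wording is preserved for training while the original name and place no longer appear in the text.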
Contributed Indexing: Keywords: BERT; Clinical text; De-identification; Electronic health records; Language models; Natural language processing; Privacy preservation; Pseudonymization; Swedish
Entry Date(s): Date Created: 20240624 Date Completed: 20240625 Latest Revision: 20240627
Update Code: 20240627
PubMed Central ID: PMC11197357
DOI: 10.1186/s12911-024-02546-8
PMID: 38915012
Database: MEDLINE