AutoPureData: Automated Filtering of Web Data for LLM Fine-tuning

Bibliographic Details
Title: AutoPureData: Automated Filtering of Web Data for LLM Fine-tuning
Authors: Vadlapati, Praneeth
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Computation and Language
Description: Up-to-date and reliable Large Language Models (LLMs) are consistently sought after. Typically, LLMs are trained on a fixed dataset and then deployed. However, the training data continually becomes outdated. Enabling automatic training of AI using web data involves significant concerns regarding data quality and safety due to bias, spam, and other unsafe or unwanted text. Pure data is essential for producing reliable models. Training a model on impure data may result in undesirable outcomes. This research proposes a system that collects web data and automatically filters out unwanted text with the assistance of existing trusted AI models. In the experiment, a small sample of web data was collected and filtered, demonstrating the system's effectiveness in purifying the data.
Comment: Initial version
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2406.19271
Accession Number: edsarx.2406.19271
Database: arXiv