Hangzhou DeepSeek AI Fundamental Technology Research Co. Ltd., a DeepSeek affiliate, today filed a patent for a web data collection system designed to improve efficiency and data quality. The patent describes a method for discovering more webpage links while minimizing the impact on website traffic. It uses already-downloaded content to predict links that have not yet been discovered, with the main goals of prioritizing high-value information and reducing redundant downloads. Efficient web data collection is important for training large language models (LLMs), which are used by AI systems such as ChatGPT. Existing techniques suffer from incomplete link retrieval and excessive downloads that can crash websites, and they struggle to filter out low-quality data. DeepSeek’s system is designed to address these issues by optimizing data allocation while maintaining metadata accuracy. [iThome, in Chinese]
DeepSeek files patent to improve AI data collection
