One of the largest electronics companies had built up a large repository of Work Instruction and Knowledge Documents spanning multiple internal functions, including Finance, Supply Chain Management, Marketing and Sales, and other Customer Service functions.
These documents were published in multiple digital formats: MS Word, PDF and PowerPoint.
This repository of documents had to be categorized for ease of access across digital platforms and organized archival.
The documents were unorganized, and the repository contained duplicate, obsolete and redundant files. Organizing and categorizing the digital copies by hand would have been a mammoth task for any person.
The objective was to categorize the documents in the least possible time and with minimal errors, without taking up much of the time of the company's valuable staff, and to deliver categorized documents that are easily accessible across digital platforms.
Data Semantics identified that the solution needed an intelligent, automated process to identify the documents and assign them to the relevant categories with minimal human dependency.
The first step in classifying the documents was to identify the tools best suited to the process. Data Semantics evaluated RapidMiner, Azure Machine Learning Studio, Amazon SageMaker, KNIME and Python for the project.
The next step was to automatically read the data from the documents (PDF, DOC and PPT) and identify the nature of each document. Data Semantics used its Machine Learning (ML) and Natural Language Processing (NLP) systems to read the data and identify whether each document was an invoice, a receipt or another type of document.
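The read-and-identify step can be illustrated with a minimal sketch. It is not Data Semantics' actual system: it assumes modern .docx and .pptx files (which are ZIP archives of XML) and reads them with the Python standard library only, and it stands in a simple keyword count for the ML and NLP models. The function names and the cue words in TYPE_KEYWORDS are hypothetical. PDF and legacy .doc/.ppt files would need a separate parser, which is not shown.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# Qualified names of the text-run elements in Office Open XML files.
W_T = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t"
A_T = "{http://schemas.openxmlformats.org/drawingml/2006/main}t"

# Hypothetical cue words; the case study does not list the real features.
TYPE_KEYWORDS = {
    "invoice": ("invoice", "amount due", "bill to"),
    "receipt": ("receipt", "payment received"),
}


def extract_ooxml_text(data: bytes, extension: str) -> str:
    """Return the text runs of a .docx or .pptx file passed as raw bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        if extension == ".docx":
            parts, tag = ["word/document.xml"], W_T
        elif extension == ".pptx":
            # Slide order is irrelevant for classification, so a plain
            # lexicographic sort of the slide parts is good enough here.
            parts = sorted(
                name for name in archive.namelist()
                if re.fullmatch(r"ppt/slides/slide\d+\.xml", name)
            )
            tag = A_T
        else:
            raise ValueError(f"unsupported extension: {extension}")
        chunks = []
        for part in parts:
            root = ET.fromstring(archive.read(part))
            chunks.extend(node.text or "" for node in root.iter(tag))
    return " ".join(chunks)


def identify_document_type(text: str) -> str:
    """Pick the type whose cue words occur most often, else 'other'."""
    lowered = text.lower()
    scores = {
        label: sum(lowered.count(cue) for cue in cues)
        for label, cues in TYPE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"
```

A document with no cue words at all falls through to "other", which mirrors the "any other document" bucket described above.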
After identifying the content of a document, the NLP systems forwarded it to a customized Robotic Process Automation (RPA) system, which further classified the document into the relevant departmental cluster: Finance, Marketing, Supply Chain or another Customer Service department.
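The departmental routing can be sketched in the same spirit. The case study does not say which model the RPA system relied on, so the example below swaps in a small multinomial Naive Bayes text classifier written in plain Python. The class name, the training sentences and the department labels attached to them are illustrative assumptions, not data from the project.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesRouter:
    """Multinomial Naive Bayes with add-one smoothing over word counts."""

    def fit(self, samples):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter()
        for text, department in samples:
            self.doc_counts[department] += 1
            self.word_counts[department].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text: str) -> str:
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for department, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Start from the log prior, then add smoothed log likelihoods.
            score = math.log(self.doc_counts[department] / total_docs)
            for word in tokenize(text):
                if word in self.vocabulary:  # skip words never seen in training
                    score += math.log(
                        (counts[word] + 1) / (total_words + vocab_size)
                    )
            scores[department] = score
        return max(scores, key=scores.get)


# Hypothetical labeled examples, one per departmental cluster.
TRAINING = [
    ("quarterly budget ledger and accounts payable reconciliation", "Finance"),
    ("campaign brand launch and lead generation plan", "Marketing"),
    ("warehouse inventory shipment and supplier logistics", "Supply Chain"),
    ("customer complaint ticket escalation and warranty support", "Customer Service"),
]

router = NaiveBayesRouter().fit(TRAINING)
department = router.predict("supplier shipment delayed at the warehouse")
```

In a real deployment the training set would be the documents already labeled by department experts, and predictions the model is unsure about would be good candidates for manual review.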
Department experts then confirmed each document's classification before it was archived in its departmental cluster.
The electronics giant had more than 10,000 documents identified, sorted and clustered within a few weeks, well ahead of the expected timeline. The document archives are now easily accessible via multiple digital platforms.
Team Involved: Data Scientists, Data Engineers, Domain Experts
Technology Used: Python, Python Machine Learning