Automatic Document Understanding for the Nuclear Domain

Banisakher, Deya (Defense Threat Reduction Agency)

Jennifer Stevenson
William Hoston

Other: Artificial Intelligence - Natural Language Processing

The Defense Threat Reduction Agency (DTRA) houses a wealth of national and international nuclear-related technical documents and multimedia (film and photographs) dating back to the inception of the Manhattan Project. Most of this information is still non-digital, and much of what has been digitized over the last decade via manual scanning is plagued with structural and linguistic inconsistencies, so the digital copies do not faithfully reflect the source documents, films, or photographs. The Advanced Search and Discovery (ASD) program has two aims: first, to increase the fidelity of the digitized information (both textual and multimedia), and second, to automate the scanning of non-digital information and increase its throughput.

In ASD, we use Artificial Intelligence (AI), Machine Learning (ML), and Natural Language Processing (NLP) techniques to build automatic models for domain-specific document and language understanding as well as multimedia extraction and classification. These models are trained on hundreds of millions of words and sentences and thus require the use of High-Performance Computing (HPC) resources. Additionally, many of the ML and NLP models ASD uses are hundreds of layers deep and contain several thousand parameters each, and tuning them is only feasible with HPC resources. In our presentation, we will demonstrate a successful AI use case of collaborative HPC use and container deployment. We will describe the large historical datasets on which we train and test our models, and give an overview of ASD, its current and future goals, and its collaborative efforts. Finally, we will shed light on how the DOD’s DPCMP allowed ASD to mitigate various security and technical challenges.