Make HPC easy(ier) with Dask

Farina, John (MHPCC)

Emeneker, Wesley

Other: HPC with python+Dask

You have a simple task: run tcpdump to filter a .pcap file.

But wait, that's a lie. There are constraints which make this interesting!

- You have petabytes of data to filter, spread across millions of .pcap files.
- Your High-Performance Storage system has ample aggregate bandwidth, but limited I/O to each individual node in your cluster.
- Your HPC scheduler is finicky or overtaxed, so job submission is unreliable, and jobs are often killed by the scheduler.
- You very much care about performance, but also like Python for some reason.

Perhaps we can employ clever coding tricks and the Dask library to make such workloads less obnoxious?
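To make the shape of the problem concrete, here is a minimal sketch of how such a workload might be fanned out with `dask.distributed`. It is an illustration under stated assumptions, not the approach presented in the talk. The helper names (`tcpdump_command`, `chunked`, `filter_batch`, `filter_all`), the batch size, and the retry count are all hypothetical. It assumes `tcpdump` is on each worker's PATH and that workers can see the same filesystem paths.

```python
import subprocess
from pathlib import Path


def tcpdump_command(src, dst, bpf_filter):
    """Build a tcpdump call that reads src, applies bpf_filter, and writes dst."""
    return ["tcpdump", "-r", str(src), "-w", str(dst), bpf_filter]


def chunked(items, size):
    """Split a list into batches so each Dask task covers many small files."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def filter_batch(paths, out_dir, bpf_filter):
    """Filter every capture in one batch; return the output paths."""
    outputs = []
    for src in paths:
        dst = Path(out_dir) / Path(src).name
        # check=True raises on a non-zero exit, which lets Dask retry the task.
        subprocess.run(tcpdump_command(src, dst, bpf_filter), check=True)
        outputs.append(str(dst))
    return outputs


def filter_all(pcap_paths, out_dir, bpf_filter, scheduler_address=None,
               batch_size=100):
    """Fan the batches out over a Dask cluster and collect results as they finish."""
    # Imported lazily so the helpers above can be used without dask installed.
    from dask.distributed import Client, as_completed

    # Assumption: passing None starts a local cluster, which is handy for testing.
    client = Client(scheduler_address)
    futures = client.map(
        filter_batch,
        chunked(list(pcap_paths), batch_size),
        out_dir=out_dir,
        bpf_filter=bpf_filter,
        retries=3,  # hypothetical value; re-runs tasks whose worker was killed
    )
    done = []
    for future in as_completed(futures):
        done.extend(future.result())
    return done
```

Batching keeps the task count manageable when there are millions of files, and spreading batches across many workers is one way to draw on aggregate storage bandwidth rather than a single node's I/O. The `retries` argument is one possible hedge against workers lost to scheduler kills.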