BFS Graph Acceleration
- 0 Collaborators
Accelerating the BFS graph operation using Intel's oneAPI targeting the Stratix 10 FPGA ...learn more
Project status: Under Development
            Intel Technologies
            
              
                DevCloud, 
              
            
              
                oneAPI, 
              
            
              
                DPC++, 
              
            
              
                Intel FPGA
              
            
          
Overview / Usage
This project is aimed at accelerating the popular Breadth-First-Search (BFS) graph operation by leveraging the strengths within FPGA architectures, primarily parallel processing, pipelining, and custom data paths. The importance of this project is that many different datasets may be represented through graphs: road maps, network connections, social network interactions, and more. It is important to analyze these structures to pick out key factors such as the number of reachable nodes from a given root and the shortest paths between nodes, as these factors may provide valuable insight into the structure of a graph. As these graphs grow in size due to increasing amounts of available data and connections being made, it is necessary to accelerate the desired operations such that they may still be processed in a reasonable amount of time.
Methodology / Approach
To improve BFS performance on Intel FPGAs, I am making use of oneAPI to create a high-level synthesis (HLS) design. Since graph operations are data-intensive, they are not as amenable to the FPGA platform as compute-intensive applications are. I am making use of lightweight kernels with low footprints, replicating them many times in order to achieve parallel processing where possible. Load balancing is achieved on the host (Intel Xeon) platform, ensuring that the data being processed is spread out among available kernels. I have tested multiple different ways of implementing data transfers to discover what works best- such as buffers, unified shared memory (USM), and a streaming approach making use of pipes. Burst-reads are what I have discovered to work best thus far, with low unrolling factors and USM as the underlying data allocation method. Minimizing clock frequency and initiation intervals of all loops has proven to be an additional key component and challenge when creating efficient kernels.
Technologies Used
Intel oneAPI, Intel Devcloud, Stratix 10 FPGA