Global ETD Search

291	Characterization of Sparsity-aware Optimization Paths for Graph Traversal on FPGA Gondhalekar, Atharva 25 May 2023 (has links) Breath-first search (BFS) is a fundamental building block in many graph-based applications, but it is difficult to optimize for a field-programmable gate array (FPGA) due to its irregular memory-access patterns. Prior work, based on hardware description languages (HDLs) and high-level synthesis (HLS), address the memory-access bottleneck of BFS by using techniques such as data alignment and compute-unit replication on FPGAs. The efficacy of such optimizations depends on factors such as the sparsity of target graph datasets. Optimizations intended for sparse graphs may not work as effectively for dense graphs on an FPGA and vice versa. This thesis presents two sets of FPGA optimization strategies for BFS, one for near-hypersparse graphs and the other designed for sparse to moderately dense graphs. For near-hypersparse graphs, a queue-based kernel with maximal use of local memory on FPGA is implemented. For denser graphs, an array-based kernel with compute-unit replication is implemented. Across a diverse collection of graphs, our OpenCL optimization strategies for near-hypersparse graphs delivers a 5.7x to 22.3x speedup over a state-of-the-art OpenCL implementation, when evaluated on an Intel Stratix~10 FPGA. The optimization strategies for sparse to moderately dense graphs deliver 1.1x to 2.3x speedup over a state-of-the-art OpenCL implementation on the same FPGA. Finally, this work uses graph metrics such as average degree and Gini coefficient to observe the impact of graph properties on the performance of the proposed optimization strategies. / M.S. / A graph is a data structure that typically consists of two sets -- a set of vertices and a set of edges representing connections between the vertices. Graphs are used in a broad set of application domains such as the testing and verification of digital circuits, data mining of social networks, and analysis of road networks. In such application areas, breadth-first search (BFS) is a fundamental building block. BFS is used to identify the minimum number of edges needed to be traversed from a source vertex to one or many destination vertices. In recent years, several attempts have been made to optimize the performance of BFS on reconfigurable architectures such as field-programmable gate arrays (FPGAs). However, the optimization strategies for BFS are not necessarily applicable to all types of graphs. Moreover, the efficacy of such optimizations oftentimes depends on the sparsity of input graphs. To that end, this work presents optimization strategies for graphs with varying levels of sparsity. Furthermore, this work shows that by tailoring the BFS design based on the sparsity of the input graph, significant performance improvements are obtained over the state-of-the-art BFS implementations on an FPGA. High performance computing Reconfigurable computing Graph traversal Field-programmable gate arrays
292	f-DSM: An FPGA-Accelerated Distributed Shared Memory for Heterogeneous Instruction-Set-Architecture Hardware VSathish, Naarayanan Rao 03 March 2022 (has links) Due to the diminishing relevance of Moore's Law, traditional multi-core systems are increasingly struggling to meet the computational demands of many emerging workloads. Heterogeneous computing, which involves exploiting higher degrees of parallelism (e.g., GPUs) and application-specific specialization (e.g., FPGAs), is increasingly used to meet this demand. An important architectural trend in this space involves using instruction-set-architecture (ISA) heterogeneity. An exemplar case is emerging I/O devices that include CPU cores with ISAs (e.g., ARM, RISC-V) that differ from that of host CPUs (e.g., x86) and have physically discrete memory. Shared-memory programming of such systems requires the Dis- tributed Shared Memory (DSM) abstraction. Software DSM incurs significant OS overhead for maintaining memory coherency. Despite outperforming software predecessors, hardware DSM and cache-coherent interfaces require custom chips and lack the flexibility to experiment with different DSM consistency protocols. This thesis presents fDSM, an FPGA-accelerated DSM framework for ISA-heterogeneous hardware. fDSM implements a high-speed messaging layer to enable inter-node communication across ISA-different CPU cores and a DSM protocol processor that maintains virtual memory coherency using a multiple-reader single- writer DSM algorithm. Experimental studies reveal that fDSM outperforms prior art, including Popcorn Linux's software DSM abstraction, which uses TCP-IP and state-of-the-art Infiniband RDMA messaging layers by 2.8X and 7%, respectively. fDSM also provides reconfigurability and thereby allows implementation and experimentation of different memory consistency models. / Master of Science / Moore's Law predicts that the number of transistors in a chip will double approximately every two years. Chip vendors are increasingly observing that this law is nearing its limit when transistor sizes are shrunk to 5nm and 3nm due to power consumption and heat dissipation issues. As a result, innovation in new computing architectures has increasingly focused on heterogeneity, i.e., the use of hardware performance accelerators like graphic processors and reconfigurable logic used in confluence with a computer's CPU (host). To improve the programmability of these architectures, which usually have physically separate memory, the shared-memory programming model is usually used to provide coherent virtual memory. The shared memory model, when applied to such distributed systems, called distributed shared memory (or DSM), has been previously developed in software as well as in hardware. The former usually suffer from high latency overheads, while the latter often requires custom chips and lack programmability for implementing new memory consistency protocols. This thesis presents fDSM, a reconfigurable distributed shared memory framework that provides coherent shared memory between a host and a smart I/O device such as a SmartNIC. fDSM is implemented in FPGAs, which are increasingly available in hosts and Smart I/O devices at the commodity scale. Our prototype implementation uses ISA-heterogeneous hosts to emulate such an environment. Our experimental evaluation using applications from High- Performance Computing benchmark suites reveal that fDSM yields performance benefits over a state-of-the-art software DSM. Distributed Shared Memory Sequential Consistency Popcorn Linux Field programmable gate arrays Instruction-Set-Architecture
293	Configurable Architecture for System-Level Prototyping of High-Speed Embedded Wireless Communication Systems Subramanian, Visvanathan 02 October 2003 (has links) Broadband wireless technologies have the potential to provide integrated data and multimedia services in several niche areas. There is a growing need to develop high-performance communication systems that can satisfy high-end data processing requirements inherent in these technologies. The speed and complexity of these systems necessitates designers to break away from traditional architectures and design methodologies. A more comprehensive and demanding design and verification process including both hardware and software is required. Field-programmable gate arrays (FPGA) offer an attractive alternative to the low efficiency of Digital Signal Processor (DSP) based systems and low flexibility of Application Specific Integrated Circuits(ASIC). The availability of high-density, high-performance field-programmable gate arrays with several capabilities, like embedded memory and advanced routing, together with the adaptability that they offer make them highly desirable for developing hardware prototypes of communication systems. This thesis describes the development of a configurable architecture and FPGA-based design methodology used in the development of a Local Multipoint Distribution Service (LMDS) gateway for a disaster response network. The design of the gateway posed several challenges due to high data rates (120 Mbits/sec) and adaptive features like variable Forward Error Correction Coding and optional link-level retransmissions. The design decisions and simulation results of the verification process are discussed in detail. Finally, the aspects of testing and integration of the prototype in the overall system are discussed. / Master of Science wireless Field programmable gate arrays Configurable Architecture LMDS Rapid Prototyping Communication System Design
294	Hybrid FPGA and GPP Implementation of IEEE 802.15.4 Physical Layer Jeong, Jeong-O 28 August 2012 (has links) In this thesis, two different cases of hybrid IEEE 802.15.4 PHY (Physical Layer) implementation are explored. The first case is an FPGA implementation of IEEE 802.15.4 PHY on the Xilinx Spartan-3A DSP FPGA of USRP N210. All of the signal processing tasks are performed on the FPGA, while less complex MAC (Media Access Control) layer tasks are performed in GNU Radio on the host. The second case is an implementation of a multi-channel IEEE 802.15.4 receiver. A four-channel channelizer is implemented on the external Virtex 5 FPGA, while the IEEE 802.15.4 receiver is implemented in GNU Radio on the host. The first case demonstrates how spare resources in USRP's FPGA can be used to implement signal processing task while still interfacing with GNU Radio. The second case builds a platform on which a combination of GNU Radio and an external FPGA can be used for signal processing and USRP as an RF source. This thesis lays out the groundwork for more complex wireless protocols to be implemented on any combination of USRP's FPGA, an external FPGA, and GNU Radio. / Master of Science Software radio Field programmable gate arrays ZigBee IEEE 802.15.4 USRP N210
295	VLSI Implementation of a Run-time Reconfigurable Custom Computing Integrated Circuit Musgrove, Mark D. 07 November 1996 (has links) The growth of high performance computing to date can largely be attributed to continuing breakthroughs in materials and manufacturing.In order to increase computing capacity beyond these physical bounds, new computing paradigms must be developed that make more efficient use of existing manufacturing technologies. Custom Computing Machines (CCMs) are an emerging class of computers that offer promising possibilities for future high-performance computational needs. With the increasing popularity of the run-time reconfigurable (RTR) concept in the CCM community, questions have arisen as to what computational device should be at the heart of an RTR platform. Currently the preferred device, and really the only practical device, has been the RAM-based Field-Programmable Gate Array (FPGA). Unfortunately, for applications that require high performance, FPGAs are limited by their narrow data path and small computational density. The Colt integrated circuit has been designed from the start to be the computational processing element in an RTR platform. Colt is an RTR data-flow processor array with a course-grain architecture (16-bit data path). This thesis covers the VLSI implementation and verification of the Colt integrated circuit, including the approach and methods necessary to make a functionally working integrated circuit. / Master of Science VLSI run-time reconfigurable configurable computing data flow Field programmable gate arrays DSP
296	Image Chipping with a Common Architecture for Microsensors (CAuS) Scalera, Jonathan E. 16 August 2001 (has links) Recent interest has emerged in microsensor platforms that are capable of supporting reconnaissance, surveillance and target acquisition operations. These devices typically consist of one or more sensors, signal conditioning and processing subsystems, a radio link and a power source. Sensors employed can range from acoustic, to seismic, to magnetic, to visible/infrared imagers. A notable shortcoming of these systems is the fact that they are battery powered. The use of a finite power source places an upper limit on the lifespan of such a system. Thus, a major thrust in the development and usage of these microsensor platforms lies in the conservation of their limited energy resources. In attempt to reduce power consumption and hence extend the system's lifespan, communication bandwidths are often limited. In order to reduce the required bandwidth, much of the signal processing necessary to achieve a desired functionality must be performed within the microsensor platform itself. This thesis effort provides this crucial bandwidth reduction by implementing in hardware an algorithm developed by the University of Maryland, which limits transmissions to the best view Regions-of-Interest (ROI) data, on the CAuS platform by BAE Systems. The hardware implementation was verified with a Matlab script that compared its results with those of the original algorithm. It was shown that these implementations were consistent for all of the data sets tested. Moreover, a subjective analysis, in which the detected ROIs were visually inspected, was performed to corroborate the former quantitative results. / Master of Science Microsensors CAuS Image Processing Field programmable gate arrays Region of Interest (ROI)
297	An 8 GHz Ultra Wideband Transceiver Testbed Agarwal, Deepak 06 December 2005 (has links) Software defined radios have the potential of changing the fundamental usage model of wireless communications devices, but the capabilities of these transceivers are often limited by the speed of the underlying processors and FPGAs. This thesis presents the digital design for an impulse-based ultra wideband communication system capable of supporting raw data rates of up to 100 MB/s. The transceiver is being developed using software/reconfigurable radio concepts and will be implemented using commercially available off-the-shelf components. The receiver uses eight 1 GHz ADCs to perform time interleaved sampling at an aggregate rate of 8 Gsamples/s. The high sampling rates present extraordinary demands on the down-conversion resources. Samples are captured by the high-speed ADC and processed using a Xilinx Virtex-II Pro (XC2VP70) FPGA. The testbed has two components: a non real-time part for data capture and signal acquisition, and a real-time part for data demodulation and signal processing. The overall objective is to demonstrate a testbed that will allow researchers to evaluate different UWB modulation, multiple access, and coding schemes. As proof-of-concept, a scaled down prototype receiver which utilized 2 ADCs and a Xilinx Virtex-II Pro (XC2VP30) FPGA was fabricated and tested. / Master of Science high-speed datapath software radio Field programmable gate arrays ultra-wideband
298	An FPGA Software-Defined Ultra Wideband Transceiver Blanton, Matthew Bruce 25 September 2006 (has links) Increasing interest in ultra-wideband (UWB) communications has engendered the need for a test bed for UWB systems. An FPGA-based software-defined radio provides both post-fabrication definition of the radio and ample parallel processing power. This thesis presents the FPGA design for a software-defined radio targeted to impulse ultra-wideband signals. The system is capable of an effective sampling frequency of up to 8 G-samples/s using time interleaved sampling with eight 1-GHz ADCs. The system is also capable of transmitting UWB pulses using a transmitter board controlled by the FPGA. In this thesis, the FPGA design used to capture and export data from the eight ADCs is presented, along with two systems which make use of the transceiver: a pilot-based matched filter communications system, and a remote vital signs monitor. / Master of Science ultra-wideband Field programmable gate arrays high-speed software defined radio
299	Design and Implementation of an FPGA-based Soft-Radio Receiver Utilizing Adaptive Tracking Davies, John Clay IV 14 September 2000 (has links) The wireless market of the future will demand inexpensive hardware, expandability, interoperability, and the implementation of advanced signal processing functions--i.e. a software radio. Configurable computing machines are often ideal software radio platforms. In particular, the Stallion reconfigurable processor's efficient hardware reuse and scalability fulfill these radios' demands. The advantages of Stallion-based design inspired an FPGA-based software radio - the proto-Stallion receiver. This thesis introduces the proto-Stallion architecture and details its implementation on the SLAAC-1V FPGA platform. Although this thesis presents a specific radio implementation, this architecture is flexible; it can support a variety of applications within its fixed framework. This implemented single-user DS-CDMA receiver utilizes an LMS adaptive filter that can combat MAI and constructively combine multipath; most notably, this receiver employs an adaptive tracking algorithm that harnesses the LMS algorithm to maintain symbol synchronization. The proto-Stallion receiver demonstrates the dependence of adaptive tracking on channel noise; the algorithm requires significant noise levels to maintain synchronization. / Master of Science CDMA Software Radio Adaptive Tracking Adaptive Receiver Configurable Computing Field programmable gate arrays
300	Design and Analysis of Four Architectures for FPGA-Based Cellular Computing Morgan, Kenneth J. 09 November 2004 (has links) The computational abilities of today's parallel supercomputers are often quite impressive, but these machines can be impractical for some researchers due to prohibitive costs and limited availability. These researchers might be better served by a more personal solution such as a "hardware acceleration" peripheral for a PC. FPGAs are the ideal device for the task: their configurability allows a problem to be translated directly into hardware, and their reconfigurability allows the same chip to be reprogrammed for a different problem. Efficient FPGA computation of parallel problems calls for cellular computing, which uses an array of independent, locally connected processing elements, or cells, that compute a problem in parallel. The architecture of the computing cells determines the performance of the FPGA-based computer in terms of the cell density possible and the speedup over conventional single-processor computation. This thesis presents the design and performance results of four computing-cell architectures. MULTIPLE performs all operations in one cycle, which takes the least amount of time but requires the most chip area. BIT performs all operations bit-serially, which takes a long time but allows a large cell density. The two other architectures, SINGLE and BOOTH, lie within these two extremes of the area/time spectrum. The performance results show that MULTIPLE provides the greatest speedup over common calculation software, but its usefulness is limited by its small cell density. Thus, the best architecture for a particular problem depends on the number of computing cells required. The results also show that with further research, next-generation FPGAs can be expected to accelerate single-processor computations as much as 22,000 times. / Master of Science Booth Algorithm Field programmable gate arrays Single-Chip Computer Cellular Computing Bit-Serial Parallel Computer

Search results