1

Exploring memristive nano-scale memory and logic architectures

Yang, Yuanfan January 2016 (has links)
Resistive memory, also known as the memristor, is an emerging potential successor to traditional CMOS charge-based memories. The memristor is a two-terminal non-volatile device that stores data as a resistance value; it offers low power and high packing density, and is applicable to memory systems including read-only memory and flash memory. However, issues such as sneak-path currents limit the performance of memristor-based crossbar memories. Memristors have also recently been proposed as a promising candidate for future logic and reconfigurable computing. To this end, this thesis presents different techniques for memristor-based design of logic and memory, and makes three key contributions.

The first is a Verilog-A based effective complementary resistive switch (CRS) model for simulation and analysis. Our proposed model captures the desired non-linear characteristics using voltage-based state control, in contrast to previous current-based state control. We demonstrate that such state control gives our CRS-based crossbar arrays symmetric ON/OFF voltages and significantly reduced sneak-path currents with high noise margin, compared with traditional memristor-based architectures. To validate the effectiveness of the Verilog-A model, we carry out extensive simulations and analysis for different crossbar array architectures using traditional EDA tools. Based on the proposed CRS model, we propose a novel crossbar memory scheme that uses a configuration row of cells to assist read/write (R/W) operations. The proposed write scheme minimizes overall power consumption compared with previously proposed write schemes and mitigates the state-drift problem. We also propose two read schemes, namely assisted-restoring and self-resetting read: the assisted-restoring scheme reuses the configuration cells of the write scheme, whereas the self-resetting scheme adds circuitry to address the problem of the destructive read. Moreover, by formulating an analytical model of R/W operation, we compare the various schemes. The overhead of the proposed assisted-restoring write/read scheme is one extra redundant row for the given crossbar array; for a typical array size of 200 x 200 the area overhead is about 0.5%, while power consumption improves 4x over previously proposed write schemes. Quantitative analysis of the proposed scheme is presented using simulation and analytical models.

Secondly, we present a set of CRS-based stateful logic operations that use material implication to provide the basic logic functionality needed to realize logic circuits. The proposed solution benefits from an exponential reduction in sneak-path current in crossbar-implemented logic. We also present a closed-form expression for the sneak current and analyse the impact of device variation on the behaviour of the proposed logic blocks. Compared with other techniques proposed in the literature, our technique performs the computation in several sequential steps. We validate the effectiveness of our solution through the Cadence Spectre circuit simulator on a number of logic circuits, and extend the approach to arithmetic circuits with an 8-bit adder and a 4-bit multiplier.

Finally, we present a 2-transistor-2-memristor (2T2M) bit cell for memristive ternary CAM (MTCAM) design, and novel full- and partial-match associative memories suitable for low-power applications. Our proposed circuit consists of memristors to store data and transistors as access devices, and utilizes complementary logic values at the input. The low-power MTCAM splits the search lines to search logic 1 and logic 0 separately, reducing search power consumption. For the associative memories, the equivalent resistance of a search cell is a constant high value regardless of match or mismatch patterns, so the search power is further reduced. Moreover, a current-mirror structure is added in the partial-match design to mitigate the impact of process variations, improving the sense margin.
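To make the switching behaviour concrete, here is a highly simplified behavioural sketch (not the thesis's Verilog-A model) of a voltage-controlled CRS cell in Python; the threshold voltages and state encoding are illustrative assumptions:

    # Simplified CRS cell with voltage-based state control.
    # V_TH1/V_TH2 are assumed switching thresholds, not fitted device values.
    V_TH1, V_TH2 = 1.0, 2.0  # volts

    class CRSCell:
        """States: '0' (LRS/HRS), '1' (HRS/LRS), 'ON' (LRS/LRS)."""
        def __init__(self, state="0"):
            self.state = state

        def apply(self, v):
            if v >= V_TH2:
                self.state = "1"      # full positive write
            elif v <= -V_TH2:
                self.state = "0"      # full negative write (symmetric)
            elif V_TH1 <= v < V_TH2 and self.state == "1":
                self.state = "ON"     # destructive read: a stored '1' collapses
                                      # to ON and draws a detectable current spike

        def is_high_resistance(self):
            # Both stored logic states look high-resistive at small bias,
            # which is what suppresses crossbar sneak-path currents.
            return self.state in ("0", "1")

The destructive read visible in this sketch is exactly why the assisted-restoring and self-resetting read schemes above are needed.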
2

A storage system for use with binary digital computing machines

Kilburn, T. January 1948 (has links)
The requirement for digital computing machines of large storage capacity has led to the development of a storage system in which digits are represented by charge patterns on the screen of a cathode ray tube. Initial tests were confined to commercial cathode ray tubes and led to plans for the development of special tubes; this development is only partially complete. Short-term memory of the order of 0.2 seconds is provided by the insulating properties of the screen material. Long-term memory is obtained by regenerating the charge pattern at a frequency greater than 5 cycles/second. The regeneration makes accurate stabilisation of the position of the charge pattern on the cathode ray tube unnecessary. The properties required of a storage system and its operation as part of a machine are explained with reference to a much simplified and hypothetical machine.
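As a toy numerical illustration of the regeneration principle, the Python sketch below (with an assumed exponential charge-decay model; the time constant and read threshold are illustrative, not measured values) checks whether a charge spot remains readable at a given refresh rate:

    import math

    TAU = 0.2        # assumed charge-retention time constant, seconds
    THRESHOLD = 0.35 # assumed fraction of full charge still distinguishable

    def readable(refresh_hz):
        """True if charge just before each regeneration is still above threshold."""
        interval = 1.0 / refresh_hz
        return math.exp(-interval / TAU) >= THRESHOLD

    print(readable(5.0))   # True, but marginal: regeneration must exceed ~5 Hz
    print(readable(2.0))   # False: the pattern fades between refreshes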
3

Numerical processing of time-varying holographic information

Parry, Philip January 1974 (has links)
In optical imaging systems, waves diffracted by an object may be focused into an image by the eye. In most situations where non-optical wavelengths are used, the diffracted waves must be measured and converted to a visible representation of the object. Since the measurements are made at discrete points, a large number are normally required for good resolution and image aperture. It is desirable, therefore, to decrease the number of points, while retaining the image quality, so that the cost of the measurements may be reduced. A method is described which achieves this. If continuous wave illumination is limited to a finite number of cycles, the wave-front created by an object becomes time dependent at the sample points. By taking the samples at selected times, a set of values can be obtained which relates to a common volume in the object space. This volume will be only a small portion of the space and thus requires measurements from relatively few sample points in order to describe it. A complete reconstruction can thus be obtained by moving the common volume through the object space using sets of samples from the same detector points but taken at different selected times. The required numerical manipulation of wave information is illustrated by the generation of a kinoform. This is a physical device from which an optical image can be obtained using coherent light illumination and is constructed using information calculated from a digital description of an object. Similar mathematical techniques can be used for the reverse process of computing reconstructions from continuous wave data. The suggested method has been applied to simulated wave data. There is a large reduction in the number of sample positions required to obtain reconstructions of similar quality to those of the continuous wave method.
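The kinoform step lends itself to a compact numerical sketch. The following Python fragment (the array size and the single-FFT propagation model are simplifying assumptions) computes a phase-only record from a digital object description and reconstructs it under simulated coherent illumination:

    import numpy as np

    # Digital description of a simple object: a bright square on a dark field.
    obj = np.zeros((256, 256), dtype=complex)
    obj[96:160, 96:160] = 1.0

    field = np.fft.fftshift(np.fft.fft2(obj))  # far-field diffracted wave-front
    kinoform = np.angle(field)                 # keep only the phase, in [-pi, pi]

    # Coherent-light reconstruction: illuminate the phase record and invert.
    # Amplitude information is discarded, which is the kinoform's approximation.
    recon = np.abs(np.fft.ifft2(np.fft.ifftshift(np.exp(1j * kinoform))))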
4

Handling data dependent memory accesses in custom hardware applications

Ang, Su-Shin January 2009 (has links)
No description available.
5

Software management of hybrid main memory systems

Hassan, Ahmad January 2016 (has links)
Power and energy efficiency have become major concerns for modern computing systems. Main memory is a key energy consumer and a critical component of system design. Dynamic Random Access Memory (DRAM) is the de-facto technology for main memory in modern computing systems. However, DRAM is unlikely to scale beyond 22 nm, which restricts the amount of main memory available to a system. Moreover, DRAM consumes significant static energy in both active and idle states due to continuous leakage and refresh power. Non-Volatile Memory (NVM) technology is emerging as a compelling main memory technology due to its high density and low leakage power. Current NVM devices have higher read and write access latencies than DRAM; unlike DRAM, NVM is characterized by asymmetric read and write latencies, with writes suffering more than reads. Moreover, NVM suffers from higher dynamic access energy and reduced durability compared with DRAM. This dissertation proposes to leverage a hybrid memory architecture, consisting of both DRAM and NVM, with the aim of reducing energy. Application-level data management policies are proposed that decide whether to place data on DRAM or NVM. With careful data placement, the hybrid memory exhibits the latency and dynamic energy of DRAM in the common case, while rarely exposing the latency and high dynamic energy of NVM. Moreover, main memory capacity is increased by NVM without expending the static energy of DRAM.
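A minimal sketch of such an application-level placement decision, in Python (the access-rate thresholds are hypothetical; the dissertation's actual policies are not reproduced here):

    # Hot, write-intensive objects go to DRAM; cold, read-mostly data to NVM.
    WRITE_HOT = 1_000   # assumed writes/interval above which NVM wear and write latency hurt
    READ_HOT = 10_000   # assumed reads/interval above which NVM read latency hurts

    def place(reads, writes):
        """Return the memory tier for an object, given its observed access counts."""
        if writes > WRITE_HOT:
            return "DRAM"   # NVM writes are slow, energy-hungry and wear the device
        if reads > READ_HOT:
            return "DRAM"   # keep latency-critical data in the fast tier
        return "NVM"        # dense, low-leakage capacity tier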
6

Techniques for ubiquitous reliable data storage

Gilroy, Michael January 2007 (has links)
This portfolio thesis documents the work undertaken by the author under the auspices of the Engineering Doctorate (EngD) programme. The research work was completed at the sponsoring company, A2E Limited. There is a wide range of products and solutions to meet both commercial and personal electronic storage needs. This work documents research and development over a four-year period into algorithms, implementations and product development for novel storage solutions for commercial and personal use; the design work and the technical and business objectives were guided by the sponsoring company. This portfolio thesis considers the storage market at the start of the project and the commercial and technical aspects relevant to that marketplace, and describes the development and testing of a RAID 6 algorithm in both hardware and software. The key contributions presented include the implementation of the smallest and fastest FPGA-based Reed-Solomon RAID 6 hardware accelerator, and the first commercial implementation of a Reed-Solomon RAID 6 intellectual property (IP) block. Both the hardware and software implementations are discussed in detail, along with the supporting IP blocks and device drivers. The product development stages, and the additional project work carried out to understand those stages and the requirements they impose, are also documented. The results of testing and implementation are considered, and the performance of the proposed solution is assessed along with the commercial viability and success of the project.
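For reference, the arithmetic behind Reed-Solomon RAID 6 is compact enough to sketch in software. The fragment below (Python; it follows the common GF(2^8) convention with field polynomial 0x11D and generator 2, and is an illustrative model rather than the thesis's FPGA implementation) computes the P and Q parity for a stripe:

    def gf_mul(a, b):
        """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11D)."""
        p = 0
        for _ in range(8):
            if b & 1:
                p ^= a
            b >>= 1
            a <<= 1
            if a & 0x100:
                a ^= 0x11D
        return p

    def raid6_parity(disks):
        """disks: equal-length byte strings, one per data disk. Returns (P, Q)."""
        n = len(disks[0])
        P, Q = bytearray(n), bytearray(n)
        for i, disk in enumerate(disks):
            g = 1
            for _ in range(i):              # g = 2**i in GF(2^8)
                g = gf_mul(g, 2)
            for j, byte in enumerate(disk):
                P[j] ^= byte                # P: plain XOR parity
                Q[j] ^= gf_mul(g, byte)     # Q: weighted GF(2^8) parity
        return bytes(P), bytes(Q)

    P, Q = raid6_parity([b"\x11\x22", b"\x33\x44", b"\x55\x66"])

With both P and Q available, any two simultaneous disk failures in the array can be recovered, which is the property the hardware accelerator implements at line speed.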
7

Workload-adaptation in memory controllers

Ghasempour, Mohsen January 2015 (has links)
Advances in processor design, the increasing heterogeneity of computer systems (which now involve Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs) and custom accelerators), and the growing number of cores and threads all put ever more pressure on main memory, demanding higher performance. In modern computer systems, main memory is generally Dynamic Random Access Memory (DRAM), which consists of a multi-level access hierarchy (e.g. rank, bank, row). This heterogeneity of structure implies different access latencies (and power consumption), resulting in performance differences according to memory access patterns. DRAM controllers manage access and satisfy the timing constraints, and now employ complex scheduling and prediction algorithms to mitigate the effect on performance. This complexity can limit the scalability of a controller with the size of memory while maintaining performance. The focus of this PhD thesis is to improve the performance, reliability and scalability (with respect to memory size) of DRAM controllers. To this end, it covers three significant contributors to the performance and reliability of a memory controller: 'address mapping', 'page closure policies' and 'reliability monitoring'. A detailed DRAM simulator is used as the evaluation platform throughout this work. The following contributions are presented in this thesis.

Hybrid Address-based Page PolicY (HAPPY): memory controllers have used static page-closure policies to decide whether a row should be left open (open-page policy) or closed immediately (close-page policy) after use; the appropriate choice can reduce the average memory latency. Since access patterns are dynamic, static page policies cannot guarantee optimum execution time. Hybrid page policies, now implemented in state-of-the-art processors, cover dynamic scenarios by switching between open-page and close-page policies, monitoring the pattern of row hits/conflicts and predicting future behaviour. Unfortunately, as the size of DRAM memory increases, fine-grain tracking and analysis of accesses does not remain practical. HAPPY proposes a compact, memory-address-based encoding technique which can maintain or improve page-closure predictor performance while reducing the hardware overhead. As a case study, HAPPY is integrated with a state-of-the-art monitor (the Intel-adaptive open-page policy predictor employed by the Intel Xeon X5650) and with a traditional hybrid page policy. The experimental results show that the HAPPY encoding applied to the Intel-adaptive page-closure policy can reduce the hardware overhead by 5x for the evaluated 64 GB memory (up to 40x for a 512 GB memory) while maintaining prediction accuracy.

Dynamic Re-arrangement of Address Mapping (DReAM): the initial location of data in DRAM is determined by the address mapping, and even modern memory controllers use a fixed, runtime-agnostic address-mapping scheme. The access pattern seen at the memory interface, however, changes dynamically at run-time. This mismatch between the dynamic access pattern and the fixed address mapping means that DRAM performance cannot be exploited efficiently. DReAM is a novel hardware technique that detects a workload-specific address mapping at run-time based on the application's access pattern. The experimental results show that DReAM outperforms the best evaluated baseline address mapping by 5% on average, and by up to 28%, across all the workloads.

A Run-time Memory hot-row detectOR (ARMOR): DRAM needs refreshing to avoid data loss, but data can also be corrupted within a refresh interval by crosstalk caused by repeated accesses to neighbouring rows; this is the row hammer effect, which is perceived as a potentially serious reliability and security threat. ARMOR is a novel technique which improves memory reliability by detecting, within the memory controller, which rows are potentially being "hammered", so that extra refresh operations can be inserted. It can detect (and thus prevent) row hammer errors with minimal execution-time overhead and hardware requirements. Alternatively, by adding buffers inside the memory controller to cache the hammered rows, execution times are reduced at a small hardware cost. The ARMOR technique is now the basis of a patent application and is being prepared for commercial exploitation.

As a final step of this PhD thesis, an adaptive memory controller was developed, integrating HAPPY, DReAM and ARMOR into a standard memory controller. Its performance and implementation cost were compared against a state-of-the-art memory controller as a baseline. The experimental results show that the adaptive memory controller outperforms the baseline by 18% on average, and by up to 35% for some workloads, while requiring around 6 KB to 900 KB more storage than the baseline to support a wide range of memory sizes (from 4 GB up to 512 GB).
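A simplified counter-based hot-row detector in the spirit of ARMOR is sketched below in Python; the activation threshold and the per-refresh-window reset are illustrative assumptions, not the thesis's actual detection algorithm:

    from collections import Counter

    HAMMER_THRESHOLD = 50_000   # assumed activations per 64 ms refresh window

    class HotRowDetector:
        def __init__(self):
            self.activations = Counter()

        def on_activate(self, row):
            self.activations[row] += 1
            if self.activations[row] >= HAMMER_THRESHOLD:
                self.refresh_neighbours(row)   # protect the victim rows
                self.activations[row] = 0

        def on_refresh_window_end(self):
            self.activations.clear()           # counts are per refresh interval

        def refresh_neighbours(self, row):
            # The controller would insert extra refreshes (or serve the rows
            # from a buffer) for the neighbours of the hammered aggressor row.
            for victim in (row - 1, row + 1):
                print(f"extra refresh issued for row {victim}")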
8

Optimizing cache utilization in modern cache hierarchies

Huang, Cheng-Chieh January 2016 (has links)
The memory wall is one of the major performance bottlenecks in modern computer systems. SRAM caches have been used successfully to bridge the performance gap between the processor and memory. However, an SRAM cache's latency increases with its size, so simply increasing the size of a cache could have a negative impact on performance. To solve this problem, modern processors employ multiple levels of caches, each of a different size, forming the so-called memory hierarchy. Upon a miss at one level, the processor looks up the data at the next level, from the highest (the L1 cache) down to the lowest (main memory). Such a design effectively reduces the negative performance impact of simply using one large cache. However, because SRAM has lower storage density than other volatile storage technologies, the size of an SRAM cache is restricted by the available on-chip area. With modern applications requiring more and more memory, researchers continue to look at techniques for increasing the effective cache capacity, approaching the problem from two angles: maximizing the utilization of current SRAM caches, or exploiting new technology to support larger capacities in the cache hierarchy.

The first part of this thesis focuses on maximizing the utilization of existing SRAM caches. In our first work, we observe that not all words belonging to a cache block are accessed around the same time; in fact, a subset of words is consistently accessed sooner than others. We call these critical words. In our study, we found that the critical words can be predicted using the access footprint. Based on this observation, we propose the critical-words-only cache (co-cache). Unlike a conventional cache, which stores all the words that belong to a block, the co-cache only stores the words predicted to be critical. In this work, we convert an L2 cache into a co-cache and use the L1's access-footprint information to predict critical words. Our experiments show that the co-cache can outperform a conventional L2 cache on workloads whose working-set sizes are greater than the L2 cache size. To handle workloads whose working sets fit in the conventional L2, we propose the adaptive co-cache (aco-cache), which allows the co-cache to be configured back into a conventional cache.

The second part of this thesis focuses on efficiently enabling a large-capacity on-chip cache. In the near future, 3D stacking technology will allow one or more DRAM chips to be stacked onto the processor, with a total size expected to be on the order of hundreds of megabytes or even a few gigabytes. Recent works have proposed to use this space as an on-chip DRAM cache. However, the tags of the DRAM cache create a classic space/time trade-off. On the one hand, we would like the latency of a tag access to be small, as it contributes to both hit and miss latencies; accordingly, we would like to store the tags in a faster medium such as SRAM. However, with hundreds of megabytes of die-stacked DRAM cache, the space overhead of the tags would be huge. For example, it would cost around 12 MB of SRAM to store all the tags of a 256 MB DRAM cache (with conventional 64 B blocks). Clearly this is too large, considering that some current chip multiprocessors have an L3 that is smaller. Prior works have proposed storing the tags along with the data in the stacked DRAM array (tags-in-DRAM); however, this scheme increases the access latency of the DRAM cache.

To optimize access latency in the DRAM cache, we propose the aggressive tag cache (ATCache). Like a conventional cache, the ATCache caches recently accessed tags to exploit temporal locality; it exploits spatial locality by prefetching tags from nearby cache sets. In addition, we address the high miss latency and cache pollution caused by excessive prefetching. To reduce this overhead, we propose cost-effective prefetching, a combination of dynamic prefetching-granularity tuning and hit-prefetching, to throttle the number of sets prefetched. Our proposed ATCache, which consumes 0.4% of the overall tag size, can satisfy over 60% of DRAM cache tag accesses on average.

The last work proposed in this thesis is a DRAM-Cache-Aware (DCA) DRAM controller. Here we address the challenge of scheduling requests in the DRAM cache. While many recent works build on a tags-in-DRAM scheme, storing the tags in the DRAM array increases the complexity of a DRAM cache request: in contrast to a conventional request to DRAM main memory, a request to the DRAM cache now translates into multiple DRAM cache accesses (tag and data). We address the challenge of scheduling these accesses. We start by exploring whether a conventional DRAM controller works well in this scenario, introducing two potential designs and studying their limitations. From this study, we derive a set of design principles that an ideal DRAM cache controller must satisfy, and then propose a DRAM-cache-aware (DCA) DRAM controller based on these principles. Our experimental results show that DCA can outperform the baseline by over 14%.
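The tag-overhead figure quoted above is easy to check with a back-of-the-envelope calculation in Python (the 24 tag/state bits per block are an assumption chosen to reproduce the ~12 MB estimate):

    CACHE_BYTES = 256 * 2**20      # 256 MB die-stacked DRAM cache
    BLOCK_BYTES = 64               # conventional block size
    TAG_BITS = 24                  # assumed tag + state bits per block

    blocks = CACHE_BYTES // BLOCK_BYTES        # 4,194,304 blocks
    tag_bytes = blocks * TAG_BITS // 8
    print(f"{blocks} blocks -> {tag_bytes / 2**20:.0f} MB of SRAM for tags")  # 12 MB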
9

Matrix factorization framework for simultaneous data (co-)clustering and embedding

Allab, Kais 15 November 2016 (has links)
Advances in computer technology and recent advances in sensing and storage technology have created many high-volume, high-dimensional data sets. This increase in both the volume and the variety of data calls for advances in methodology to understand, process, summarize and extract information from data of this kind. From a more technical point of view, understanding the structure of large data sets arising from the data explosion is of fundamental importance in data mining and machine learning. Unlike supervised learning, unsupervised learning can provide generic tools for analyzing and summarizing these data sets when there is no well-defined notion of classes.

In this thesis, we focus on three important techniques of unsupervised learning for data analysis, namely dimensionality reduction, data clustering and data co-clustering. Our major contribution is a novel way to consider clustering (resp. co-clustering) and dimensionality reduction simultaneously. The main idea is to consider an objective function that can be decomposed into two terms, one of which performs the dimensionality reduction while the other returns the clustering (resp. co-clustering) of the data in the projected space. Building on matrix factorization, we propose a solution that takes both objectives into account simultaneously. We further introduce regularized versions of our approaches with graph Laplacian embedding in order to better preserve the local geometry of the data. Experimental results on synthetic as well as real data demonstrate that the proposed algorithms provide good low-dimensional representations of the data while improving the clustering (resp. co-clustering) results. Motivated by the good results obtained by graph-regularized clustering (resp. co-clustering) methods, we developed a new algorithm based on multi-manifold learning: we approximate the intrinsic manifold using a subset of candidate manifolds that better reflect the local geometrical structure, making use of the graph Laplacian matrices. Finally, we investigated the integration of selected instance-level constraints into the graph Laplacians of both data samples and data features, showing how the addition of prior knowledge can assist data co-clustering and improve the quality of the obtained co-clusters.
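Schematically, and under assumed notation (X the n x d data matrix, Z an n x k embedding, Q a d x k projection with orthonormal columns, G an n x c cluster indicator, F a c x k matrix of centroids in the embedded space, L a graph Laplacian), an objective of the decomposable form described above can be written as:

    \min_{Q,\,Z,\,G,\,F}\;
    \underbrace{\lVert X - Z Q^{\top} \rVert_F^{2}}_{\text{dimensionality reduction}}
    \;+\;\lambda\,
    \underbrace{\lVert Z - G F \rVert_F^{2}}_{\text{clustering in the embedded space}}
    \;+\;\mu\,\operatorname{Tr}\!\left( Z^{\top} L Z \right),
    \qquad Q^{\top} Q = I,

where lambda and mu trade off the clustering term and the Laplacian regularizer; the co-clustering variant adds a symmetric indicator term and a Laplacian over the feature space. This is a schematic reading of the abstract, not the thesis's exact formulation.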
10

Simulation of the PiSMA architecture in the SIMICS simulation environment

Ροδάς, Κωνσταντίνος 18 September 2007 (has links)
In this thesis we implemented the parallel PiSMA architecture on the Simics simulator. The PiSMA architecture forms an expandable toroidal grid with alternating processors and memories, so that each processor is connected to four memories and each memory to four processors. This structure enables every processor to communicate through the shared memories with its eight adjacent processors; communication between remote processors is performed by message passing between the memories. The PiSMA architecture is scalable to an arbitrary number of processors and memories, and its main advantage appears in applications with high locality: such applications can be divided into independent granules and mapped onto as many processors as the grid allows, so that data are processed by more processors simultaneously while the overhead of transferring data through the memories is reduced. Simics is an efficient, instrumented, system-level instruction set simulator; it implements many tools and gives the user extensive capabilities for system simulation. This thesis also gives an inclusive presentation of the Simics simulator: the tools and classes it offers for implementing simulations, how it operates, and which systems have been simulated and added to the environment so far.
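A small Python sketch of the checkerboard interconnect (the torus dimension is an illustrative assumption) makes the connectivity explicit:

    N = 8   # assumed (even) torus dimension

    def is_processor(i, j):
        return (i + j) % 2 == 0       # one chessboard colour holds processors,
                                      # the other holds memories

    def attached_memories(i, j):
        """The four memories directly wired to the processor at (i, j)."""
        return [((i + di) % N, (j + dj) % N)
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))]

    def neighbour_processors(i, j):
        """The eight processors reachable through a shared memory."""
        offsets = ((-1, -1), (-1, 1), (1, -1), (1, 1),   # share two memories
                   (-2, 0), (2, 0), (0, -2), (0, 2))     # share one memory
        return [((i + di) % N, (j + dj) % N) for di, dj in offsets]

Any pair of processors outside this eight-neighbour set communicates by message passing between memories, as described above.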
