391 |
Développement d’une méthode numérique pour les équations de Navier-Stokes en approximation anélastique : application aux instabilités de Rayleigh-Taylor / Developpement of a numerical method for Navier-Stokes equations in anelastic approximation : application to Rayleigh-Taylor instabilities
Hammouch, Zohra 30 May 2012
The so-called "anelastic" approximation filters out acoustic waves through an asymptotic expansion of the Navier-Stokes equations, which allows a larger average time step in numerical simulations of developing hydrodynamic instabilities. The anelastic equations are established here for a mixture of two fluids in the case of the Rayleigh-Taylor instability. The linear stability of the flow is studied, for the first time, for perfect fluids within the anelastic approximation, using the normal-mode method. The Stokes problem obtained from the Navier-Stokes equations without the nonlinear terms (part of the buoyancy is retained) is defined; its ellipticity is demonstrated, and the eigenmodes and the invariance related to the pressure are detailed. The Uzawa method is extended to the anelastic approximation, highlighting the decoupling of the velocity components in 3D, the particular case k = 0, and the spurious pressure modes. The extension to a multidomain setting made it possible to establish the transmission conditions (C0 matching of the pressure, without physical boundary conditions). The algorithms and their implementation in the AMENOPHIS code are validated by comparing the Uzawa operator developed in Fortran with one built in Mathematica. Numerical results are also compared with an experiment on incompressible fluids. Finally, numerical solutions obtained with the anelastic and compressible options are compared, and a study of how the initial stratification of the two fluids influences the development of the Rayleigh-Taylor instability is initiated.
|
392 |
SIMD-aware word length optimization for floating-point to fixed-point conversion targeting embedded processors / Optimisation SIMD de la largeur des mots pour la conversion de virgule flottante en virgule fixe pour des processeurs embarqués
El Moussawi, Ali Hassan 16 December 2016
To cut their cost and/or power consumption, many embedded processors do not provide hardware support for floating-point arithmetic. However, applications in many domains, such as signal processing, are generally specified using floating-point arithmetic for the sake of simplicity. Porting these applications to such embedded processors requires software emulation of floating-point arithmetic, which can greatly degrade performance. To avoid this, the application is converted to use fixed-point arithmetic instead, which is more efficient to implement on integer units. Floating-point to fixed-point conversion is a delicate procedure that involves a subtle tradeoff between performance and precision; it enables the use of narrower data word lengths at the cost of degraded computation accuracy. Moreover, most embedded processors provide SIMD (Single Instruction, Multiple Data) support as a means of improving performance: one operation executes on multiple data elements in parallel, which reduces execution time. However, the application usually has to be transformed to take advantage of the SIMD instruction set. This transformation, known as Simdization, is affected by the data word lengths: narrower word lengths enable a higher SIMD parallelism rate. Hence there is a tradeoff between precision and Simdization. Much existing work has aimed at providing or improving methodologies for automatic floating-point to fixed-point conversion on the one hand, and Simdization on the other. In the state of the art, the two transformations are considered separately even though they are strongly related. In this context, we study the interactions between these transformations in order to better exploit the performance/accuracy tradeoff. First, we propose an improved algorithm for SLP (Superword Level Parallelism) extraction, a Simdization technique. Then, we propose a new methodology to jointly perform floating-point to fixed-point conversion and SLP extraction. Finally, we implement this work as a fully automated source-to-source compiler flow. Experimental results, targeting four different embedded processors, show that our approach exploits the performance/accuracy tradeoff more efficiently than a typical approach that considers the two transformations independently.
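The word-length tradeoff described in this abstract can be illustrated with a minimal sketch. It is not taken from the thesis: the helper names (`to_fixed`, `from_fixed`, `simd_lanes`, `tradeoff_table`), the 32-bit SIMD register width, and the choice of a format with one sign bit and all remaining bits fractional are assumptions made purely for illustration.

```python
def to_fixed(x, frac_bits, word_bits):
    """Quantize x to a signed fixed-point integer with saturation.

    Uses Python's round(), which rounds to nearest with ties to even.
    """
    scaled = round(x * (1 << frac_bits))
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, scaled))


def from_fixed(q, frac_bits):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)


def simd_lanes(register_bits, word_bits):
    """Number of word_bits-wide elements that fit in one SIMD register."""
    return register_bits // word_bits


def tradeoff_table(x, register_bits=32, word_lengths=(8, 16, 32)):
    """For each word length, return (word_bits, lanes, absolute quantization error)."""
    rows = []
    for word_bits in word_lengths:
        frac_bits = word_bits - 1  # assumed format: 1 sign bit, rest fractional
        q = to_fixed(x, frac_bits, word_bits)
        error = abs(x - from_fixed(q, frac_bits))
        rows.append((word_bits, simd_lanes(register_bits, word_bits), error))
    return rows


if __name__ == "__main__":
    for word_bits, lanes, err in tradeoff_table(0.7071):
        print(f"{word_bits:2d}-bit words: {lanes} lane(s), error {err:.2e}")
```

Under these assumptions, halving the word length doubles the number of elements processed per SIMD operation while increasing the quantization error, which is the coupling between accuracy and Simdization that the abstract refers to.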
|
393 |
Ladění výkonnosti databází / Database Performance Tuning
Paulíček, Martin January 2011
The objective of this thesis was to study the problem of insufficient database processing performance and ways to improve it through database configuration file optimization, more powerful hardware, and parallel processing. The thesis describes relational databases, storage media, and different forms of parallelism and their use in database systems. It also describes the software developed for testing database performance. The program was used to test several database configuration files, various hardware, different database systems (PostgreSQL, Oracle), and the advantages of the parallel method "partitioning". Test reports and evaluation results are described at the end of the thesis.
|
394 |
Realisierung einer Schedulingumgebung für gemischt-parallele Anwendungen und Optimierung von layer-basierten Schedulingalgorithmen
Kunis, Raphael 20 January 2011
One challenge in parallel computing is achieving scalability of large parallel applications across different parallel systems. The central problem is that an application may run very well on one parallel system, yet porting it to another system generally leads to poor results.
Using the programming model of parallel tasks with dependencies can significantly improve scalability for many parallel algorithms. Programming with parallel tasks leads to task graphs with dependencies that represent a parallel application, which is also called a mixed-parallel application. The basis for executing a mixed-parallel application efficiently is a suitable schedule that specifies an efficient mapping of the parallel tasks onto the processors of the parallel system. Scheduling algorithms are used to compute such a schedule.
A central problem in determining a schedule for mixed-parallel applications is that scheduling is already NP-hard for single-processor tasks with dependencies on a parallel system with two processors. Therefore, only approximation algorithms and heuristics exist for computing a schedule. One way to compute a schedule is to use layer-based scheduling algorithms. These algorithms first form layers of independent parallel tasks and then compute the schedule for each layer separately.
A weakness of these scheduling algorithms is the assembly of the individual schedules into the global schedule. The Move-blocks algorithm presented here offers an elegant way to improve this assembly by merging the schedules of consecutive layers.
Although a large number of scheduling algorithms exist for mixed-parallel applications, there has so far been no comprehensive support for scheduling through programming tools. In particular, there is no scheduling environment that combines a large number of scheduling algorithms. The presentation of SEParAT, a flexible, component-based, and extensible scheduling environment, is the second focus of this dissertation. SEParAT supports various usage scenarios that go well beyond pure scheduling, e.g. comparing scheduling algorithms and extending and implementing new scheduling algorithms. In addition to the usage scenarios, both the internal processing of a scheduling run and the component-based software architecture are presented in detail.
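The layer-based idea mentioned in this abstract (first form layers of independent tasks, then schedule each layer separately) can be sketched as follows. This is an illustrative sketch only, not the thesis's algorithm and not SEParAT code: the function names, the example task graph, the fixed runtimes, and the assumption that all tasks of one layer can run concurrently are hypothetical. Move-blocks itself is not implemented here.

```python
def build_layers(deps):
    """Partition a task graph into layers of mutually independent tasks.

    deps maps each task to the set of tasks it depends on; every dependency
    must itself be a key of deps.
    """
    remaining = {task: set(d) for task, d in deps.items()}
    layers = []
    while remaining:
        # A task is ready once all of its dependencies sit in earlier layers.
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        layers.append(ready)
        for task in ready:
            del remaining[task]
        for d in remaining.values():
            d.difference_update(ready)
    return layers


def concatenate_layer_schedules(layers, runtime):
    """Naive global schedule: each layer starts when the previous one has finished.

    Assumes (for illustration) that all tasks of a layer run concurrently.
    Returns (start time per task, makespan).
    """
    start, clock = {}, 0
    for layer in layers:
        for task in layer:
            start[task] = clock
        clock += max(runtime[task] for task in layer)
    return start, clock


if __name__ == "__main__":
    deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}, "E": {"C"}}
    runtime = {"A": 2, "B": 3, "C": 1, "D": 2, "E": 1}
    layers = build_layers(deps)
    print(layers)
    print(concatenate_layer_schedules(layers, runtime))
```

In this hypothetical example, task E depends only on C, which finishes at time 3, yet E starts at time 5 because its whole layer waits for B. Idle time of this kind at layer boundaries is one plausible reading of the weakness in assembling per-layer schedules that the abstract attributes to layer-based algorithms.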
|
395 |
[pt] CHEIO OU VAZIO?: EFEITOS SEMÂNTICOS E SINTÁTICOS NA PRODUÇÃO DO OBJETO DIRETO ANAFÓRICO / [en] FULL OR EMPTY?: SEMANTIC AND SYNTACTIC EFFECTS IN ANAPHORIC DIRECT OBJECT CODING
ROSANE FERNANDES LIRA DE OLIVEIRA 11 November 2021
[en] This thesis investigates the semantic and syntactic factors that affect the encoding
of the anaphoric direct object (ADO) in Brazilian Portuguese (BP). The ADO can
be a full DP [+definite], an accusative clitic, a stressed pronoun, or a null element
(whose nature is controversial in linguistic theory). This research aims to: (i)
investigate how the semantic properties (animacy, specificity and conceptual
gender) of the antecedent, its syntactic function and factors pertaining to the
syntax/semantics interface (thematic role) affect the encoding of the ADO in
different syntactic contexts (simple sentences and syntactic islands) and/or
discourse contexts (answers to WH-questions and continuations of short narratives or
informal conversations); (ii) verify the influence of schooling on the strategies of
ADO encoding; and (iii) discuss the nature of the null forms produced. The
theoretical background incorporates the conception of language conveyed in the
Minimalist Program (CHOMSKY, 1995; 2005) and an approach to issues
regarding the relative accessibility of the antecedent to be resumed (ARIEL,
2001; ARNOLD, 2010; BOCK; WARREN, 1985; SANDERS; GERNSBACHER,
2004) in the grammatical encoding of a sentence (LEVELT, 1989), in the light of
an on-line model of grammatical computation (CORRÊA, 2006; 2008; CORRÊA;
AUGUSTO, 2007). The working hypothesis is that the production of the ADO is a
function of particular processing conditions and that the semantic and syntactic
properties of the antecedent affect its relative accessibility, imposing restrictions
on its resumption. Six elicited production experiments are reported. The syntactic
context influenced the accessibility of the antecedents, which were predominantly
resumed by full DPs across sentences in the discourse, and by minimal forms
(pronominals and null elements) in complex sentences. The effects of animacy
and of specificity corroborate spontaneous production data, suggesting that the
stressed pronoun is the default option for [+animate; +specific] antecedents, while the null form is
the default option for [-animate; ±specific] antecedents. The conceptual gender of
the antecedent was not decisive for a particular form of encoding, but it seemed to
enhance the specificity of antecedents whose conceptual gender was known.
The thematic role, by itself, does not determine the form of anaphoric resumption.
However, the possibility that the null element resumes a previously described
fact/event makes it a viable alternative to the sentential clitic.
Schooling increased the rates of accusative clitics, especially with [+animate]
antecedents (as an alternative to stressed pronouns), showing the interference of the
written language on the spoken language, as well as the productivity of this form
for educated speakers. The syntactic function of the antecedent did not affect
ADO production. The occurrence of the null element in island contexts corroborates the view that the null element is not a variable in BP. It is argued, in the light of the on-line model, that the accessibility of the antecedent determines the nature of the null element: if the representation of the syntactic structure of the antecedent is still active in working memory, it can be retrieved as an ellipsis, to be restored at the semantic interface; if only the phi features of the antecedent or the semantic representation of its referent remain available, the ADO is encoded as pro.
|
396 |
Deep Learning Inference on Low-Power Commodity Processors and the AMD Versal AI Engine
Lei, Jie 18 November 2024
[EN] This thesis presents a comprehensive study on implementing an efficient realization of GEMM on low-power commodity processors and a heterogeneous platform from AMD. This research is motivated by the increasing demand for low-power, low-latency, high-performance inference with complex Deep Learning (DL) models arising, for instance, in Natural Language Processing (NLP) and Convolutional Neural Networks (CNN). This led to the opportunity to explore the applicability of hardware and software acceleration for GEMM on ARM, RISC-V, and AMD Versal AI Engine (AIE) platforms.
We set the objectives of our research as follows. First, to develop efficient mixed-precision kernels for GEMM on ARM and RISC-V architectures, exploiting the Single-Instruction, Multiple-Data (SIMD) units in these architectures. Second, to explore the applicability of the conventional algorithm for GEMM to non-conventional hardware platforms such as the AIE in the AMD Versal system. Lastly, to investigate the scalability of the parallel design of GEMM to multiple AIEs on AMD Versal systems.
In greater detail, the research starts by implementing GEMM on ARM and RISC-V architectures, where we proposed a template-based micro-kernel code generation tool for ARM Neon, the RISC-V vector (RVV) extension 0.7.1, and RVV 1.0. The code generation tool also allows configuring the micro-kernel dimensions, a critical parameter from the point of view of performance. This work indicates that this kernel code generation drastically improved the productivity and portability of intrinsic-based GEMM designs. We also incorporate mixed-precision INT8|INT32 arithmetic, showing the speedup over FP32 approaches.
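As a rough illustration of what a GEMM micro-kernel with configurable dimensions and INT8|INT32 arithmetic involves, here is a minimal pure-Python sketch. It is not the thesis's generated code and uses no vector intrinsics: the function names, the default tile sizes `mr` and `nr`, and the interpretation of INT8|INT32 as 8-bit inputs with 32-bit accumulation are assumptions made for illustration.

```python
def micro_kernel(A, B, C, i0, j0, mr, nr, K):
    """Accumulate the mr x nr tile of C at (i0, j0): C_tile += A_rows @ B_cols."""
    for p in range(K):  # one rank-1 update per step along the K dimension
        for i in range(i0, i0 + mr):
            a = A[i][p]
            for j in range(j0, j0 + nr):
                C[i][j] += a * B[p][j]


def gemm_int8(A, B, mr=4, nr=4):
    """C = A @ B for INT8-range inputs, tiled into mr x nr micro-kernel calls.

    mr and nr are the (hypothetical) configurable micro-kernel dimensions.
    """
    M, K, N = len(A), len(B), len(B[0])
    if any(not -128 <= v <= 127 for row in A + B for v in row):
        raise ValueError("inputs must fit in INT8")
    # Each product has magnitude at most 128 * 128, so K of them fit in a
    # signed 32-bit accumulator as long as this bound holds.
    if K * 128 * 128 >= 2**31:
        raise ValueError("K too large for INT32 accumulation")
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, mr):
        for j0 in range(0, N, nr):
            # Edge tiles are clipped to the matrix boundary.
            micro_kernel(A, B, C, i0, j0, min(mr, M - i0), min(nr, N - j0), K)
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(gemm_int8(A, B, mr=1, nr=2))
```

In a real implementation the innermost loops would map onto SIMD registers, which is why the tile dimensions matter for performance; the sketch only shows the loop structure that such a tile size parameterizes.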
Building upon the success of the GEMM implementation on conventional commodity systems, we extended our interests to non-conventional heterogeneous platforms, in particular, the AMD Versal AIE architecture. For this platform, we designed architecture-specific 8x8 micro-kernels utilizing the flexible low-level intrinsics, implementing mixed-precision arithmetic and data-packing routines, all aimed at high-performance DL inference. More importantly, we proposed a customized memory hierarchy design for this architecture, which is crucial for low-latency GEMM operations. The results show that the proposed micro-kernels achieved 86.7% of the peak performance of a single AIE implementation. We went a step further by benchmarking the GEMM design on the DL model ResNet-50 v1.5+ImageNet, where we converted the convolution operators to GEMM kernels.
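The abstract does not say how the convolution operators were converted to GEMM. One common lowering is im2col, sketched below under stated assumptions: a single input channel, stride 1, no padding, and the cross-correlation convention used in DL frameworks. The function names are hypothetical and the code is illustrative rather than a description of the thesis's method.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2D image into one row of a matrix."""
    H, W = len(image), len(image[0])
    out_h, out_w = H - kh + 1, W - kw + 1
    rows = [
        [image[r + i][c + j] for i in range(kh) for j in range(kw)]
        for r in range(out_h)
        for c in range(out_w)
    ]
    return rows, out_h, out_w


def matmul(A, B):
    """Plain matrix product, standing in for an optimized GEMM kernel."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def conv2d_as_gemm(image, filters):
    """Apply F filters of size kh x kw to the image via a single matrix product."""
    kh, kw = len(filters[0]), len(filters[0][0])
    patches, out_h, out_w = im2col(image, kh, kw)  # (out_h*out_w) x (kh*kw)
    # Weight matrix of shape (kh*kw) x F, flattened in the same order as the patches.
    weights = [[f[i][j] for f in filters] for i in range(kh) for j in range(kw)]
    flat = matmul(patches, weights)  # (out_h*out_w) x F
    return [
        [[flat[r * out_w + c][k] for c in range(out_w)] for r in range(out_h)]
        for k in range(len(filters))
    ]


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    filters = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
    print(conv2d_as_gemm(image, filters))
```

The point of the lowering is that all filters are applied in one matrix product, so an optimized GEMM kernel can be reused for the convolution layers at the cost of duplicating overlapping input pixels in the patch matrix.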
Following the successful implementation of GEMM on a single AIE tile, we extended our research to multiple AIE tiles, where we introduced parallelization to the algorithm. We redesigned the architecture-specific GEMM to accommodate up to 32 AIE tiles. To achieve this, we optimized the customized memory hierarchy design and proposed a new topology for higher communication throughput. The results show great scalability of the parallel GEMM design, which reduces computational time by a factor of 31.5 compared to the single AIE tile design. / I would like to express my sincere appreciation to Horizon 2020 of the European Union for their generous funding. This project has been supported by the European Union’s Horizon 2020 (H2020) Marie Sklodowska-Curie Innovative Training Networks H2020-MSCA-ITN-2020 call, under Grant Agreement no. 956090. This funding has been crucial in enabling the success of this research. / Lei, J. (2024). Deep Learning Inference on Low-Power Commodity Processors and the AMD Versal AI Engine [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/212297
|
397 |
Psaný hlas: Whitmanovy Listy trávy (1855) a Millerův Obratník Raka / Written Voice: Whitman's Leaves of Grass (1855) and Miller's Tropic of Cancer
Skovajsa, Ondřej January 2014
The PhD dissertation Written Voice examines how Walt Whitman and Henry Miller, through books, the confined textual products of modernity, strive to awaken the reader to a more perceptive and courageous life, provided that the reader is willing to suspend the hermeneutics of suspicion and approach Leaves of Grass and Tropic of Cancer with a hermeneutics of hunger. This is examined from the linguistic, anthropological and theological vantage points of oral theory (M. Jousse, M. Parry, A. Lord, W. Ong, E. Havelock, J. Assmann, D. Abram, C. Geertz, T. Pettitt, J. Nohrnberg, D. Sölle, etc.). The work thus compares Leaves of Grass (1855) and Tropic of Cancer, examining their paratextual and stylistic features, their genesis, the phenomenology of their I's, their ethos, and the story across the compositions. Through the "voluntary" use of means of oral mnemonics such as parallelism/bilateralism (Jousse), along with the present tense, imitatio Christi and the pedagogical use of obscenity, both authors attack in their compositions the textual modern discourse and the posteriority, nostalgia and confinement of literature, restore the body, and aim for the futurality of biblical kinetics. It is the reader's task, then, to hermeneutically resurrect the dead printed words of the compositions into their own "flesh" and action. The third part of the thesis...
|
398 |
Zobrazování voxelových scén pomocí ray tracingu v reálném čase / Rendering of Voxel-Based Scenes Using Real-Time Ray Tracing
Menšík, Jakub January 2021
The aim of this work was to create a program that visualizes voxel scenes in real time using ray tracing. It included a study of various methods for such rendering, with a focus on shadows. The solution was created using the Unity engine and the experimental packages Unity Jobs and Burst. The thesis presents multiple ray tracing passes and the SVGF technique, which is used to turn noisy input into a full, edge-preserving image. The final program is able to render hard shadows, soft shadows, and ambient occlusion at fifty frames per second.
|