If you would like to be added to the mailing list for seminar announcements, or if you would like to give a PARKAS seminar, please email Timothy Bourke.
An iCalendar is available.
We focus on the automated synthesis of divide-and-conquer parallelism, which is a common parallel programming skeleton supported by many cross-platform multithreaded libraries. The challenges of producing (manually or automatically) a correct divide-and-conquer parallel program from a given sequential code are two-fold:
assuming that individual worker threads execute a code identical to the sequential code, the programmer has to provide the extra code for dividing the tasks and combining the computation results, and
sometimes, the sequential code may not be usable as is, and may need to be modified by the programmer.
We address both challenges in this paper. We present an automated synthesis technique for the case where no modifications to the sequential code are required, and we propose an algorithm for modifying the sequential code to make it suitable for parallelization when some modification is necessary. Finally, we present theoretical results for when this modification is efficiently possible, and experimental evaluation of the technique and the quality of the produced parallel programs.
(This work will appear in PLDI 2017.)
PhD thesis defense.
There will be a “pot-de-thèse” after the defense at 45 rue d'Ulm. Please RSVP to Robin by email if you will attend.
Modern multiprocessor architectures and programming languages exhibit weakly consistent memories. Their behaviour is formalised by the memory model of the architecture or programming language; it precisely defines which write operation can be returned by each shared memory read. This is not always the latest store to the same variable, because of optimisations in the processors such as speculative execution of instructions, the complex effects of caches, and optimisations made by compilers.
In this thesis we focus on the C11 memory model that is defined by the 2011 edition of the C standard. Our contributions are threefold.
First, we focused on the theory surrounding the C11 model, formally studying which compiler optimisations it enables. We show that many common compiler optimisations are allowed, but, surprisingly, some important ones are forbidden.
Secondly, building on our results, we developed a random testing methodology for detecting when mainstream compilers such as GCC or Clang perform an incorrect optimisation with respect to the memory model. We found several bugs in GCC, all promptly fixed. We also implemented a novel optimisation pass in LLVM, that looks for special instructions that restrict processor optimisations—called fence instructions—and eliminates the redundant ones.
Finally, we developed a user-level scheduler for lightweight threads communicating through first-in first-out single-producer single-consumer queues. This programming model is known as Kahn process networks, and we show how to efficiently implement it, using C11 synchronisation primitives. This shows that despite its flaws, C11 can be usable in practice.
The members of the thesis committee are:
PhD thesis defense. The room is on the very top floor, see this map. There will be a “pot-de-thèse” downstairs after the defense in Salle S16 in the Rataud Wing at the ENS. The manuscript and supporting materials are available online.
In this thesis we study embedded controllers implemented as sets of unsynchronized periodic processes. Each process activates quasi-periodically, that is, periodically with bounded jitter, and communicates with bounded transmission delays. Such reactive systems, termed quasi-periodic, exist as soon as two periodic processes are connected together. In the distributed systems literature they are also known as synchronous real-time models. We focus on techniques for the design and analysis of such systems without imposing a global clock synchronization.
Synchronous languages were introduced as domain specific languages for the design of reactive systems. They offer an ideal framework to program, analyze, and verify quasi-periodic systems. Based on a synchronous approach, this thesis makes contributions to the treatment of quasi-periodic systems along three themes: verification, implementation, and simulation.
Verification: The quasi-synchronous abstraction is a discrete abstraction proposed by Paul Caspi for model checking safety properties of quasi-periodic systems. We show that this abstraction is not sound in general and give necessary and sufficient conditions on both the static communication graph of the application and the real-time characteristics of the architecture to recover soundness. We then generalize these results to multirate systems.
Implementation: Loosely time-triggered architectures are protocols designed to ensure the correct execution of an application running on a quasi-periodic system. We propose a unified framework that encompasses both the application and the protocol controllers. This framework allows us to simplify existing protocols, propose optimized versions, and give new correctness proofs. We instantiate our framework with a protocol based on clock synchronization to compare the performance of the two approaches.
Simulation: Quasi-periodic systems are but one example of timed systems involving real-time characteristics and tolerances. For such nondeterministic models, we propose a symbolic simulation scheme inspired by model checking techniques for timed automata. We show how to compile a model mixing nondeterministic continuous-time and discrete-time dynamics into a discrete program manipulating sets of possible values. Each trace of the resulting program captures a set of possible executions of the source program.
The members of the thesis committee are:
Many computationally-intensive algorithms benefit from the wide parallelism offered by Graphical Processing Units (GPUs). However, the search for a close-to-optimal implementation remains extremely tedious due to the specialization and complexity of GPU architectures.
We present a novel approach to automatically discover the best performing code from a given set of possible implementations. It involves a branch and bound algorithm with two distinctive features: (1) an analytic performance model of a lower bound on the execution time, and (2) the ability to estimate such bounds on a partially-specified implementation.
The unique features of this performance model allow to aggressively prune the optimization space without eliminating the best performing implementation. While the space considered in this paper focuses on GPUs, the approach is generic enough to be applied to other architectures.
We implemented our algorithm in a tool called Telamon} and demonstrate its effectiveness on a huge, architecture-specific and input-sensitive optimization space.
This will be a practice run for Ulysse's presentation at CC 2017.
We show how partial redundancy elimination (PRE) can be instantiated to perform provably correct fence elimination for multi-threaded programs running on top of the x86, ARM and IBM Power relaxed memory models. We have implemented our algorithm in the backends of the LLVM compiler infrastructure. The optimisation does not induce an observable overhead at compile-time and can result in up-to 10% speedup on some benchmarks.
This will be a practice run for Robin and Francesco's presentation at CC 2017.
PhD thesis defense. The room is on the ground floor, at number 14 on this map.
In this thesis, we are interested in Kahn process networks, a simple yet expressive model of concurrency, and its parallel implementation on modern shared-memory architectures. Kahn process networks expose concurrency to the programmer through an arrangement of sequential processes and single-producer single-consumer channels.
The focus is on the implementation aspects. Of particular importance to our study are two parameters: lock freedom and relaxed memory. The development of fast and efficient lock-free algorithms ties into concerns of controlled resource consumption (important in embedded systems) and reliable performance on current and future platforms with unfair or skewed scheduling such as virtual machines and GPUs. Our work with relaxed memory models complements this more theoretical approach by offering a window into realistic shared-memory architectures.
We present a new lock-free algorithm for a Kahn process network interpreter. It is disjoint-access parallel: we allow multiple threads to work on the same shared Kahn process network, fully utilizing the parallelism exhibited by independent processes. It is non-blocking in that it guarantees global progress in bounded memory, even in the presence of (possibly infinite) delays affecting the executing threads. To our knowledge, it is the first lock-free system of this size, and integrates various well-known non-blocking techniques and concepts (e.g., safe memory reclamation, multi-word updates, assistance) with ideas and optimizations specific to the Kahn network setting. We also discuss a blocking variant of this algorithm, targetted at high-performance computing, with encouraging experimental results.
The members of the thesis committee are:
Even after thirty years, the polyhedral model is far from being an unequivocal success, even on the restricted domain where it is applicable. We do not (yet) have compilers for general-purpose processors that can generate code approaching either the machine peak, or the algorithmic peak of the source program, or hand tuned libraries developed by “heroes of high performance computing.” I will try to explain why this is so, and suggest a plan of how we can improve this state of affairs.
Graphics Processing Units (GPUs) are an attractive target for graph algorithms because they support massively parallel execution and possess much higher memory bandwidths than CPUs and FPGAs. However, graph programs are also highly irregular and low-level GPU programming models make writing high-performance implementations very challenging.
In this talk, I will describe the Galois GPU compiler that we have built to generate high-performance implementations of graph algorithms on GPUs. Beginning with a high-level representation of graph algorithms, which we call IrGL, our compiler generates optimized CUDA code for NVIDIA GPUs by applying three key optimizations that I argue are fundamental to high-performance implementations of graph algorithms on GPUs. Evaluation on eight graph applications shows that our compiler produces code that outperforms all current state-of-the-art handwritten CUDA implementations.
Graph program performance strongly depends on the inputs, and this makes picking the right set of optimizations for a given input very difficult. We show how our compiler can exploit queuing network models generated from IrGL source code to customize optimizations for combinations of applications and inputs.
Parts of this work will appear in OOPSLA 2016.
In this talk, I consider temporally causal functions of timed streams, i.e., functions that produce output values at a given instant that only depend on the input values that have been received till that instant. Defining partial timed streams that form a directed complete partial order (DCPO), I will show how causal functions are nicely captured as limits of continuous and synchronous functions over these DCPO offering thus a denotational model of causal timed functions. Relaxing the synchronous hypothesis to a pre-synchronous hypothesis, I will show how every causal function admits a lattice of possible denotations such that:
The categorical properties of these continuous functions will then be detailed, offering various relevant operators for programing with them. Lastly, I will show that a notion of causal function residual is available so that a minimal (possibly continuous) IO-automaton can be associated to every causal function, providing thus an abstract operational (latest) semantics. Open perspectives and questions will conclude this presentation.
Contract-based software development is a leading methodology for the construction of safety- and mission-critical embedded systems. Contracts are an effective way to establish boundaries between components and can be used efficiently to verify global properties by using compositional reasoning techniques. A contract specifies the assumptions a component makes on its context and the guarantees it provides. Requirements in the specification of a component are often case-based, with each case describing what the component should do depending on a specific situation (or mode) the component is in. This talk introduces CoCoSpec, a mode-aware assume-guarantee-based contract language for embedded systems built as an extension of the Lustre language. CoCoSpec lets users specify mode behavior directly, instead of encoding it as conditional guarantees, thus preventing a loss of mode-specific information. Mode-aware model checkers supporting CoCoSpec can increase the effectiveness of the compositional analysis techniques found in assume-guarantee frameworks and improve scalability. Such tools can also produce much better feedback during the verification process, as well as valuable qualitative information on the contract itself. I will presents the CoCoSpec language and illustrate the benefits of mode-aware model-checking on a case study involving a flight-critical avionics system. The evaluation uses Kind 2, a collaborative, parallel, SMT-based model checker developed at the University of Iowa that provides full support for CoCospec.
Real-time stream processing applications executed on embedded multiprocessor systems often have strict throughput and latency constraints. Violating these constraints is undesired and temporal analysis methods are therefore used to prevent such violations. These analysis methods use abstractions of the analyzed applications to simplify their temporal analysis.
Refinement theories have enabled the creation of deterministic abstractions of stream processing applications that are executed on multiprocessor systems. Prominent examples of such abstract models are deterministic timed-dataflow models which can be efficiently analyzed because they only have one behavior.
An important aspect of a stream processing application can be that it makes use of reordered data streams between tasks. An example is the bit-reversed ordered stream produced by a FFT task. Another example is reordering that is a result of data-parallelism. However, existing abstraction/refinement theories do not support such reordering behavior or do not handle this type of behavior correctly. This is because existing refinement theories assume that the temporal behavior of applications is orthogonal to their functional behavior, whereas this orthogonality does not always hold in the case of reordered data streams.
In this talk I will present a new refinement theory in which the potential interaction between temporal and functional behavior is taken into account. The introduced theory supports reordering of data and can therefore be used to validate existing systems with such reordering. Furthermore, the theory enables showing that deterministic dataflow models that do not apply reordering can be used as valid abstractions of systems in which reordering takes place.
Note 1: Marco will also serve on the thesis committee of Khanh Xuan Do, CEA List, NanoInnov, Monday October 17 at 14:30.
Note 2: we will continue with the seminar of Adrien Champion in the same room after a short coffee break.
Q*cert is a query compiler written in Coq. Initially dedicated to the compilation of jRule to Cloudant, it is now able to compile several source languages (e.g., OQL, NRA) and to generate code for Java or Spark. Internally, Q*cert uses 14 languages and 27 unitary translation functions to go directly from one language to another. In this talk, I will present how we handle all these compilation paths and how proofs help to maintain the options of the command line compiler.
Joint work with Joshua Auerbach, Martin Hirzel, Avraham Shinnar and Jérôme Siméon.
Travail réalisé en collaboration avec Joshua Auerbach, Martin Hirzel, Avraham Shinnar et Jérôme Siméon.
Writing code which uses an architecture at its full capability has become an increasingly difficult problem. For some key operations, a dedicated accelerator or a finely tuned implementation exists and delivers the best performance. When compiling code, identifying these operations and issuing calls to their high-performance implementations is thus attractive. In this talk, we focus on the problem of detecting such operations. We propose a framework which detects linear algebra subcomputations within a polyhedral program. The main idea of this framework is to partition the computation in order to isolate different subcomputations in a regular manner, then we consider each portion of the computation and try to recognize it as a combination of linear algebra operations.
We perform the partitioning of the computation by using a program transformation called monoparametric tiling. This transformation partitions the computation into blocks, whose shape is some homothetic scaling of a fixed-size partitioning. We show that the tiled program remains polyhedral while allowing a limited amount of parametrization: a single size parameter. This is an improvement compared to the previous work on tiling, that forced a choice between these two properties.
Then, in order to recognize computations, we introduce a template recognition algorithm built built on a state-of-the-art program equivalence algorithm. We also propose several extensions in order to manage some semantic properties.
Finally, we combine these two contributions into a framework which detects linear algebra subcomputations. A part of this framework is a library of templates based on the BLAS specification. We demonstrate our framework on several applications.
Achieving high performance is no easy task. From the early days of computing, many researchers have spent their lifetime trying to extract more FLOPS out of critical codes. Hardcore performance engineers try to get to this performance nirvana single handedly without any help from languages, compilers or tools. In this talk I'll argue that programming languages and compiler technology can take on most of the performance optimization burden... at least in certain domains. We will demonstrate the performance optimization ability of compilers and languages in the domains of image processing and physical simulation. Most these applications are performance critical, and programmers spend many months optimizing them. In the first part of this talk, I will describe Halide, a language and compiler for image processing. Halide explicitly separates the description of the algorithm, the task of the domain expert, from the choices of execution structure which determine parallelism, locality, memory footprint, and synchronization, the domain of the performance engineer. We show how this separation simplifies the problem for both the domain expert and the performance engineer. Using machine learning we can even eliminate the performance engineer entirely from the process. The end result is much simpler programs, delivering performance often many times faster than the best prior hand-tuned C, assembly, and CUDA implementations, while scaling across radically different architectures, from ARM cores to massively parallel GPUs.
In the second part of the talk, I will introduce Simit, a new language for physical simulations. Simit lets the programmer seamlessly work on a physical system both in its individual geometric elements as a hypergraph as well as the behavior of the entire system as set of global tensors. Simit provides a novel assembly construct that makes it conceptually easy and computationally efficient to move between these two abstractions. Using the information provided by the assembly construct, the compiler generates efficient in-place computation on the graph. We demonstrate that Simit is easy to use: a Simit program is typically shorter than a Matlab program; that it is high performance: a Simit program running sequentially on a CPU performs comparably to hand-optimized simulations; and that it is portable: Simit programs can be compiled for GPUs with no change to the program, delivering 4 to 20x speedups over our optimized CPU code.
Bio: Saman P. Amarasinghe is a Professor in the Department of Electrical Engineering and Computer Science at Massachusetts Institute of Technology and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL) where he leads the Commit compiler group. Under Saman's guidance, the Commit group has developed the StreamIt, PetaBricks, StreamJIT, Halide, Simit, and MILK programming languages and compilers, DynamoRIO dynamic instrumentation system, Superword level parallelism for SIMD vectorization, Program Shepherding to protect programs against external attacks, the OpenTuner extendable autotuner, and the Kendo deterministic execution system. He was the co-leader of the Raw architecture project. His research interests are in discovering novel approaches to improve the performance of modern computer systems and make them more secure without unduly increasing the complexity faced by the end users, application developers, compiler writers, or computer architects. Saman was the founder of Determina Corporation and a co-founder of Lanka Internet Services Ltd. Saman received his BS in Electrical Engineering and Computer Science from Cornell University in 1988, and his MSEE and Ph.D from Stanford University in 1990 and 1997, respectively.
We will present recent developments in Wagner, a functional programming language geared towards real-time digital audio processing with formal, machine-checkable semantics.
Wagner can be seen as either a very simple synchronous language of a functional reactive one, using coeffects for history tracking, allowing a natural coding of filter and digital waveguide oscillators.
In order to relate programs with their mathematical model three main components are needed:
A Coq library with some facts about the Discrete Fourier Transform and unitary transforms. The formalization is constructive and heavily relies on the Mathematical Components library.
An embedding of Wagner into Coq, crucially allowing a lightweight mechanization style as we can reuse most of Mathcomp's linear algebra libraries. The functional nature of Wagner allows the use of typical PL techniques such as for example logical relations to characterize the subset of Wagner programs that are “linear processors”.
Finally, a call-by-value abstract machine with a heap provides the efficient execution model. Basic ideas from the focusing literature are used to distinguish “positive” and “negative” types.
We embed the machine in Coq using the standard monadic CBV translation and we link it to the semantics in 2) by relating well-formed heaps with typing environments.
Search heuristics are of utter importance for solving Constraint Satisfaction Problems (CSP) efficiently. However, the implementers will struggle to choose between a full-fledged but low-level constraint library or high-level languages proposing a limited amount of pre-built heuristics. We introduce Spacetime Programming (STP), a paradigm based on concurrent constraint programming and synchronous languages, where both time and space are part of the language. The variables are characterized by a “spacetime” property describing how their values evolve through time and space. We will present preliminary work demonstrating how STP, in the context of CSP solving, can be used for writing and composing search strategies.
Equation-based object-oriented (EOO) modeling languages such as Modelica provide a convenient, declarative method for describing models of cyber-physical systems. Because of the ease of use of EOO languages, large and complex models can be built with limited effort.
However, current state-of-the-art tools do not provide the user with enough information when errors appear or simulation results are wrong. It is of paramount importance that such tools should give the user enough information to correct errors or understand where the problems that lead to wrong simulation results are located. However, understanding the model translation process of an EOO compiler is a daunting task that not only requires knowledge of the numerical algorithms that the tool executes during simulation, but also the complex symbolic transformations being performed.
As part of this work, methods have been developed and explored where the EOO tool, an enhanced Modelica compiler, records the transformations during the translation process in order to provide better diagnostics, explanations, and analysis. This information is used to generate better error-messages during translation. It is also used to provide better debugging for a simulation that produces unexpected results or where numerical methods fail.
Meeting deadlines is particularly important for real-time applications. It is usually essential to identify possible bottlenecks and either simplify the model or give hints to the compiler that enable it to generate faster code. When profiling and measuring execution times of parts of the model the recorded information can also be used to find out why a particular system model executes slowly.
Combined with debugging information, it is possible to find out why this system of equations is slow to solve, which helps understanding what can be done to simplify the model. A tool with a graphical user interface has been developed to make debugging and performance profiling easier. Both debugging and profiling have been combined into a single view so that performance metrics are mapped to equations, which are mapped to debugging information.
The algorithmic part of Modelica was extended with meta-modeling constructs (MetaModelica) for language modeling. In this context a quite general approach to debugging and compilation from (extended) Modelica to C code was developed. That makes it possible to use the same executable format for simulation executables as for compiler bootstrapping when the compiler written in MetaModelica compiles itself.
Bio: Martin Sjölund is doing a postdoc at Linköping University where he finished his PhD in 2015. The thesis is mostly focused on debugging and profiling of an equation-based object-oriented (EOO) language called Modelica. He has worked on various parts of the OpenModelica compiler since 2009, both doing research (resulting in the PhD) and development work for the Open Source Modelica Consortium (resulting in improved portability, performance, scalability, and new features in OpenModelica).
Context: part of the PhD thesis work of Oleksandr Zinenko.
The Polyhedral model is a powerful framework for program transformation, yet it remains complex for direct use, even by experts. We leverage the inherent geometric nature of the model to visualize programs and to support code transformation by directly manipulating these visualizations with automatic code generation and feedback on semantics preservation. Besides manual transformation, the user of our interactive tool, Clint, is able to recover, visually replay and adjust an automatically computed optimization using a set of syntactic-level directives with strong semantics. Combining this visualization with existing polyhedral tools introduces visual aid for schedule and transformation construction. Empirical user studies demonstrated that visual representations allow to extract and express parallelism faster and with more success.
Context: presentation of the PhD thesis work of Ivan Llopard; the thesis was prepared at CEA Leti and will be defended on April 26th in Grenoble.
Modeling languages propose convenient abstractions and transformations to harness the complexity of today's embedded systems. These languages are often based on the formalism of Hierarchical State Machines, such as the widely known Statecharts. While they naturally capture the concurrency of control systems, they face two important challenges when it comes to modeling data-intensive applications on modern parallel architectures: the absence of a unified approach that also accounts for (data) parallel actions; and the lack of an effective optimization and code generation flow. We propose a modeling language extended with parallel action semantics and indexed state machines suitable for computationally intensive applications. Together with its formal semantics, we present an optimizing compiler aiming for the generation of efficient parallel implementations. Following a model-driven methodology, the compilation flow has been successfully applied to manycore architectures such as GPGPU platforms.
Synchronous Dataflow graphs, an equivalent of weighted petri-nets introduced by Lee and Messerschmitt in 1987, is a well-known formalism commonly used to model data-exchanges between parallel programs. This model has been extensively studied in the last two decades because of the importance of its applications. However, the determination of a maximal throughput is a difficult question, for which no polynomial time algorithm exists up to now. Several authors proved that it exists a schedule which reaches the maximum throughput but with a no polynomially bounded size. On the other hand a strictly periodic schedule may be built polynomially, but without any guarantee on the throughput achieved. In this context, we investigate the trade-off between schedules size and their corresponding throughput. Additionally, the applicability of these models on SLAM algorithms is explored.
Les systèmes embarqués sont des dispositifs informatisés dédiés à une tâche spécifique, typiquement le contrôle/commande d'un processus physique. On en trouve dans le domaine des transports (freinage assisté, allumage moteur, commande de vol), du contrôle de processus industriel, ou de la production d'énergie. Les termes de systèmes réactifs, systèmes temps-réel (dur) couvrent à peu près la même réalité.
Une caractéristique principale de ces systèmes est la criticité : les défaillances peuvent être catastrophiques. Il faut donc s'assurer de la correction de ses systèmes avant leur mises en service.
Pour être considéré comme correct, un tel système doit essentiellement : - calculer juste : il réalise bien la fonctionnalité attendue, sans bogue de conception ni erreur à l'exécution ; - calculer (suffisamment) vite : la vitesse de réaction du système doit être en adéquation avec la dynamique du processus physique qu'il contrôle.
L'approche synchrone, formalisée dans le courant des années 80, s'intéresse spécifiquement aux aspects fonctionnels. Elle propose un cadre idéalisé pour la conception des systèmes (concurrence déterministe), et fournit les outils nécessaires pour la programmation, la validation, la compilation et l'implantation du logiciel embarqué. L'outil Scade (Esterel technologies) est un exemple typique d'un environnement de programmation synchrone, utilisé en production dans l'industrie (commandes de vol Airbus et bien autres).
L'analyse temporelle cherche à vérifier que le temps de réaction du système implanté est conforme aux contraintes temps-réel de son environnement. Aux méthodes dynamiques classiques, basées sur le test et la mesure, se sont ajoutées à partir des années 90 des méthodes d'analyse statiques qui cherchent à déterminer des bornes supérieures garanties aux temps de réaction. Le problème de base de ces méthodes est l'estimation du pire temps d'exécution d'un code séquentiel donné sur une architecture donnée (Worst Case Execution Time, ou WCET).
Cet exposé présente les complémentarités des deux domaines, puis aborde plus spécifiquement des travaux récents qui visent à analyser la sémantique des programmes synchrones pour découvrir des chemins d'exécutions infaisables sur le code compilé, et donc, potentiellement, d'affiner les estimations de pire temps d'exécution.
Dans le contexte des systèmes embarqués critiques, l'objectif principal du logiciel n'est pas d'effectuer des calculs extrêmement longs et complexes à partir d'une donnée qui lui est fournie initialement, mais de réagir de façon sure et déterministe aux évènements externes qui lui sont communiqués de manière régulière tout au long de son exécution. La réalisation de tels logiciels peut se faire en utilisant des langages de programmation dédiés dont la conception et les techniques de compilation sont un sujet de recherche actif depuis une trentaine d'années. Les langages de programmation synchrones, comme Esterel ou Lustre, ont prouvé leur utilité dans la réalisation des systèmes réactifs. Après une introduction aux principes des langages synchrones, cet exposé présentera à travers le langage de programmation Scade 6 (dialecte de Lustre) les travaux de recherche d'Esterel Technologies sur la réalisation et la compilation d'un langage textuel et graphique. Nous évoquerons les mécanismes particuliers à Scade qui ont été introduit ainsi qu'un travail en cours sur l'utilisation de Scade pour la simulation de systèmes physiques.
Finding a near-optimal implementation of a program on a GPU is a daunting task. The programmer or the compiler must not only explore regular code optimizations but also handle the placement of the data in the memory hierarchy and choose a parallelism scheme. The solution we propose explores a set of possible implementation to find the optimal one for a given input and GPU model. The problem with this approach is that the search space has an exponential size. We thus propose a performance model to trim the search space and drive the search.
The seminar is based on the work conducted during an internship at Google.
PhD thesis defense.
This thesis addresses the design and implementation of a programming language for real-time stream processing applications, such as video decoding. The model of Kahn process networks is a natural fit for this area and has been used extensively. In this model, a program consists in a set of parallel processes communicating through single reader, single writer queues. The strength of the model lies in its determinism.
Synchronous functional languages like Lustre are dedicated to critical embedded systems. A Lustre program defines a Kahn process network which is synchronous, that is, which can be executed with finite queues and without deadlocks. This is enforced by a dedicated type system, the clock calculus, which establishes a global time scale throughout a program. The global time scale is used to define clocks: per-queue boolean sequences indicating, for each time step, whether a process produces or consumes a token in the queue. This information is used both for enforcing synchrony and for generating finite-state software or hardware.
We propose and study integer clocks, a generalization of boolean clocks featuring arbitrarily big natural numbers. Integer clocks describe the production or consumption of several values from the same queue in the course of a time step. We then rely on integer clocks to define the local time scale construction, which may hide time steps performed by a sub-program from the surrounding context. These principles are integrated into a clock calculus for a higher-order functional language. We study its properties, proving among other things that well-typed programs do not deadlock. We compile typed programs to digital synchronous circuits by adapting the clock-directed code generation scheme of Lustre. The typing information controls certain trade-offs between time and space in the generated circuits.
The members of the thesis committee are:
Loosely Time-Triggered Architectures (LTTAs) are a proposal for constructing distributed embedded control systems. They build on the quasi-periodic architecture, where computing units execute ‘almost periodically’, by adding a thin layer of middleware that facilitates the implementation of synchronous applications.
We show how the deployment of a synchronous application on a quasi-periodic architecture can be modeled using a synchronous formalism. Then we detail two protocols, Back-Pressure LTTA, reminiscent of elastic circuits, and Time-Based LTTA, based on waiting. Compared to previous work, we present controller models that can be compiled for execution and a new version of the Time-Based protocol that neither requires broadcast communication nor imposes global synchronization. We also compare the LTTA approach with architectures based on clock synchronization.
Based on an EMSOFT 2015 presentation.
PhD thesis defense.
We are interested in enabling compilers to automatically optimize and parallelize code. The polyhedral framework is showing promising results in this area. In this thesis, we address three limitations in the polyhedral framework and we present a practical solution to each one of these limitations. First, we address the problem of tiling code that has false dependences. Compilers cannot tile such code without privatization or expansion. To avoid this problem, we propose a new relaxed permutability criterion that allows a compiler to ignore false dependences between iteration-private live ranges and thus enables the compiler to tile the loops without privatization or expansion. Second, we address the problem of compilation time. To reduce the long compilation time of large programs, we introduce a technique that we call statement clustering and in which the original representation of the program is transformed into a new program representation (where each group of statements is represented as a macro-statement). The new representation of the program can be optimized faster than the original representation. The third limitation is related to the ability of compilers to automatically parallelize irregular code and to generate efficient code when information is missing to the compiler. To address this limitation, we introduce PENCIL, an intermediate language for DSL compilers and for accelerator programming. PENCIL is a subset of the C99 language carefully designed to capture static properties essential to enable more precise data dependence analysis. It provides also a set of language constructs that enable parallelizing compilers to generate more efficient code. The use of PENCIL would enable the compiler to generate optimized code that is quite difficult to generate otherwise.
The members of the thesis committee are:
The Sequentially Constructive Language (SCL) is a minimal synchronous language that captures the essence of the Sequentially Constructive model of computation (SC MoC), a recently proposed extension of the classical synchronous model of computation. The SC MoC uses sequential scheduling information to increase the class of constructive (legal) synchronous programs. This should facilitate the adoption of synchronous programming by users familiar with sequential programming in C or Java, thus simplifying the design of concurrent reactive/embedded systems with deterministic behavior. In this presentation, I focus on how to compile SCL down to data-flow equations, which ultimately can be synthesized to hardware or executed in software.
The task of modelling and simulating a complex physical system is often a challenge to engineers and researchers since the applications of interest routinely involve several time-scales, multiple sub-systems strongly or loosely coupled, and components whose behaviours are very distinct. The simulation of such systems is done in most software by a single numeric solver, which is often an inelegant solution, compared to harvesting the results of a herd of specialised solvers, certainly more efficient in their respective tasks than a general purpose solver. This talk will explore a few leads to help shepherd several solvers on the same simulation. Emphasis will be placed on the role of 0th-order value holders in the computation, and on the compilation of a distributed simulation.
With the growing core count on chip multiprocessors, contention has become an important factor of performance degradation. Yet, little efforts have been devoted to understanding how it affects performance of data structures. Some were devoted to introducing different definitions of contention whereas others were devoted to the implementation of efficient data structures with terrible complexity. In the first part of this talk, I motivate the mismatch between some complexity metrics and performance measurements by presenting a technique to reduce contention in logarithmic data structures. In the second part of this talk, I show that definitions of contention can be considered equivalent in some context and illustrate this with an efficient concurrent linked list.
In this talk, we claim that structured parallel programming greatly simplifies the task of writing correct and efficient parallel programs, while also capturing the flexibility of expressing asynchronous computations. We support this claim by introducing a family of structured parallel programming primitives developed in the Habanero Extreme Scale Software Research project at Rice University, and discussing semantic guarantees (e.g., deadlock freedom, data race freedom, determinism, serial elision) for different subsets of parallel programs that can be constructed with these primitives. We will then summarize results for compiler optimizations, runtime systems, repair and datarace detection for these subsets. If time permits, we will briefly discuss future directions for data-driven repair and synthesis of parallel programs in the new Pliny project.
The cost efficient integration of different applications with different levels of safety and security on a single computing platform has become a major industrial challenge. Such a platform would preferably be an off-the-shelf shared memory multicore processor. The performance requirements of individual applications and components also push for their parallelization. Focusing on the safety-critical domain, we work on extending to multicore processors with the established modular design and implementation methods used for synchronous applications. In this talk, I will describe our approach to the modular composition of synchronous components with hard real-time constraints. Preserving time-predictability is the most difficult barrier to the adoption of multicores in safety-critical systems. Our solution is compatible with off-the-shelf embedded shared-memory multiprocessors, introducing minor modifications to the syntax and semantics of Heptagon, a synchronous programming dialect of Lustre. It is demonstrated on an industrial use case from Alstom Transport, involving a time-triggered software stack designed on top of Linux and bare-metal components running on a Xilinx Zynq multicore platform. Our solution involves an original methodology for the design of "mixed-time-critical" applications. The safety levels and correctness by construction of synchronous designs is preserved. Performance and time predictability can be traded off against each other. This involves the partitioning of an application into modes of different time criticality, and is supported by time-critical backup scenarios when best-effort components miss their deadline. This work takes place in the context of the EMC2 project.
Programming accelerators such as GPUs with low-level APIs and languages like OpenCL and CUDA is difficult, error prone, and not performance-portable. Automatic parallelization and domain specific languages (DSLs) have been proposed to hide this complexity and to regain some performance portability. In this presentation, I will present PENCIL (Platform-Neutral Compute Intermediate Language) and present some details about how it is compiled. PENCIL is a rigorously defined subset of GNU C99 with specific programming rules and few extensions. Adherence to this subset and the use of these extensions enable compilers to exploit parallelism and to better optimize code when targeting accelerators. We intend PENCIL both as a portable language to facilitate accelerator programming, and as an intermediate language for DSL compilers. We validate the potential of PENCIL on a state-of-the-art polyhedral compiler, extending the applicability of the compiler to dynamic, data-dependent control flow and non-affine array accesses. We use the polyhedral compiler to generate highly optimized OpenCL code for a set of standard benchmark suites (Rodinia and SHOC), image processing kernels, and for two DSL compilers: a linear algebra (BLAS) DSL compiler and a DSL for signal processing radar applications (SpearDE). To assess performance portability, we present experimental results on many GPU platforms: AMD Radeon HD 5670, Nvidia GTX470, and ARM Mali-T604 GPU.
Based on a PACT 2015 presentation.
Greybus is the name for a new application layer protocol on top of Unipro that controls the Ara Phone from Google. This protocol turns a phone into a modular device, allowing any part of the system to be hotplugged while the phone is running. This talk will describe what this protocol is, why it was designed, and give the basics for how it works. It will discuss how this is implemented in the Linux kernel, and how it easily bridges existing hardware like USB, I2C, GPIO and others with little to no changes needed to existing kernel drivers.
This talk presents libkpn, a high-performance runtime system for task parallel programming, based on Kahn process networks (KPNs). Away from the usual perception of KPNs as a high-level programming model compiled into general-purpose languages, we explore how KPNs can be used directly as concurrency specifications to guide dynamic scheduling in a runtime system. Through a comparison with existing models, we explain how features of KPNs can be taken advantage of to allow more logical parallelism to be expressed, and to create an efficient implementation.
This talk addresses the well-known challenge involved in simultaneously delivering high productivity and high parallel performance on modern multicore architectures—through the development of domain-specific languages (DSLs) and their optimizing code generators. It presents the domain of Image Processing Pipelines as the motivating case by presenting PolyMage, our DSL and its code generator for automatic and effective optimization of image processing pipelines. PolyMage takes an image processing pipeline expressed in a high-level language (embedded in Python) and generates an optimized C/C++ implementation of the pipeline. We show how certain techniques including those based on the polyhedral compiler framework need to be specialized in order to provide significant improvements in parallel performance over existing approaches including manual, library-based, and that of another state-of-the-art DSL (Halide). More information at http://mcl.csa.iisc.ernet.in/polymage.html. Experimental results on a modern multicore system and a short demo will also be presented.
The synchronous model of programming, which emerged in the 1980s and has led to the development of well-known languages such as Statecharts, Esterel, Signal, and Lustre, has made the programming of concurrent systems with deterministic and bounded reaction a routine exercise. The synchronous approach draws its success from the fact that it is founded on a simple and well-understood mathematical model of computation, the Mealy automaton. However, the validity of this model is not for free. It depends on the Synchrony Hypothesis according to which a system is invariably faster than its environment. Yet, this raises a tangled compositionality problem. Since a node is in the environment of every other node, it follows that each node must be faster than every other and hence faster than itself!
This talk presents some results towards solving this paradox, refining the classical Mealy model by constructive notions of input-output causality. Specifically, we propose a game-theoretic interpretation of the constructive semantics of step responses for synchronous languages, providing a coherent semantic framework encompassing both non-deterministic Statecharts (as per Pnueli & Shalev) and deterministic Esterel. In particular, it is shown that Esterel arises from a finiteness condition on strategies whereas Statecharts permits infinite games. Beyond giving a novel and unifying account of these concrete languages the talk sketches a general theory for obtaining different notions of constructive responses in terms of winning conditions for finite and infinite games and their characterisation as maximal post-fixed points of functions in directed complete lattices of intensional truth-values.
The presentation has two parts:
PART 1: AUTOMATED BUG DIAGNOSIS PART 2: VIRTUAL INSTRUCTION SET COMPUTING
I will present our work on two ongoing projects: (1) automated diagnosis of software failures, and (2) the benefits of using rich virtual instruction sets as a persistent representation of all software. If time permits, I will discuss a mythical connection between the two projects.
In the first part of the talk, I will describe new insights into analyzing the root causes of software failures automatically. We use dynamically observed likely invariants to identify candidate locations for the root cause of a failure, and then use novel static and dynamic analyses to filter out false positives and to extract valuable information relevant to understanding the failure. Experimental results show that we are able to narrow down the locations of root causes to just a few (e.g., 2-10) program locations, even in large programs, for nearly all bugs we have tested. We are now exploring how to automate the process fully using dynamic symbolic execution, and how to apply these fault isolation techniques for diagnosis and triaging of software failures in production software.
Next, I will discuss a project on Virtual Instruction Set Computing (VISC). Despite extensive advances in managed run-time systems and dynamic compilation, high performance applications and security sensitive systems software are almost universally compiled and shipped as native machine code. In the VISC project, we are exploring the security, reliability, and performance implications of shipping all such software in virtual instruction set form. We use the LLVM bitcode language as the shipping representation to enable advanced "lifelong" program analysis and transformation for such programs. We have shown that shipping operating systems in LLVM form enables powerful security guarantees to be enforced using compiler techniques. We are now exploring how the approach could be used to improve programmability and portability for heterogeneous parallel systems, and to enable novel performance optimization techniques.
De nombreux outils sont aujourd'hui disponibles pour l'analyse et la production temps réel de flux média temporisés : sons, vidéo, animations, etc. Néanmoins, la coordination de ces outils, la synchronisation des flux qu'ils analysent et produisent, sur des échelles de temps de valeurs et même de nature différentes, reste une affaire délicate.
Le modèle des Tiled Polymorphic Temporal Media (ou TPTM), qui combine en un même formalisme le contenu media de ces flux et leurs marqueurs de synchronisations, vise à remédier à cela. Dans le modèle, le produit de deux flux ainsi enrichi, paramétré par ces marqueurs de synchronisations, est tout à la fois séquentiel et parallèle : c’est un produit tuilé.
En théorie, la sémantique de ces flux tuilés peut être décrite dans la théorie des monoïdes inversifs. En pratique, nous avons réalisé, en Haskell, la première implémentation réellement polymorphe et inversive de ces Tiled Polymorphic Temporal Media. Notre implémentation permet en outre, via le mécanisme d’évaluation paresseuse d’Haskell, de distinguer simplement la syntaxe de ces flux – un système d’équations tuilées – de leur sémantique opérationnelle – la résolution de ce système à la volée.
Nous montrerons aussi comment nos constructions se généralisent simplement aux tuiles infinies.
Ce travail a été partiellement réalisé en collaboration avec Paul Hudak (Yale) et Théis Bazin (Stagiaire, ENS Cachan).
The IC3 procedure by A. Bradley is a recent model checking procedure for proving safety properties of finite-state systems that has shown to be quite effective in practice. This talk presents an SMT-based version of IC3 that works with both finite- and infinite-state systems. A distinguishing feature of our approach is the use of approximate quantifier elimination to generalize counterexamples. We provide experimental evidence that for linear integer arithmetic our method is superior in practice to full-blown quantifier elimination, and that our IC3 implementation is competitive with a state-of-the-art k-induction model checker for infinite-state systems. Our experiments also show that IC3 and k-induction have complementary strengths and so can be combined advantageously in a multi-engine parallel checker.
The field of model checking owes much of its great success and impact to the use of symbolic techniques to reason efficiently about functional properties of hardware or software systems. Traditionally, these techniques have relied on propositional encodings of transition systems and on propositional reasoning engines such as binary decision diagrams and SAT solvers. More recently, a number of these techniques have been adapted, and new ones have been devised, based instead on encodings into fragments of first-order logic supported by Satisfiability Modulo Theories (SMT) solvers. These are highly efficient solvers specialized on checking the satisfiability of formulas in certain logical theories such as the theory of linear arithmetic, bit-vectors, arrays, algebraic data types, and so on. SMT-based model checking methods blur the line between traditional (propositional) model checking and traditional (first or higher order) deductive verification. More crucially, they combine the best features of both by offering the scalability and scope of deductive verification while maintaining comparable levels of automations as propositional model checking. This talk will briefly introduce SMT and then give an overview of SMT-based model checking, focusing on a number of notable approaches and techniques.
FoCaLiZe is a programming language allowing to state in a same framework algorithms (programming aspects), properties (logical aspects) that these algorithms must satisfy and proofs these latter hold. It provides high-level constructs to ease development structuring and reuse (modularity, inheritance, late-binding, parameterization) and a proof language allowing to write proofs in a hierarchical way, using a natural deduction style, which rely on the Zenon automated theorem prover to discharge goals ... and the burden of the user. The core programming language is a pure functional language while properties are stated a first-order logic formulae. A FoCaLiZe development is compiled in two target languages: one OCaml source code leading to an executable (or object file) and one Coq source code containing the complete model of the development (i.e. definitions, properties and proofs) that will be ultimately checked by Coq as assessment. A crucial care in this process is to avoid any error from the target languages being returned to the user: a program accepted by the FoCaLiZe compiler must be accepted by both OCaml and Coq. To increase confidence in "what is ran is what is proved" despite two target languages, the compiler has to use a code generation model identical and traceable for both of them.
In this presentation we will introduce the motivations, design and features choices that impacted the language semantics. The presentation of the language features and the development style will also be addressed on the basis of examples, trying to illustrate advantages (and disavantages) of such a system compared to other existing formal proofs systems.
A functional intermediate language is a simple and effective formalization of static single assignment (SSA) for the purpose of verified compilation of an imperative language. Two translations are of particular importance: Translating from the imperative source language (e.g. C) to the functional language, and translating from the functional intermediate language to an imperative target language (e.g. assembly).
This talk discusses the above translations in the context of the intermediate language IL. IL comes with an imperative and a functional interpretation. IL/I (imperative interpretation) is a register transfer language close to assembly. IL/F (functional interpretation) is a first-order functional language with a tail-call restriction. We devise the decidable notion of coherence to identify programs on which the semantics of IL/I and IL/F coincide. Translations between the interpretations are obtained from equivalence-preserving translations to the coherent fragment. The translation from IL/I to IL/F corresponds to SSA construction. The translation from IL/F to IL/I corresponds to SSA deconstruction and involves register assignment.
The translations are implemented and proven correct using the proof assistant Coq. Extraction yields translation-validated implementations of SSA construction and deconstruction.
On modern multicore systems, the interaction between co-run programs largely depends on cache sharing, and cache sharing depends on the locality, i.e. the active data usage, in co-run programs. This talk introduces a newly developed locality theory based on a concept called the footprint. The theory shows the composability of footprint and the conversion between footprint and other locality metrics. It enables accurate characterization and analysis of the dynamics of cache sharing.
Designing and implementing thread-safe multithreaded libraries can be a daunting task as developers of these libraries need to ensure that their implementations are free from concurrency bugs including deadlocks. The usual practice involves employing software testing and/or dynamic analysis to detect deadlocks. Their effectiveness is dependent on well-designed multithreaded test cases. Unsurprisingly, developing multithreaded tests is easily an order of magnitude harder than developing single threaded tests for obvious reasons.
In this talk, we will address the problem of automatically synthesizing multithreaded tests that can induce deadlocks. The key insight to our approach is that a subset of the properties observed when a deadlock manifests in a concurrent execution can also be observed in a single threaded execution. We design a novel, automatic, scalable and directed approach that identifies these properties and synthesizes a multithreaded test that realizes these properties simultaneously revealing a potential deadlock. The input to our approach is only the library implementation under consideration and the output is a set of deadlock revealing multithreaded tests. We will present the results of our analysis on synthesizing multithreaded tests for Java libraries. We find 35 real deadlocks in thread safe classes in COLT, a popular high performance scientific computing library.
Multicore systems yield high performance by the execution of multiple concurrent software processes in the system in a power and area efficient way. While multicore systems are promising architectures for higher performance, the soft error reliability of these systems becomes a major concern due to smaller transistor sizes and low power operating modes.
My talk will include work during my Phd and recent work, related to reliability and performance analysis of multicore systems.
I am currently working on ARTEMIS PaPP project. PaPP (Portable and Predictable Performance on Heterogeneous Embedded Manycores) aims to achieve predictable performance portability of software running on different parallel platforms. We focus on performance analysis modelling for task-based parallel applications within multicore systems. Our modelling targets to find out minimal resources which satisfy performance requirements of an application and to develop task scheduling strategies as an input to run-time for better resource management and performance optimization.
In my PhD study, I was working on reliability analysis of multicore systems. I propose Thread Vulnerability Factor metric, to quantify vulnerability of thread and to qualify the relative vulnerability of parallel applications to soft errors, and utilize the metric for performance-vulnerability analysis of parallel programs, reliability-aware core partitioning schemes for multicore architectures, and critical thread identification for redundant execution in a partial fault tolerance method.
Mezzo is a typed programming language in the ML family whose static discipline controls aliasing and ownership. This rules out certain mistakes, including representation exposure and data races, and opens up new opportunities, such as gradual initialization or (more generally, and somewhat speculatively) the definition and enforcement of object protocols.
In this talk, I will explain the basic design of Mezzo on examples, and show how its type system gives a simple way of tracking ownership of fragments of shared memory between concurrent threads. Beyond the usual claim that "well-typed programs do not go wrong", we guarantee that well-typed Mezzo programs have no data-races.
The presentation will have two parts. In the first part, we will present an on-going research project to provide a run-time system with "guidelines", using task graph properties, and a static analysis phase to prioritize certain tasks of a given application over the others. The order in which tasks of an application task graph are considered has a significant influence on the resulting schedule length. Gauging the importance of the tasks with a priority scheme is therefore a fundamental part of scheduling schemes. We have used the concept of paths and path lengths to gauge the importance of tasks in an offline static analysis phase and provide guidelines to the run-time system for priority assignment.
In the second part of my presentation, we will present earlier research results that focuses on power/energy efficient real-time scheduling. We will discuss at least one such technique to elaborate the concept.
Cohabitation PharOS/EmbeddedC dans une application de contrôle/commande de moteur électrique pour du steer-by-wire avec retour d'effort Démonstration.