Harvesting the power of modern graphics hardware to solve the complex problem of real-time rendering of large unstructured meshes is a major research goal in the volume visualization community. While, for regular grids, texture-based techniques are well-suited for current GPUs, the steps necessary for rendering unstructured meshes are not so easily mapped to current hardware. We propose a novel volume rendering technique that simplifies the CPU-based processing and shifts much of the sorting burden to the GPU, where it can be performed more efficiently. Our hardware-assisted visibility sorting algorithm is a hybrid technique that operates in both object-space and image-space. In object-space, the algorithm performs a partial sort of the 3D primitives in preparation for rasterization. The goal of the partial sort is to create a list of primitives that generate fragments in nearly sorted order. In image-space, the fragment stream is incrementally sorted using a fixed-depth sorting network. In our algorithm, the object-space work is performed by the CPU and the fragment-level sorting is done completely on the GPU. A prototype implementation of the algorithm demonstrates that the fragment-level sorting achieves rendering rates of between one and six million tetrahedral cells per second on an ATI Radeon 9800.
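The key invariant above (fragments arrive nearly sorted, so a fixed-depth structure can finish the sort) can be illustrated with a small CPU sketch. This is an illustrative stand-in for the paper's GPU sorting network, not its implementation; the bound k is an assumed parameter.

```python
import heapq

def fixed_depth_sort(fragments, k):
    """Sort a nearly-sorted stream of depth values with a bounded
    buffer of size k+1. Correct whenever every fragment arrives at
    most k positions from its sorted position -- the invariant the
    object-space partial sort is meant to establish."""
    buffer = []
    for depth in fragments:
        heapq.heappush(buffer, depth)
        if len(buffer) > k:
            # The smallest buffered depth can no longer be displaced.
            yield heapq.heappop(buffer)
    while buffer:
        yield heapq.heappop(buffer)
```

Because the buffer never grows beyond k+1 entries, the per-fragment work is O(log k) regardless of stream length, which is what makes a fixed-depth scheme attractive for hardware.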
Master's thesis - Universidade Federal de Santa Catarina, Centro Tecnológico, Graduate Program in Electrical Engineering, Florianópolis, 2014; The MPEG-2 Transport Stream (TS) is a format widely used in digital TV systems for the transmission of audio, video, and program-related information. Among other information, a transport stream carries a time reference, known as the Program Clock Reference (PCR), which is a snapshot of the system's 27 MHz clock. This information allows clock recovery at the receivers, which guarantees the correct presentation of the content and even drives output interfaces. However, if the arrival time of the transport packets varies during transmission or processing, this can lead to errors in the system clock, which is known as jitter. Traditional methods for correcting the program clock information are usually based on floating-point 27 MHz counters/accumulators, but they do not mitigate PCR jitter completely. More recent methods use a semaphore-controlled counter/accumulator, and some even propose a rate-adaptation scheme integrated with the clock-reference correction...
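The notion of PCR jitter above can be made concrete with a standard measurement sketch (this is the usual accuracy check implied by MPEG-2 Systems, not the thesis's correction method): compare each received PCR against the value an ideal 27 MHz clock would predict from the packet's arrival time.

```python
def pcr_jitter(samples):
    """Each sample is (arrival_time_s, pcr_ticks), with PCR expressed
    in ticks of the 27 MHz system clock. The jitter of each sample is
    the deviation of the received PCR from the tick count an ideal
    27 MHz clock would accumulate since the first packet."""
    t0, pcr0 = samples[0]
    return [(pcr - pcr0) - 27_000_000 * (t - t0) for t, pcr in samples]
```

A nonzero, varying result indicates packet-arrival variation of exactly the kind the thesis's correction schemes target.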
The development of a population PK/PD model, an essential component of model-based drug development, is both time- and labor-intensive. Graphics-processing unit (GPU) computing technology has been proposed and used to accelerate many scientific computations. The objective of this study was to develop a hybrid GPU–CPU implementation of the parallelized Monte Carlo parametric expectation maximization (MCPEM) estimation algorithm for population PK data analysis. A hybrid GPU–CPU implementation of the MCPEM algorithm (MCPEMGPU) and an identical algorithm designed for a single CPU (MCPEMCPU) were developed using MATLAB on a single computer equipped with dual Xeon 6-Core E5690 CPUs and an NVIDIA Tesla C2070 GPU parallel computing card containing 448 stream processors. Two different PK models with rich/sparse sampling design schemes were used to simulate population data for assessing the performance of MCPEMCPU and MCPEMGPU. Results were analyzed by comparing the parameter estimates and model computation times. A speedup factor was used to assess the relative benefit of the parallelized MCPEMGPU over MCPEMCPU in shortening model computation time. The MCPEMGPU consistently achieved shorter computation times than the MCPEMCPU and can offer more than a 48-fold speedup using a single GPU card. The novel hybrid GPU–CPU implementation of the parallelized MCPEM algorithm developed in this study holds great promise as the core for the next generation of modeling software for population PK/PD analysis.
We present a stream algorithm for the Singular-Value Decomposition (SVD) of an M × N matrix A. Our algorithm trades speed of numerical convergence for parallelism, and derives from a one-sided, cyclic-by-rows Hestenes SVD. Experimental results show that we can create O(M) parallelism, at the expense of increasing the computational work by less than a factor of about 2. Our algorithm qualifies as a stream algorithm in that it requires no more than a small, bounded amount of local storage per processor, and its compute efficiency approaches an optimal 100% asymptotically for large numbers of processors and appropriate problem sizes.
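The one-sided, cyclic-by-rows Hestenes SVD the abstract starts from can be sketched serially in a few lines (a minimal NumPy sketch of the classical method, not the paper's parallel stream formulation): rotate column pairs until all columns are mutually orthogonal; the singular values are then the column norms.

```python
import numpy as np

def hestenes_singular_values(A, sweeps=30, eps=1e-12):
    """One-sided, cyclic-by-rows Hestenes SVD (serial sketch).
    Each sweep visits every column pair (p, q) and applies a plane
    rotation that makes the two columns orthogonal."""
    U = A.astype(float).copy()
    m, n = U.shape
    for _ in range(sweeps):
        converged = True
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = U[:, p] @ U[:, p]
                beta = U[:, q] @ U[:, q]
                gamma = U[:, p] @ U[:, q]
                if abs(gamma) > eps * np.sqrt(alpha * beta):
                    converged = False
                    zeta = (beta - alpha) / (2.0 * gamma)
                    sgn = 1.0 if zeta >= 0 else -1.0
                    t = sgn / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                    c = 1.0 / np.sqrt(1.0 + t * t)
                    s = c * t
                    rot = np.array([[c, s], [-s, c]])
                    U[:, [p, q]] = U[:, [p, q]] @ rot
        if converged:
            break
    return np.linalg.norm(U, axis=0)  # singular values of A
```

The parallelism the paper extracts comes from reordering these pairwise rotations so that independent pairs proceed concurrently, at the cost of slower convergence, which is the trade-off the abstract quantifies.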
Journal Paper; Stream processing promises to bridge the gap between inflexible special-purpose solutions and current programmable architectures that cannot meet the computational demands of media-processing applications.
Journal Paper; The Power-efficient Imagine stream processor achieves performance densities comparable to those of special-purpose embedded processors. Executing programs mapped to streams and kernels, a single Imagine processor is expected to have a peak performance of 20 Gflops and sustain 18.3 GOPS on MPEG-2 encoding.
Tech Report; This paper presents the design and use of reconfigurable stream processors for the physical layer processing in wireless base-stations. Stream processors, traditionally used for high performance media processing, use clusters of functional units to provide support for hundreds of functional units in a programmable architecture. We provide hardware support for reconfiguration in stream processors, enabling them to be power-efficient by adapting to the compute requirements of the application. We demonstrate the real-time implementation of a 32-user wireless base-station, employing multiuser channel estimation, multiuser detection and Viterbi decoding physical layer algorithms, supporting a data rate of 128 Kbps/user. The reconfigurable stream processor runs at 1.2 GHz and has an estimated power consumption of 12.38 W at full workload. However, basestations rarely operate at full capacity. When the base-station workload decreases, the reconfigurable stream processor adapts the number of clusters, functional units, voltage and frequency dynamically for power efficiency. When the application workload changes to 4 users, the reconfiguration support reduces the power to 300 mW at 433 MHz, providing a 41.27X decrease in power consumption. The cluster reconfiguration yields an additional 15-85% power savings over a stream processor with dynamic voltage and frequency scaling.
Conference Paper; Stream processors support hundreds of functional units in a programmable architecture by clustering functional units and utilizing a bandwidth hierarchy. Clusters are the dominant source of power consumption in stream processors. When the data parallelism falls below the number of clusters, unutilized clusters can be turned off to save power. This paper improves power efficiency in stream processors by dynamically reconfiguring the number of clusters in a stream processor to match the time varying data parallelism of an application. We explore 3 mechanisms for dynamic reconfiguration: using memory, conditional streams and a multiplexer network. A 32-user wireless basestation is a prime example of a workload that benefits from such reconfiguration. When the number of users supported by the basestation dynamically changes from 32 to 4, the reconfiguration from a 32-cluster stream processor to a 4-cluster stream processor yields 15-85% power savings over and above a stream processor that uses conventional power saving techniques such as dynamic voltage and frequency scaling. The dynamic reconfiguration support extends stream processors from traditional high performance applications to power-sensitive applications in which the data parallelism varies dynamically and falls below the number of clusters.
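The savings described in the two abstracts above follow from a first-order CMOS dynamic-power model. The sketch below is an idealized illustration, not the papers' power simulator; the voltage and frequency scale factors are assumptions, and it predicts the upper end of the reported 15-85% range because it ignores reconfiguration overheads and non-cluster power.

```python
def dynamic_power(clusters, voltage, freq, c_eff=1.0):
    """First-order dynamic-power model: P = C_eff * n_clusters * V^2 * f.
    Units are arbitrary; only ratios matter here."""
    return c_eff * clusters * voltage ** 2 * freq

# Full load: 32 clusters at nominal voltage and frequency.
p_full = dynamic_power(32, 1.0, 1.0)

# DVFS alone for a light 4-user load: keep 32 clusters, scale V and f
# (0.6 and 0.4 are assumed scale factors, not figures from the paper).
p_dvfs = dynamic_power(32, 0.6, 0.4)

# Cluster reconfiguration on top of DVFS: also power down 28 idle clusters.
p_reconf = dynamic_power(4, 0.6, 0.4)

# Additional saving of reconfiguration over DVFS alone: 1 - 4/32 = 87.5%.
extra_saving = 1.0 - p_reconf / p_dvfs
```

Under this idealization the cluster count cancels against itself, so the extra saving over DVFS is simply the fraction of clusters turned off.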
Journal Paper; We present a design framework for rapidly exploring the design space for stream processors in real-time embedded systems. Stream processors enable hundreds of arithmetic units in programmable processors by using clusters of functional units. However, to meet a given real-time requirement for an embedded system, there is a trade-off between the number of arithmetic units per cluster, the number of clusters, and the clock frequency, as each solution meets the real-time requirement with a different power consumption. We have developed a design exploration tool that explores this trade-off and presents a heuristic that minimizes the power consumption in the (functional units, clusters, frequency) design space. Our design methodology relates the instruction-level parallelism, subword parallelism, and data parallelism to the organization of the functional units in an embedded stream processor. We show that the power-minimization methodology also provides insights into the functional-unit utilization of the processor. The design exploration tool exploits the static nature of signal-processing workloads, providing an extremely fast design-space exploration and an initial lower-bound estimate of the real-time performance of the embedded processor. A sensitivity analysis of the design-tool results with respect to the technology and modeling also enables the designer to check the robustness of the design exploration.
PhD Thesis; Emerging applications such as high definition television (HDTV), streaming video, image processing in embedded applications and signal processing in high-speed wireless communications are driving a need for high performance digital signal processors (DSPs) with real-time processing. This class of applications demonstrates significant data parallelism, finite precision, need for power-efficiency and the need for 100's of arithmetic units in the DSP to meet real-time requirements. Data-parallel DSPs meet these requirements by employing clusters of functional units, enabling 100's of computations every clock cycle. These DSPs exploit instruction level parallelism and subword parallelism within clusters, similar to a traditional VLIW (Very Long Instruction Word) DSP, and exploit data parallelism across clusters, similar to vector processors. Stream processors are data-parallel DSPs that use a bandwidth hierarchy to support dataflow to 100's of arithmetic units and are used for evaluating the contributions of this thesis. Different software realizations of the dataflow in the algorithms can affect the performance of stream processors by greater than an order-of-magnitude. The thesis first presents the design of signal processing algorithms that map efficiently on stream processors by parallelizing the algorithms and by re-ordering the flow of data. The design space for stream processors also exhibits trade-offs between arithmetic units per cluster...
American Axle & Manufacturing, Inc. (AAM) is still in the process of transitioning to a culture of "lean manufacturing" as opposed to the current culture of "mass production". This thesis involved working with AAM employees and suppliers at various locations to understand how material flows between and within AAM's plants, the reasons for and analysis of the current state of material management, and opportunities for improvement. Attention is also given to the cultural and business context in which this work takes place, and the issues relating to efforts to implement change in large industrial organizations. This work produced two strategic-level products and one tactical-level product to improve lean material management at AAM, described herein. Cultural observations are also provided. At the strategic level, one project focused upon making extended value stream maps of material flow between AAM plants and suppliers/processors. This information allows all decision-makers at AAM to objectively examine a common set of information, information which was previously unavailable to any one individual. Extended value stream mapping allowed supply-chain inventory and lead-time reduction opportunities to be identified.; (cont.) The focus upon extended value streams increased awareness of the need to more fully account for costs in making part procurement decisions. Therefore...
The ability to process large numbers of continuous data streams in a
near-real-time fashion has become a crucial prerequisite for many scientific
and industrial use cases in recent years. While the individual data streams are
usually trivial to process, their aggregated data volumes easily exceed the
scalability of traditional stream processing systems. At the same time,
massively-parallel data processing systems like MapReduce or Dryad currently
enjoy a tremendous popularity for data-intensive applications and have proven
to scale to large numbers of nodes. Many of these systems also provide
streaming capabilities. However, unlike traditional stream processors, these
systems have disregarded QoS requirements of prospective stream processing
applications so far. In this paper we address this gap. First, we analyze
common design principles of today's parallel data processing frameworks and
identify those principles that provide degrees of freedom in trading off the
QoS goals latency and throughput. Second, we propose a highly distributed
scheme which allows these frameworks to detect violations of user-defined QoS
constraints and optimize the job execution without manual interaction. As a
proof of concept, we implemented our approach for our massively-parallel data
processing framework Nephele and evaluated its effectiveness through a
comparison with Hadoop Online. For an example streaming application from the
multimedia domain running on a cluster of 200 nodes...
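The latency/throughput trade-off described above is typically exercised through output buffering: larger batches raise throughput but add queueing delay. The controller below is a hypothetical sketch of such a feedback rule (names and thresholds are assumptions, not the Nephele mechanism), shrinking the batch when a user-defined latency constraint is violated and growing it again when there is slack.

```python
def adapt_batch_size(observed_latency_ms, target_latency_ms, batch_size,
                     min_batch=1, max_batch=1024):
    """Hypothetical QoS controller: trade throughput (batch size)
    against a user-defined end-to-end latency goal."""
    if observed_latency_ms > target_latency_ms:
        # Constraint violated: halve the batch to cut buffering delay.
        batch_size = max(min_batch, batch_size // 2)
    elif observed_latency_ms < 0.5 * target_latency_ms:
        # Ample slack: double the batch to recover throughput.
        batch_size = min(max_batch, batch_size * 2)
    return batch_size
```

Running such a rule independently at each node matches the paper's goal of a highly distributed scheme that needs no manual interaction.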
Langevin Dynamics, Monte Carlo, and all-atom Molecular Dynamics simulations
in implicit solvent, widely used to access the microscopic transitions in
biomolecules, require a reliable source of random numbers. Here we present the
two main approaches for implementation of random number generators (RNGs) on a
GPU, which enable one to generate random numbers on the fly. In the
one-RNG-per-thread approach, inherent in CPU-based calculations, one RNG
produces a stream of random numbers in each thread of execution, whereas the
one-RNG-for-all-threads approach builds on the ability of different threads to
communicate, thus, sharing random seeds across the entire GPU device. We
exemplify the use of these approaches through the development of Ran2, Hybrid
Taus, and Lagged Fibonacci algorithms fully implemented on the GPU. As an
application-based test of randomness, we carry out LD simulations of N
independent harmonic oscillators coupled to a stochastic thermostat. This model
allows us to assess statistical quality of random numbers by comparing the
simulation output with the exact results that would be obtained with truly
random numbers. We also profile the performance of these generators in terms of
the computational time, memory usage, and the speedup factor (CPU/GPU time).; Comment: 32 pages...
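Of the generators named above, Hybrid Taus is compact enough to sketch in full: three Tausworthe components combined with a linear congruential generator via XOR (the standard formulation popularized for GPUs in GPU Gems 3). This is a plain CPU sketch of the per-thread state update, not the paper's GPU code.

```python
def taus_step(z, s1, s2, s3, m):
    """One Tausworthe component step on 32-bit state z.
    Seeds for the three Tausworthe components must exceed 128."""
    b = (((z << s1) & 0xFFFFFFFF) ^ z) >> s2
    return (((z & m) << s3) & 0xFFFFFFFF) ^ b

def lcg_step(z):
    """Linear congruential component (Numerical Recipes constants)."""
    return (1664525 * z + 1013904223) & 0xFFFFFFFF

def hybrid_taus(state):
    """One Hybrid Taus step: advance all four components and combine
    them by XOR; returns a uniform float in [0, 1) and the new state."""
    z1, z2, z3, z4 = state
    z1 = taus_step(z1, 13, 19, 12, 0xFFFFFFFE)
    z2 = taus_step(z2, 2, 25, 4, 0xFFFFFF8)
    z3 = taus_step(z3, 3, 11, 17, 0xFFFFFFF0)
    z4 = lcg_step(z4)
    return (z1 ^ z2 ^ z3 ^ z4) / 2**32, (z1, z2, z3, z4)
```

In the one-RNG-per-thread approach each GPU thread would hold its own four-word state tuple; in the one-RNG-for-all-threads approach the state is shared through inter-thread communication.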
This paper investigates the operator mapping problem for in-network
stream-processing applications. In-network stream-processing amounts to
applying one or more trees of operators in steady-state, to multiple data
objects that are continuously updated at different locations in the network.
The goal is to compute some final data at some desired rate. Different operator
trees may share common subtrees. Therefore, it may be possible to reuse some
intermediate results in different application trees. The first contribution of
this work is to provide complexity results for different instances of the basic
problem, as well as integer linear program formulations of various problem
instances. The second contribution is the design of several
polynomial-time heuristics. One of the primary objectives of these heuristics
is to reuse intermediate results shared by multiple applications. Our
quantitative comparisons of these heuristics in simulation demonstrate the
importance of choosing appropriate processors for operator mapping. They also
allow us to identify a heuristic that achieves good results in practice.
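The reuse of shared subtrees that the heuristics aim for is essentially memoization over operator trees. The sketch below illustrates the idea (the operators "sum" and "max" are illustrative stand-ins for the paper's stream operators): two application trees that share a subtree trigger only one evaluation of it.

```python
def evaluate(tree, cache, counter):
    """Evaluate an operator tree given as (op, child, child, ...) or a
    leaf value, memoizing shared subtrees so each distinct subtree's
    operator is applied exactly once across all application trees."""
    if not isinstance(tree, tuple):
        return tree                      # leaf: a raw data object
    if tree in cache:
        return cache[tree]               # shared subtree: reuse result
    op, *children = tree
    args = [evaluate(c, cache, counter) for c in children]
    counter[op] = counter.get(op, 0) + 1  # count operator applications
    result = sum(args) if op == "sum" else max(args)
    cache[tree] = result
    return result
```

In the in-network setting, the cache corresponds to placing the shared operator on one processor and forwarding its output to both consumer trees.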
We define representations of continuous functions on infinite streams of
discrete values, both in the case of discrete-valued functions, and in the case
of stream-valued functions. We also define an operation on the representations
of two continuous functions between streams that yields a representation of
their composition. In the case of discrete-valued functions, the representatives are
well-founded (finite-path) trees of a certain kind. The underlying idea can be
traced back to Brouwer's justification of bar-induction, or to Kreisel and
Troelstra's elimination of choice-sequences. In the case of stream-valued
functions, the representatives are non-wellfounded trees pieced together in a
coinductive fashion from well-founded trees. The definition requires an
alternating fixpoint construction of some ubiquity.
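The discrete-valued case above can be made concrete in a small sketch (a Python rendering of the idea, not the paper's type-theoretic construction): a representative is either an answer node or a query node branching on the next stream element, so well-foundedness guarantees that only a finite prefix of the stream is ever inspected, which is exactly continuity.

```python
def apply_rep(tree, stream):
    """Run a well-founded-tree representative of a continuous,
    discrete-valued function on a stream. A tree is either
    ('out', v), answering immediately, or ('ask', branch), where
    branch maps the next stream element to a subtree."""
    i = 0
    while tree[0] == 'ask':
        tree = tree[1][stream[i]]   # consume one more stream element
        i += 1
    return tree[1]

# Representative of "XOR of the first two bits" on bit streams:
# it queries exactly two elements, then answers.
xor_tree = ('ask', {0: ('ask', {0: ('out', 0), 1: ('out', 1)}),
                    1: ('ask', {0: ('out', 1), 1: ('out', 0)})})
```

Since every path from the root is finite, the loop always terminates, mirroring the bar-induction reading mentioned in the abstract.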
We discuss the suitability of spreadsheet processors as tools for programming
streaming systems. We argue that, while spreadsheets can function as powerful
models for stream operators, their fundamental boundedness limits their scope
of application. We propose two extensions to the spreadsheet model and argue
their utility in the context of programming streaming systems.; Comment: In Proceedings of the 2nd Workshop on Software Engineering Methods in
This paper presents a stream processor generator, called SPGen, for
FPGA-based system-on-chip platforms. In our research project, we use an FPGA as
a common platform for applications ranging from HPC to embedded/robotics
computing. Pipelining in application-specific stream processors brings
power-efficient, high-performance computing to FPGAs. However, poor productivity in
developing custom pipelines prevents the reconfigurable platform from being
widely and easily used. SPGen aims at assisting developers to design and
implement high-throughput stream processors by generating their HDL codes with
our domain-specific high-level stream processing description, called SPD. With
an example of fluid dynamics computation, we validate SPD for describing a real
application and verify SPGen for synthesis with a pipelined data-flow graph. We
also demonstrate that SPGen allows us to easily explore a design space for
finding a better implementation than a hand-designed one.; Comment: Presented at First International Workshop on FPGAs for Software
Programmers (FSP 2014) (arXiv:1408.4423)
Stream computing is the use of multiple autonomic and parallel modules
together with integrative processors at a higher level of abstraction to embody
"intelligent" processing. The biological basis of this computing is sketched
and the matter of learning is examined.; Comment: 7 pages, 4 figures
Book Section; peer reviewed; published 2006.
Stream processing systems receive continuous streams
of messages with raw information and produce streams
of messages with processed information. The utility of a
stream-processing system depends, in part, on the accuracy
and timeliness of the output. Streams in complex event processing
systems are processed on distributed systems; several
steps are taken on different processors to process each
incoming message, and messages may be enqueued between
steps. This paper deals with the problems of distributed dynamic
control of streams to optimize the total utility provided
by the system. A challenge of distributed control is
that timeliness of output depends only on the total end-to-end
time and is otherwise independent of the delays at each
separate processor whereas the controller for each processor
takes action to control only the steps on that processor
and cannot directly control the entire network.
This paper identifies key problems in distributed control
and analyzes two scheduling algorithms that help in an initial
analysis of a difficult problem.
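One way a local controller can nonetheless act on the global objective is to order its queue by each message's absolute deadline rather than its local wait. The policy below is a hypothetical illustration of that point, not one of the two algorithms the paper analyzes.

```python
def least_slack_first(messages, now):
    """Pick the next message to process on one processor. Each message
    is (msg_id, entry_time, deadline). The controller cannot observe
    delays at other processors, but the absolute deadline encodes the
    remaining end-to-end budget, so serving the message with the least
    remaining slack (deadline - now) aligns each local decision with
    the system-wide, end-to-end timeliness objective."""
    return min(messages, key=lambda m: m[2] - now)
```

Note that entry_time plays no role in the choice: how long a message has waited locally is irrelevant, only how close it is to its end-to-end deadline, which is precisely the independence property the abstract highlights.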