,
PhD thesis

PhD thesis: Architecture-Dependent Linear Code Optimizations
Author: Ivan Šimeček
Keywords

cache memory, analytical cache model, software transformations, code restructuring, loop unrolling, unroll-and-jam, dynamic loop reversal, Cholesky factorization, Conjugate gradient method, Biconjugate method, register blocking format


Abstract

This thesis is divided into three tightly linked parts:
* analytical models of cache behavior for pseudocodes from the linear algebra,
* source code transformations of linear algebra algorithms,
* optimization of sparse matrix-vector multiplication.

Development of analytical models of cache behavior
We have developed analytical models for the estimation of the numbers of cache misses for some basic pseudocodes from the linear algebra. Using hardware and software cache monitoring tools, we have compared the predicted number of cache misses with real numbers on Intel x86 architecture with L1 and L2 caches. The accuracy of our analytical model is very high, about 90% for the L1 cache and 95% for the L2 cache. The errors in estimations are due to minor simplifying assumptions in our models. The main contributions of this part of the thesis are the following: * demonstration of two types of analytical caches models,
* a detailed cache behavior analysis of some LA pseudocodes.

Source code transformations of linear codes
We focus primarily on detailed performance study of the following algorithms:
* dot product,
* matrix-matrix multiplication,
* the Cholesky factorization,
* sparse matrix-vector multiplication,
* the Conjugate Gradient method,
* the Biconjugate Gradient method.
Straightforward implementations of these algorithms lead to ine±cient codes. Our primary goal is to achieve the performance improvement of these algorithms with respect to their CPU architecture. We demonstrate the e±ciency of this approach on various platforms. Our main focus is to develop highly e±cient codes running on Intel CPUs. The main contributions of this part of the thesis are as follows:
* A detailed study of the mentioned algorithms.
* A demonstration of software transformation techniques.
* A performance study of some software transformation techniques.
* A proposal of new implementation improvements.
* A proposal of a new loop transformation technique called dynamic loop reversal.

Optimization of the sparse matrix-vector multiplication
Many authors have approached the problem of the sparse matrix-vector multiplication optimization from the viewpoint of the matrix storage format transformation, but most of the published works overlooked the problem of the transformation overhead. We have tried to optimize performance of the sparse matrix-vector multiplication, but not at all cost. The transformation (and the resulting speedup) must be su±ciently fast to amortize the overhead of the transformation. The main contributions of this part of this thesis are the following:
* A performance analysis of the sparse matrix-vector multiplication.
* New region-based formats called R-XY and R-CSR.
* A new storage format called CARB designed to accelerate the sparse matrix-vector multiplication.
* A new approach to the acceleration of the sparse matrix-vector multiplication.

 

Download:
final version (in .PDF format)

BibTex entry:
@PHDTHESIS{JA_PHD,
author = {{\v S}ime{\v c}ek, I.},
title = {Architecture-Dependent Linear Code Optimizations},
school = {the Faculty of Electrical Engineering, Czech Technical University in Prague},
year = {2008},
keywords = {cache memory, analytical cache model, software transformations, code restructuring, loop unrolling, unroll-and-jam, dynamic loop reversal, Cholesky factorization, Conjugate gradient method, Biconjugate method, register blocking format},
url = {http://shimi.webzdarma.cz/vyzkum/phd/PhDThesis.pdf},
language = {English}
}

BACK