Home File format optimized for sparse matrix exchange

# File format optimized for sparse matrix exchange

gc5
1#
gc5 Published in 2018-02-13 16:08:43Z
 I want to save a sparse matrix of numbers (integers, but it could be floats) to a file for data exchange. For sparse matrix I mean a matrix where a high percentage of values (typically 90%) are equal to 0. Sparse in this case does not relate to the file format but to the actual content of the matrix. The matrix is formatted in the following way:  col1 col2 .... row1 int1_1 int1_2 .... row2 int2_1 .... .... .... .... .... ....  By using a text file (tab-delimited) the size of the file is 4.2G. Which file format, preferably ubiquitous such as a .txt file, can I use to easily load and save this sparse data matrix? We usually work with Python/R/Matlab, so formats that are supported by these are preferred.
gc5
2#
gc5 Reply to 2018-02-14 01:53:21Z

I found the Feather format (which currently does not support Matlab, afaik).

Some comparison on reading and writing, and memory performance in Pandas is provided in this section.

It provides also support for the Julia language.

### Edit:

I found that this format in my case uses more disk space than the .txt one, probably to increase performance in I/O. Compressing with zip alleviates the problem but compression during writing seems to not be supported yet.

Wybird666
3#
Wybird666 Reply to 2018-02-14 01:27:00Z

You have several solutions, but generally what you need to do it output the indices of the non-zero elements as well as the values. Lets assume that you want to export to a single text file.

## Generate array

Lets first generate a 10000 x 5000 sparse array with ~10% filled (it will be a bit less due to replicated indices):

N = 10000;
M = 5000;
rho = .1;
rN = ceil(sqrt(rho)*N);
rM = ceil(sqrt(rho)*M);
S = sparse(N, M);
S(randi(N, [rN 1]), randi(M, [rM 1])) = randi(255, rN, rM);


If your array is not stored as a sparse array, you can create it simply using (where M is the full array):

S = sparse(M);


## Save as text file

Now we will save the matrix in the following format row_indx col_indx value row_indx col_indx value row_indx col_indx value

This is done by extracting the row and column indices as well as data values and then saving it to a text file in a loop:

[n, m, s] = find(S);
fid = fopen('Sparse.txt', 'wt');
arrayfun(@(n, m, s) fprintf(fid, '%d\t%d\t%d\n', n, m, s), n, m, s);
fclose(fid);


If the underlying data is not an integer, then you can use the %f flag on the last output, e.g. (saved with 15 decimal places)

arrayfun(@(n, m, s) fprintf(fid, '%d\t%d\t%.15f\n', n, m, s), n, m, s);


Compare this to the full array:

fid = fopen('Full.txt', 'wt');
arrayfun(@(n) fprintf(fid, '%s\n', num2str(S(n, :))), (1:N).');
fclose(fid);


In this case, the sparse file is ~50MB and the full file ~170MB representing a factor of 3 efficiency. This is expected since I need to save 3 numbers for every nonzero element of the array, and ~10% of the array is filled, requiring ~30% as many numbers to be saved compared to the full array.

For floating point format, the saving is larger since the size of the indices compared to the floating point value is much smaller.

In Matlab, a quick way to extract the data would be to save the string given by:

mat2str(S)


This is essentially the same but wraps it in the sparse command for easy loading in Matlab - one would need to parse this in other languages to be able to read it in. The command tells you how to recreate the array, implying you may need to store the size of the matrix in the file as well (I recommend doing it in the first line since you can read this in and create the sparse matrix before parsing the rest of the file.

## Save as binary file

A much more efficient method is to save as a binary file. Assuming the data and indices can be stored as unsigned 16 bit integers you can do the following:

[n, m, s] = find(S);
fid = fopen('Sparse.dat', 'w');
fwrite(fid, size(S), 'uint16');
fwrite(fid, [n m s], 'uint16');
fclose(fid);


Then to read the data:

fid = fopen('Sparse.dat', 'r');
sz = fread(fid, 2, 'uint16');
s = reshape(fread(fid, 'uint16'), [], 3);
s = sparse(s(:, 1), s(:, 2), s(:, 3), sz(1), sz(2));
fclose(fid);


Now we can check they are equal:

isequal(S, s)


Saving the full array:

fid = fopen('Full.dat', 'w');
fwrite(fid, full(S), 'uint16');
fclose(fid);


Comparing the sparse and full file sizes I get 21MB and 95MB.

A couple of notes:

1. Using a single write/read command is much (much much) quicker than looping, so the last method is by far the fastest, and also most space efficient.
2. The maximum index/data value size that can be saved as a binary integer is 2^n - 1, where n is the bitdepth. In my example of 16 bits (uint16), that corresponds to a range of 0..65,535. By the sounds of it, you may need to use 32 bits or even 64 bits just to store the indices.
3. Higher efficiency can be obtained by saving the indices as one data type (e.g. uint32) and the actual values as another (e.g. uint8). However, this adds additional complexity in the saving and reading.
4. You will still want to store the matrix size first, as I showed in the binary example.
5. You can store the values as doubles if required, but indices should always be integers. Again, extra complexity, but doable.
 You need to login account before you can post.
Processed in 0.299853 second(s) , Gzip On .