Home Why is reading lines from stdin much slower in C++ than Python?
Reply: 8

Why is reading lines from stdin much slower in C++ than Python?

Eric
1#
Eric Published in 2018-01-30 02:38:39Z

I wanted to compare reading lines of string input from stdin using Python and C++ and was shocked to see my C++ code run an order of magnitude slower than the equivalent Python code. Since my C++ is rusty and I'm not yet an expert Pythonista, please tell me if I'm doing something wrong or if I'm misunderstanding something.


(TLDR answer: include the statement: cin.sync_with_stdio(false) or just use fgets instead.

TLDR results: scroll all the way down to the bottom of my question and look at the table.)


C++ code:

#include <iostream>
#include <time.h>

using namespace std;

int main() {
    string input_line;
    long line_count = 0;
    time_t start = time(NULL);
    int sec;
    int lps;

    while (cin) {
        getline(cin, input_line);
        if (!cin.eof())
            line_count++;
    };

    sec = (int) time(NULL) - start;
    cerr << "Read " << line_count << " lines in " << sec << " seconds.";
    if (sec > 0) {
        lps = line_count / sec;
        cerr << " LPS: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

// Compiled with:
// g++ -O3 -o readline_test_cpp foo.cpp

Python Equivalent:

#!/usr/bin/env python
import time
import sys

count = 0
start = time.time()

for line in  sys.stdin:
    count += 1

delta_sec = int(time.time() - start_time)
if delta_sec >= 0:
    lines_per_sec = int(round(count/delta_sec))
    print("Read {0} lines in {1} seconds. LPS: {2}".format(count, delta_sec,
       lines_per_sec))

Here are my results:

$ cat test_lines | ./readline_test_cpp
Read 5570000 lines in 9 seconds. LPS: 618889

$cat test_lines | ./readline_test.py
Read 5570000 lines in 1 seconds. LPS: 5570000

Edit: I should note that I tried this both under Mac OS X v10.6.8 (Snow Leopard) and Linux 2.6.32 (Red Hat Linux 6.2). The former is a MacBook Pro, and the latter is a very beefy server, not that this is too pertinent.

Edit 2: (Removed this edit, as no longer applicable)

$ for i in {1..5}; do echo "Test run $i at `date`"; echo -n "CPP:"; cat test_lines | ./readline_test_cpp ; echo -n "Python:"; cat test_lines | ./readline_test.py ; done
Test run 1 at Mon Feb 20 21:29:28 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 2 at Mon Feb 20 21:29:39 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 3 at Mon Feb 20 21:29:50 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 4 at Mon Feb 20 21:30:01 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 5 at Mon Feb 20 21:30:11 EST 2012
CPP:   Read 5570001 lines in 10 seconds. LPS: 557000
Python:Read 5570000 lines in  1 seconds. LPS: 5570000

Edit 3:

Okay, I tried J.N.'s suggestion of trying having Python store the line read: but it made no difference to python's speed.

I also tried J.N.'s suggestion of using scanf into a char array instead of getline into a std::string. Bingo! This resulted in equivalent performance for both Python and C++. (3,333,333 LPS with my input data, which by the way are just short lines of three fields each, usually about 20 characters wide, though sometimes more).

Code:

char input_a[512];
char input_b[32];
char input_c[512];
while(scanf("%s %s %s\n", input_a, input_b, input_c) != EOF) {
    line_count++;
};

Speed:

$ cat test_lines | ./readline_test_cpp2
Read 10000000 lines in 3 seconds. LPS: 3333333
$ cat test_lines | ./readline_test2.py
Read 10000000 lines in 3 seconds. LPS: 3333333

(Yes, I ran it several times.) So, I guess I will now use scanf instead of getline. But, I'm still curious if people think this performance hit from std::string/getline is typical and reasonable.

Edit 4 (was: Final Edit / Solution):

Adding:

cin.sync_with_stdio(false);

Immediately above my original while loop above results in code that runs faster than Python.

New performance comparison (this is on my 2011 MacBook Pro), using the original code, the original with the sync disabled, and the original Python code, respectively, on a file with 20M lines of text. Yes, I ran it several times to eliminate disk caching confound.

$ /usr/bin/time cat test_lines_double | ./readline_test_cpp
       33.30 real         0.04 user         0.74 sys
Read 20000001 lines in 33 seconds. LPS: 606060
$ /usr/bin/time cat test_lines_double | ./readline_test_cpp1b
        3.79 real         0.01 user         0.50 sys
Read 20000000 lines in 4 seconds. LPS: 5000000
$ /usr/bin/time cat test_lines_double | ./readline_test.py
        6.88 real         0.01 user         0.38 sys
Read 20000000 lines in 6 seconds. LPS: 3333333

Thanks to @Vaughn Cato for his answer! Any elaboration people can make or good references people can point to as to why this synchronisation happens, what it means, when it's useful, and when it's okay to disable would be greatly appreciated by posterity. :-)

Edit 5 / Better Solution:

As suggested by Gandalf The Gray below, gets is even faster than scanf or the unsynchronized cin approach. I also learned that scanf and gets are both UNSAFE and should NOT BE USED due to potential of buffer overflow. So, I wrote this iteration using fgets, the safer alternative to gets. Here are the pertinent lines for my fellow noobs:

char input_line[MAX_LINE];
char *result;

//<snip>

while((result = fgets(input_line, MAX_LINE, stdin )) != NULL)
    line_count++;
if (ferror(stdin))
    perror("Error reading stdin.");

Now, here are the results using an even larger file (100M lines; ~3.4 GB) on a fast server with very fast disk, comparing the Python code, the unsynchronised cin, and the fgets approaches, as well as comparing with the wc utility. [The scanf version segmentation faulted and I don't feel like troubleshooting it.]:

$ /usr/bin/time cat temp_big_file | readline_test.py
0.03user 2.04system 0:28.06elapsed 7%CPU (0avgtext+0avgdata 2464maxresident)k
0inputs+0outputs (0major+182minor)pagefaults 0swaps
Read 100000000 lines in 28 seconds. LPS: 3571428

$ /usr/bin/time cat temp_big_file | readline_test_unsync_cin
0.03user 1.64system 0:08.10elapsed 20%CPU (0avgtext+0avgdata 2464maxresident)k
0inputs+0outputs (0major+182minor)pagefaults 0swaps
Read 100000000 lines in 8 seconds. LPS: 12500000

$ /usr/bin/time cat temp_big_file | readline_test_fgets
0.00user 0.93system 0:07.01elapsed 13%CPU (0avgtext+0avgdata 2448maxresident)k
0inputs+0outputs (0major+181minor)pagefaults 0swaps
Read 100000000 lines in 7 seconds. LPS: 14285714

$ /usr/bin/time cat temp_big_file | wc -l
0.01user 1.34system 0:01.83elapsed 74%CPU (0avgtext+0avgdata 2464maxresident)k
0inputs+0outputs (0major+182minor)pagefaults 0swaps
100000000


Recap (lines per second):
python:         3,571,428
cin (no sync): 12,500,000
fgets:         14,285,714
wc:            54,644,808

As you can see, fgets is better, but still pretty far from wc performance; I'm pretty sure this is due to the fact that wc examines each character without any memory copying. I suspect that, at this point, other parts of the code will become the bottleneck, so I don't think optimizing to that level would even be worthwhile, even if possible (since, after all, I actually need to store the read lines in memory).

Also note that a small tradeoff with using a char * buffer and fgets vs. unsynchronised cin to string is that the latter can read lines of any length, while the former requires limiting input to some finite number. In practice, this is probably a non-issue for reading most line-based input files, as the buffer can be set to a very large value that would not be exceeded by valid input.

This has been educational. Thanks to all for your comments and suggestions.

Edit 6:

As suggested by J.F. Sebastian in the comments below, the GNU wc utility uses plain C read() (within the safe-read.c wrapper) to read chunks (of 16k bytes) at a time and count new lines. Here's a Python equivalent based on J.F.'s code (just showing the relevant snippet that replaces the Python for loop:

BUFFER_SIZE = 16384
count = sum(chunk.count('\n') for chunk in iter(partial(sys.stdin.read, BUFFER_SIZE), ''))

The performance of this version is quite fast (though still a bit slower than the raw C wc utility, of course):

$ /usr/bin/time cat temp_big_file | readline_test3.py
0.01user 1.16system 0:04.74elapsed 24%CPU (0avgtext+0avgdata 2448maxresident)k
0inputs+0outputs (0major+181minor)pagefaults 0swaps
Read 100000000 lines in 4.7275 seconds. LPS: 21152829

Again, it's a bit silly for me to compare C++ fgets/cin and the first python code on the one hand to wc -l and this last Python snippet on the other, as the latter two don't actually store the read lines, but merely count newlines. Still, it's interesting to explore all the different implementations and think about the performance implications. Thanks again!

Edit 7: Tiny benchmark addendum and recap

For completeness, I thought I'd update the read speed for the same file on the same box with the original (synced) C++ code. Again, this is for a 100M line file on a fast disk. Here's the complete table now:

Implementation      Lines per second
python (default)           3,571,428
cin (default/naive)          819,672
cin (no sync)             12,500,000
fgets                     14,285,714
wc (not fair comparison)  54,644,808
Eric
2#
Eric Reply to 2018-01-30 02:38:39Z

By default, cin is synchronized with stdio, which causes it to avoid any input buffering. If you add this to the top of your main, you should see much better performance:

std::ios_base::sync_with_stdio(false);

Normally, when an input stream is buffered, instead of reading one character at a time, the stream will be read in larger chunks. This reduces the number of system calls, which are typically relatively expensive. However, since the FILE* based stdio and iostreams often have separate implementations and therefore separate buffers, this could lead to a problem if both were used together. For example:

int myvalue1;
cin >> myvalue1;
int myvalue2;
scanf("%d",&myvalue2);

If more input was read by cin than it actually needed, then the second integer value wouldn't be available for the scanf function, which has its own independent buffer. This would lead to unexpected results.

To avoid this, by default, streams are synchronized with stdio. One common way to achieve this is to have cin read each character one at a time as needed using stdio functions. Unfortunately, this introduces a lot of overhead. For small amounts of input, this isn't a big problem, but when you are reading millions of lines, the performance penalty is significant.

Fortunately, the library designers decided that you should also be able to disable this feature to get improved performance if you knew what you were doing, so they provided the sync_with_stdio method.

Peter Mortensen
3#
Peter Mortensen Reply to 2014-04-02 18:25:14Z

A first element of an answer: <iostream> is slow. Damn slow. I get a huge performance boost with scanf as in the below, but it is still two times slower than Python.

#include <iostream>
#include <time.h>
#include <cstdio>

using namespace std;

int main() {
    char buffer[10000];
    long line_count = 0;
    time_t start = time(NULL);
    int sec;
    int lps;

    int read = 1;
    while(read > 0) {
        read = scanf("%s", buffer);
        line_count++;
    };
    sec = (int) time(NULL) - start;
    line_count--;
    cerr << "Saw " << line_count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = line_count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } 
    else
        cerr << endl;
    return 0;
}
davinchi
4#
davinchi Reply to 2012-02-21 03:32:17Z

In your second example (with scanf()) reason why this is still slower might be because scanf("%s") parses string and looks for any space char (space, tab, newline).

Also, yes, CPython does some caching to avoid harddisk reads.

Peter Mortensen
5#
Peter Mortensen Reply to 2014-04-02 18:26:00Z

I reproduced the original result on my computer using g++ on a Mac.

Adding the following statements to the C++ version just before the while loop brings it inline with the Python version:

std::ios_base::sync_with_stdio(false);
char buffer[1048576];
std::cin.rdbuf()->pubsetbuf(buffer, sizeof(buffer));

sync_with_stdio improved speed to 2 seconds, and setting a larger buffer brought it down to 1 second.

José Ernesto Lara Rodríguez
6#
José Ernesto Lara Rodríguez Reply to 2017-08-30 00:16:25Z

Well, I see that in your second solution you switched from cin to scanf, which was the first suggestion I was going to make you (cin is sloooooooooooow). Now, if you switch from scanf to fgets, you would see another boost in performance: fgets is the fastest C++ function for string input.

BTW, didn't know about that sync thing, nice. But you should still try fgets.

Gregg
7#
Gregg Reply to 2012-03-12 04:04:01Z

By the way, the reason the line count for the C++ version is one greater than the count for the Python version is that the eof flag only gets set when an attempt is made to read beyond eof. So the correct loop would be:

while (cin) {
    getline(cin, input_line);

    if (!cin.eof())
        line_count++;
};
Peter Mortensen
8#
Peter Mortensen Reply to 2014-04-02 18:27:57Z

Just out of curiosity I've taken a look at what happens under the hood, and I've used dtruss/strace on each test.

C++

./a.out < in
Saw 6512403 lines in 8 seconds.  Crunch speed: 814050

syscalls sudo dtruss -c ./a.out < in

CALL                                        COUNT
__mac_syscall                                   1
<snip>
open                                            6
pread                                           8
mprotect                                       17
mmap                                           22
stat64                                         30
read_nocancel                               25958

Python

./a.py < in
Read 6512402 lines in 1 seconds. LPS: 6512402

syscalls sudo dtruss -c ./a.py < in

CALL                                        COUNT
__mac_syscall                                   1
<snip>
open                                            5
pread                                           8
mprotect                                       17
mmap                                           21
stat64                                         29
Azeem
9#
Azeem Reply to 2018-01-18 11:37:10Z

getline, stream operators, scanf, can be convenient if you don't care about file loading time or if you are loading small text files. But, if the performance is something you care about, you should really just buffer the entire file into memory (assuming it will fit).

Here's an example:

//open file in binary mode
std::fstream file( filename, std::ios::in|::std::ios::binary );
if( !file ) return NULL;

//read the size...
file.seekg(0, std::ios::end);
size_t length = (size_t)file.tellg();
file.seekg(0, std::ios::beg);

//read into memory buffer, then close it.
char *filebuf = new char[length+1];
file.read(filebuf, length);
filebuf[length] = '\0'; //make it null-terminated
file.close();

If you want, you can wrap a stream around that buffer for more convenient access like this:

std::istrstream header(&buffer[0], length);

Also, if you are in control of the file, consider using a flat binary data format instead of text. It's more reliable to read and write because you don't have to deal with all the ambiguities of whitespace. It's also smaller and much faster to parse.

You need to login account before you can post.

About| Privacy statement| Terms of Service| Advertising| Contact us| Help| Sitemap|
Processed in 0.657708 second(s) , Gzip On .

© 2016 Powered by mzan.com design MATCHINFO