Use a larger (or even configurable) buffer size in filecmp for performance #150539

@zahlman

Description

Feature or enhancement

Proposal:

Currently, filecmp.cmp(f1, f2, shallow=False) reads both files in chunks of BUFSIZE = 8 * 1024 bytes (i.e. 8 KiB). A previous suggestion (#83370) to unify this with io.DEFAULT_BUFFER_SIZE was rejected; it appears that the two values are equal only by coincidence.

I propose to use a considerably larger buffer here. Even if the OS is fundamentally reading the file a page at a time, looping over the file in such small chunks in Python code adds per-chunk overhead that slows everything down. For example, on my system:

$ wc -c testfile.bin 
212699823 testfile.bin

$ for i in $(seq 12 25); do (size=$(echo "2 ^ $i" | bc); echo "Buffer size: $size bytes"; python -m timeit "with open('testfile.bin', 'rb') as f:" "    while f.read($size): pass"; echo); done
Buffer size: 4096 bytes
5 loops, best of 5: 83.8 msec per loop

Buffer size: 8192 bytes
5 loops, best of 5: 52 msec per loop

Buffer size: 16384 bytes
10 loops, best of 5: 38.3 msec per loop

Buffer size: 32768 bytes
10 loops, best of 5: 29.6 msec per loop

Buffer size: 65536 bytes
10 loops, best of 5: 26.3 msec per loop

Buffer size: 131072 bytes
10 loops, best of 5: 25.8 msec per loop

Buffer size: 262144 bytes
10 loops, best of 5: 25.5 msec per loop

Buffer size: 524288 bytes
10 loops, best of 5: 25.1 msec per loop

Buffer size: 1048576 bytes
10 loops, best of 5: 25.1 msec per loop

Buffer size: 2097152 bytes
10 loops, best of 5: 26.1 msec per loop

Buffer size: 4194304 bytes
10 loops, best of 5: 29.1 msec per loop

Buffer size: 8388608 bytes
10 loops, best of 5: 33.2 msec per loop

Buffer size: 16777216 bytes
10 loops, best of 5: 34.4 msec per loop

Buffer size: 33554432 bytes
2 loops, best of 5: 110 msec per loop

Even though the disk/OS page size is 4 KiB, reading in 8 KiB chunks (as filecmp already does) is faster than reading in page-sized chunks. Larger chunks are consistently faster still: compared with 8 KiB, reads are as much as 2x faster with chunks of at least 64 KiB, and the timings keep improving slightly up to about 512 KiB. Nowadays these are still tiny amounts of memory to use for a temporary buffer. (For completeness, I include the sizes above 1 MiB, where times creep back up, and the 32 MiB case, where performance drops sharply; it seems Linux has switched to a different strategy for the underlying malloc() at that size.)
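The shell loop above can also be written as a small Python harness, which may be easier to rerun on other systems. This is only a sketch: the names time_chunked_read and sweep are made up for illustration, and it uses time.perf_counter rather than timeit, so absolute numbers will not match the figures above exactly.

```python
import time


def time_chunked_read(path, bufsize, repeats=3):
    """Return the best wall-clock time (seconds) to read `path` in `bufsize`-byte chunks."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, "rb") as f:
            # Same loop as the timeit statement: read until EOF, discard the data.
            while f.read(bufsize):
                pass
        best = min(best, time.perf_counter() - start)
    return best


def sweep(path, exponents=range(12, 26)):
    """Map each buffer size 2**e (4 KiB to 32 MiB by default) to its best read time."""
    return {2 ** e: time_chunked_read(path, 2 ** e) for e in exponents}


# Example use (hypothetical file name):
#     for size, seconds in sweep("testfile.bin").items():
#         print(f"Buffer size: {size} bytes, best: {seconds * 1e3:.1f} msec")
```

As with the shell version, results depend heavily on whether the file is already in the OS page cache, so the first pass over a cold file should be treated with suspicion.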


Rather than just hard-coding a buffer size, it could also be provided as a keyword-only parameter:

def cmp(f1, f2, shallow=True, *, bufsize=BUFSIZE):
    # ...
    if outcome is None:
        outcome = _do_cmp(f1, f2, bufsize)
    # ...

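def _do_cmp(f1, f2, bufsize):
    # as before, but without the `bufsize = BUFSIZE` assignment from the global
    with open(f1, 'rb') as fp1, open(f2, 'rb') as fp2:
        while True:
            b1 = fp1.read(bufsize)
            b2 = fp2.read(bufsize)
            if b1 != b2:
                return False
            if not b1:
                return True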

But since different designs are possible, I wanted to discuss before submitting a PR.
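For comparison, one design that needs no API change is to override the module-level filecmp.BUFSIZE temporarily. The sketch below assumes an implementation detail of current CPython, namely that _do_cmp reads the global BUFSIZE at call time; the context manager name filecmp_bufsize is hypothetical. Because it mutates a module global, it is not thread-safe, which is one argument for an explicit keyword parameter instead.

```python
import contextlib
import filecmp


@contextlib.contextmanager
def filecmp_bufsize(size):
    """Temporarily set filecmp.BUFSIZE (relies on a CPython implementation detail)."""
    old = filecmp.BUFSIZE
    filecmp.BUFSIZE = size
    try:
        yield
    finally:
        # Always restore the previous value, even if the comparison raises.
        filecmp.BUFSIZE = old


# Example use (hypothetical file names):
#     with filecmp_bufsize(64 * 1024):
#         same = filecmp.cmp("a.bin", "b.bin", shallow=False)
```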

If there are other places in the standard library that iterate over a file(-like object) in chunks, this should probably be considered there as well.
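One such place appears to be shutil.copyfileobj, which already exposes its chunk size through the documented length argument, so it may serve as precedent for a configurable size here. A minimal sketch follows; the wrapper name copy_in_chunks and the 256 KiB figure are illustrative choices, not recommendations from the library.

```python
import io
import shutil


def copy_in_chunks(src, dst, chunk_size=256 * 1024):
    """Copy one file-like object to another, reading `chunk_size` bytes at a time."""
    # `length` is the documented chunk-size argument of shutil.copyfileobj.
    shutil.copyfileobj(src, dst, length=chunk_size)


# Example use with in-memory streams:
source = io.BytesIO(b"x" * 1_000_000)
target = io.BytesIO()
copy_in_chunks(source, target)
```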

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

Metadata

Labels

pending: The issue will be closed if no feedback is provided
stdlib: Standard Library Python modules in the Lib/ directory
type-feature: A feature request or enhancement
