PHP File Download Hit Counter
Describes technical hurdles while building a file download hit counter in PHP.
Introduction
My cross-platform desktop text editor, KeenWrite, bundles a Java Runtime Environment, JavaFX, and Renjin. While this eases cross-platform development and gives the editor the ability to produce living documents from external data, the self-extracting binaries are hefty. One downside to moving from GitHub to GitLab is that the latter limits the size of file uploads associated with releases, which means KeenWrite is too big to host on GitLab.
As a consequence, there’s no longer a way to display a download count. This post, then, walks through using PHP to create a transparent download hit counter for files hosted on a static web server, without a database or JavaScript.
To see it in action, visit KeenWrite’s homepage and download a version for your operating system, then refresh the homepage.
Audience
Readers will need to know PHP, HTML, regular expressions, bash, AWK, Server-Side Includes (SSI), and how to use .htaccess files. Whether you’re starting out with PHP, have used it for a few years, or simply want to track file downloads, this post is for you.
Requirements
Follow along using:
- Linux
- Apache httpd with SSI
- PHP
Set up and configure the web server to execute both .shtml and .php files.
Behaviour
Generally, I approach problems by writing down the big picture items. In this case, we have:
- Create a download link using regular HTML (no JavaScript).
- When clicked, transparently run a PHP script to transfer a file.
- Upon download completion, increment a counter.
- Have the script avoid counting multiple downloads from the same user.
- Display the total sum of counts on a web page.
From such a small idea (count binary file downloads), a cornucopia of problems spills onto our plate, including:
- Log to a file instead of the browser
- Avoid duplicates using sessions
- Trap session fixation
- Sanitize file names from user input
- Transmit files to be downloaded
- Ensure partial downloads are not counted
- Allow partial downloads to be resumed
- Prevent race conditions when counting
- Honour HTTP HEAD requests
- Control client-side caching
- Mitigate server-side file size caching
- Suppress compression of HTTP responses
There’s a lot to unpack, so let’s take the issues one by one.
Setup
Before we get into the nitty gritty details, there are a few PHP-centric items we’ll want to address.
Source file format
When using PHP, blank lines or spaces before or after the code blocks may interfere with sending headers. Take care when creating a file named count.php so that the contents have no extra whitespace:
<?php
?>
Place the file in a downloads directory on the web server.
We’ll continue adding to the source file as we go.
Diagnostics
Sometimes web sites will burp PHP errors, often due to database problems. Redirecting errors to a log file avoids exposing such details to end users:
<?php
ini_set( 'log_errors', 1 );
ini_set( 'error_log', '/tmp/php-errors.log' );
Monitor /tmp/php-errors.log every so often to make sure no unexpected behaviour is present in the code.
House cleaning
PHP has some quirks that arise from a stateless client-server model, such as addressing denial-of-service attacks by closing the connection after a period of time. To detect cancelled downloads, we’ll need to make sure the script can run to completion. This entails setting an infinite timeout and ignoring connections terminated from the client-side.
We also want to make sure PHP’s output buffer is closed and that we can maintain a session across separate download requests. That session information allows detecting when the same user makes multiple download requests within a certain time period. A session is typically maintained using a short cookie sent to the client from the server.
set_time_limit( 0 );

while( ob_get_level() > 0 ) {
  ob_end_flush();
}

if( session_id() === "" ) {
  session_start();
}

ignore_user_abort( true );
Algorithm
With setup out of the way, let’s return to the high-level algorithm, which is reasonably straightforward:
- Obtain the file name to download.
- Transmit the file to the client.
- Increment the counter for a unique hit.
We’ll codify this as follows:
$filename = get_sanitized_filename();
$valid_file = !empty( $filename );
$expiry = 24 * 60 * 60;

if( $valid_file && download( $filename ) && token_expired( $expiry ) ) {
  increment_count( "$filename-count.txt" );
}
A key to making this work for multiple files is to keep each file’s count in its own file, named after the requested file with a -count.txt suffix. This means we’ll have to add up all the separate download counts before we’re through.
Let’s break each function down.
Sanitized file names
File names or paths provided by a user must be sanitized to reduce an application’s attack surface. For our purposes, since we control the names of files being offered for download, we’ll make sure that any file name provided will match our own conventions. To simplify the problem, we’ll reference only files located in the same directory as the script.
Here’s a function that sanitizes the file names:
function get_sanitized_filename() {
  $filepath = isset( $_GET[ 'filename' ] ) ? $_GET[ 'filename' ] : '';
  $fileinfo = pathinfo( $filepath );
  $basename = $fileinfo[ 'basename' ];

  if( isset( $_SERVER[ 'HTTP_USER_AGENT' ] ) ) {
    // For Internet Explorer, encode every period except the last as %2e.
    // preg_replace accepts a replacement limit; mb_ereg_replace does not.
    $periods = substr_count( $basename, '.' );
    $basename = strstr( $_SERVER[ 'HTTP_USER_AGENT' ], 'MSIE' )
      ? preg_replace( '/\./', '%2e', $basename, $periods - 1 )
      : $basename;
  }

  // Remove whitespace; mb_ereg_replace patterns take no delimiters.
  $basename = mb_ereg_replace( '\s+', '', $basename );

  // Remove every character outside the permitted set.
  $basename = mb_ereg_replace( '([^\w\d\-_~,;\[\]\(\).])', '', $basename );

  return $basename;
}
Even though multiple periods in file names are allowed on most operating systems, for IE users we encode every period except the last as %2e. Note that if a file name on the web server doesn’t pass through this function unscathed, then the file will have to be renamed before it can be downloaded. That does mean removing spaces from file names.
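To make the intent concrete, here is a minimal sketch of the same idea applied to a plain string instead of $_GET. The function name sanitize_basename and the sample inputs are hypothetical, and the sketch uses preg_replace throughout; it illustrates the technique rather than reproducing the script’s code.

```php
<?php
// Hypothetical helper: reduces a user-supplied path to a conservative
// file name. Illustrative only; not part of count.php.
function sanitize_basename( string $path ): string {
  // Drop any directory components, such as "../../".
  $basename = basename( $path );

  // Remove all whitespace.
  $basename = preg_replace( '/\s+/', '', $basename );

  // Keep only word characters and a few punctuation marks.
  return preg_replace( '/[^\w\-~,;\[\]().]/', '', $basename );
}

echo sanitize_basename( '../../etc/passwd' ), "\n"; // passwd
echo sanitize_basename( 'keen write.bin' ), "\n";   // keenwrite.bin
echo sanitize_basename( 'app<script>.exe' ), "\n";  // appscript.exe
```

Because directory components are stripped first, a request can only ever name a file that sits beside the script.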
Download
We need to make sure that regardless of whatever else happens, if the file name exists and the user requested it, the transmission starts. That’s why we only check whether there have been multiple requests by the same user after sending the file.
Transferring a file has many problems to solve:
- Prevent file size caching so new copies with the same name can be sent.
- Disable content caching so that new versions may be downloaded.
- Determine the mime type to transmit the file’s content type.
- Send mandatory HTTP headers to the client.
- Honour HTTP HEAD requests.
- Transmit the file to the client (often a web browser).
- Allow the client to resume partial downloads.
- Indicate whether the download completed successfully.
On that last point, we cannot know with certainty that the client received the full file. TCP/IP is reliable and robust, but not perfect. However, we’re only counting download hits, not creating life support machinery. Miscounts once in a blue moon are tolerable.
Thus the download function:
function download( $filename ) {
  clearstatcache();

  $size = @filesize( $filename );
  $size = $size === false || empty( $size ) ? 0 : $size;
  $content_type = mime_content_type( $filename );
  list( $seek_start, $content_length ) = parse_range( $size );

  header_remove( 'x-powered-by' );
  header( 'Expires: 0' );
  header( 'Cache-Control: public, must-revalidate, post-check=0, pre-check=0' );
  header( 'Cache-Control: private', false );
  header( "Content-Disposition: attachment; filename=\"$filename\"" );
  header( 'Accept-Ranges: bytes' );
  header( "Content-Length: $content_length" );
  header( "Content-Type: $content_type" );

  $method = isset( $_SERVER[ 'REQUEST_METHOD' ] )
    ? $_SERVER[ 'REQUEST_METHOD' ]
    : 'GET';

  return $method === 'HEAD'
    ? false
    : transmit( $filename, $seek_start, $size );
}
In software, functions that are around 20 lines long tend to be easier to understand than longer functions. Further, functions that are easier to understand tend to have fewer bugs.
The download function is responsible for parsing a range of bytes requested by the client and sending those bytes to the client. It will also short-circuit that logic to honour HTTP HEAD requests, which means only sending the HTTP header attributes (and not the file contents).
Resume downloads
To resume a paused or discontinued download, the client must request that retransmission begin at a certain offset into the file. The full specification for handling partial downloads allows requesting multiple ranges. We won’t implement the full specification because it bloats the code significantly and isn’t necessary in most situations.
If a range was given but it doesn’t conform to the standard, we’ll tell the client and terminate the script immediately. On the same note, if the client specifies an impossible range, the call to fseek later will fail and we’ll simply try to transmit the entire file instead.
Parsing the range entails extracting the offset into the downloaded file to begin transmitting and how much data to send. We’ll return both of these pieces as an array.
function parse_range( $size ) {
  $seek_start = 0;
  $content_length = $size;

  if( isset( $_SERVER[ 'HTTP_RANGE' ] ) ) {
    // Capture the first range's start and end; later ranges are ignored.
    $range_format = '/^bytes=(\d*)-(\d*)(,\d*-\d*)*$/';
    $request_range = $_SERVER[ 'HTTP_RANGE' ];

    if( !preg_match( $range_format, $request_range, $matches ) ) {
      header( 'HTTP/1.1 416 Requested Range Not Satisfiable' );
      header( "Content-Range: bytes */$size" );
      exit;
    }

    // An empty capture means that bound was omitted by the client.
    $seek_start = $matches[ 1 ] !== '' ? (int)$matches[ 1 ] : 0;
    $seek_end = $matches[ 2 ] !== '' ? (int)$matches[ 2 ] : $size - 1;
    $range_bytes = $seek_start . '-' . $seek_end . '/' . $size;
    $content_length = $seek_end - $seek_start + 1;

    header( 'HTTP/1.1 206 Partial Content' );
    header( "Content-Range: bytes $range_bytes" );
  }

  return array( $seek_start, $content_length );
}
For anyone new to PHP, note the guard calls to isset prior to accessing an array element. This allows us to safely retrieve an array value, even if the array isn’t fully initialized or a value is missing. The ternary operator is a syntactically terse way to provide a default value.
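Range parsing is easier to reason about when separated from the headers. The following is a sketch of a stricter single-range parser; the name parse_single_range and the decision to return null for anything unusable are assumptions made for illustration. Unlike the function above, it also handles the suffix form (bytes=-500), which asks for the final bytes of a file.

```php
<?php
// Hypothetical helper: parses a single-range "Range" header value.
// Returns array( start, end ), both inclusive, or null when the value is
// malformed or cannot be satisfied. Multiple ranges are not supported.
function parse_single_range( string $header, int $size ): ?array {
  if( !preg_match( '/^bytes=(\d*)-(\d*)$/', $header, $m ) ) {
    return null;
  }

  list( , $first, $last ) = $m;

  if( $first === '' && $last === '' ) {
    return null;
  }

  if( $first === '' ) {
    // Suffix form: "bytes=-500" requests the final 500 bytes.
    $length = (int)$last;

    if( $length === 0 ) {
      return null;
    }

    $start = max( 0, $size - $length );
    $end = $size - 1;
  }
  else {
    $start = (int)$first;

    // Open-ended form: "bytes=500-" runs to the end of the file.
    $end = $last === '' ? $size - 1 : min( (int)$last, $size - 1 );
  }

  return $start <= $end ? array( $start, $end ) : null;
}

var_dump( parse_single_range( 'bytes=500-', 1000 ) );
```

For a 1000-byte file, bytes=500- yields the pair 500 and 999, whereas bytes=1000- yields null because the offset lies beyond the last byte.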
Transmission
The most technically challenging part comes next: sending the file. There’s a lot that can go wrong here. Essentially, we want to do the following:
- Start buffering the output.
- Open the file and read from a given starting offset.
- Send the file contents as quickly as possible.
- Monitor the connection for failures.
- Answer whether the file was fully downloaded.
Here’s one way to accomplish this:
function transmit( $filename, $seek_start, $size ) {
  if( ob_get_level() == 0 ) {
    ob_start();
  }

  $bytes_sent = -1;
  $fp = @fopen( $filename, 'rb' );

  if( $fp !== false ) {
    @fseek( $fp, $seek_start );

    $aborted = false;
    $bytes_sent = $seek_start;
    $chunk_size = 1024 * 16;

    while( !feof( $fp ) && !$aborted ) {
      print( @fread( $fp, $chunk_size ) );
      $bytes_sent += $chunk_size;

      if( ob_get_level() > 0 ) {
        ob_flush();
      }

      flush();

      $aborted = connection_aborted() || connection_status() != 0;
    }

    if( ob_get_level() > 0 ) {
      ob_end_flush();
    }

    fclose( $fp );
  }

  return $bytes_sent >= $size;
}
The @ symbol allows for silent failures. We set $bytes_sent to -1 because if the file doesn’t exist, we don’t want to count it as a download. The 'rb' when opening means to open the file read-only as a binary file. Entering a loop allows us to check the connection status while transferring the file. When the loop terminates, we force sending any remaining bytes and close the file. We presume the download was successful if all the bytes were transmitted.
After each chunk is sent, we check to see if the connection is still alive. Again, this doesn’t mean that the bytes will be received, only that we sent them while the client connection was active. Close enough.
One possible bottleneck with downloading is the chunk size. Here we trade speed for accuracy in cancellation detection. Generally speaking, a file download counter is probably going to be used to count hits against files greater than 16 kilobytes. If you know you’re serving smaller files, you may want to tweak the chunk size.
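If the chunk-sized granularity matters, one variation is to add the number of bytes actually read rather than the chunk size. The sketch below shows that bookkeeping using in-memory streams as stand-ins for the file and the client; the function name copy_in_chunks is hypothetical and connection checks are left out to keep the example short.

```php
<?php
// Hypothetical helper: copies $in to $out in chunks, starting at $offset.
// Returns the position reached in the source, which is the offset plus
// the number of bytes actually written.
function copy_in_chunks( $in, $out, int $offset, int $chunk_size ): int {
  fseek( $in, $offset );
  $position = $offset;

  while( !feof( $in ) ) {
    $data = fread( $in, $chunk_size );

    // Stop on a read failure or when no bytes remain.
    if( $data === false || $data === '' ) {
      break;
    }

    $written = fwrite( $out, $data );

    if( $written === false ) {
      break;
    }

    $position += $written;
  }

  return $position;
}

// Stand-ins for the file on disk and the client connection.
$source = fopen( 'php://memory', 'w+b' );
fwrite( $source, str_repeat( 'x', 40000 ) );
$sink = fopen( 'php://memory', 'w+b' );

$reached = copy_in_chunks( $source, $sink, 10000, 16 * 1024 );
echo $reached, "\n";
```

With exact byte counts, comparing the position reached against the file size no longer depends on how the file size relates to the chunk size.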
Detect duplicate downloads
Let’s define a duplicate download as a request to retrieve one or more files more than once by the same client within a 24-hour period. This implies that if you have a multi-platform application, when a user completes a download for any one particular platform, downloads for other platforms aren’t counted.
One way to avoid double counting is to use a session variable. As stated earlier, a session variable uses a cookie on the client-side. If a user clears cookies, then any subsequent download request by that same user will result in a double count. Without tracking the IP addresses, there’s not much that can be done about users emptying their cookie jars.
The following function creates a download token that expires after a given lifetime. As we saw earlier, the lifetime is given as 24 * 60 * 60, which is the number of seconds in a day. The time() function returns seconds since the epoch, so all our units must be in seconds. The LAST_DOWNLOAD token tracks when the user last downloaded the file; the CREATED token helps avoid session fixation:
function token_expired( $lifetime ) {
  $TOKEN_NAME = 'LAST_DOWNLOAD';
  $now = time();
  $expired = !isset( $_SESSION[ $TOKEN_NAME ] );

  if( !$expired && ($now - $_SESSION[ $TOKEN_NAME ] > $lifetime) ) {
    $expired = true;

    // Clear stale data but keep the session open so that the timestamps
    // set below are saved; after session_destroy() they would be lost.
    $_SESSION = array();
  }

  $_SESSION[ $TOKEN_NAME ] = $now;
  $TOKEN_CREATE = 'CREATED';

  if( !isset( $_SESSION[ $TOKEN_CREATE ] ) ) {
    $_SESSION[ $TOKEN_CREATE ] = $now;
  }
  else if( $now - $_SESSION[ $TOKEN_CREATE ] > $lifetime ) {
    session_regenerate_id( true );
    $_SESSION[ $TOKEN_CREATE ] = $now;
  }

  return $expired;
}
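The expiry rule itself can be examined without sessions. As a sketch, the hypothetical function download_counts below takes the stored timestamp (or null when there is none) and answers whether a download should be counted; the timestamps are made-up values.

```php
<?php
// Hypothetical pure variant of the expiry rule: a download counts when no
// previous download is on record or the last one is older than $lifetime.
function download_counts( ?int $last_download, int $now, int $lifetime ): bool {
  return $last_download === null || ($now - $last_download) > $lifetime;
}

$day = 24 * 60 * 60;
$now = 1700000000;

var_dump( download_counts( null, $now, $day ) );            // first visit
var_dump( download_counts( $now - 3600, $now, $day ) );     // an hour ago
var_dump( download_counts( $now - 2 * $day, $now, $day ) ); // two days ago
```

The first and last calls return true and the middle call returns false, matching the definition of a duplicate download given earlier.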
Increment counter
Many other approaches to hit counters suggest that, minimally, all that’s necessary is to read a number from a file, increment it, and write it back. Setting everything that has been coded to this point aside, such a simple approach would work up until simultaneous download requests occur at a rate faster than one per millisecond or so. At that point, a race condition ensues and the counter won’t increment accurately.
When separated into small functions, the solution nearly writes itself: Open a lock, read the count, write the new count, and close the lock. In code:
function increment_count( $filename ) {
  // Skip the increment when the lock cannot be acquired.
  if( lock_open( $filename ) ) {
    try {
      // Coerce the stored value to an integer (0 if missing or empty).
      $count = (int)@file_get_contents( $filename );

      // Write the new counter value.
      file_put_contents( $filename, $count + 1 );
    }
    finally {
      lock_close( $filename );
    }
  }
}
Note how the implementation details of what it means to hold and release the lock aren’t a concern of the function that increments the counter. All this function cares about is that reading and writing the counter are mutually exclusive operations performed atomically.
Exclusive locks
Moving on, flock can be slow and, because its locks are advisory, it doesn’t by itself make read/write operations mutually exclusive. Instead, we’ll use the mkdir function because creating a directory is an atomic operation. The locking code is not straightforward and using flock is definitely easier. Nevertheless, we have:
function lock_open( $filename ) {
  $lockdir = create_lock_filename( $filename );
  $iterations = 0;

  do {
    if( @mkdir( $lockdir, 0777 ) ) {
      // Lock acquired.
      $iterations = 0;
    }
    else {
      $iterations++;

      // Suppress the warning raised if the lock vanished in the meantime.
      $lifetime = time() - @filemtime( $lockdir );

      if( $lifetime > 10 ) {
        // Remove a stale lock left behind by a script that died.
        @rmdir( $lockdir );
      }
      else {
        usleep( rand( 1000, 10000 ) );
      }
    }
  }
  while( $iterations > 0 && $iterations < 10 );

  return $iterations == 0;
}
The implementation follows the principle that all loops must have a deterministic termination condition. In this case, acquiring the lock will make at most 10 attempts. Failing to acquire the lock is fine for a hit counter, not for a rocket engine system sending astronauts into orbit.
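The property the lock relies on can be seen in isolation. In this small sketch (the directory name is made up for the demonstration), the first mkdir call succeeds and the second fails because the directory already exists, which is what lets a directory act as a lock:

```php
<?php
// Hypothetical lock directory inside the system's temporary directory.
$lockdir = sys_get_temp_dir() . '/demo-' . uniqid() . '.lock';

$first = @mkdir( $lockdir );  // creates the directory: lock acquired
$second = @mkdir( $lockdir ); // directory exists: lock denied

var_dump( $first, $second );

// Release the lock.
rmdir( $lockdir );
```

Only one of any number of competing scripts can create the directory, so only one of them proceeds to update the counter.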
That leaves us with releasing the lock and creating the directory name that’s unique for the files to be downloaded:
function lock_close( $filename ) {
  @rmdir( create_lock_filename( $filename ) );
}

function create_lock_filename( $filename ) {
  return $filename . '.lock';
}
?>
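For comparison, here is a minimal sketch of the flock-based alternative mentioned above. It is not the approach this post takes; the function name increment_count_flock is hypothetical, and the sketch assumes a file system where advisory locks are supported and that every writer uses the same function.

```php
<?php
// Hypothetical alternative: increments a counter file while holding an
// advisory exclusive lock on it. Returns the new count, or -1 on failure.
function increment_count_flock( string $path ): int {
  // 'c+' opens for reading and writing without truncating, creating the
  // file when it does not exist.
  $fp = fopen( $path, 'c+' );

  if( $fp === false ) {
    return -1;
  }

  $count = -1;

  if( flock( $fp, LOCK_EX ) ) {
    // An empty or missing value is treated as zero.
    $count = (int)stream_get_contents( $fp ) + 1;

    rewind( $fp );
    ftruncate( $fp, 0 );
    fwrite( $fp, (string)$count );
    fflush( $fp );
    flock( $fp, LOCK_UN );
  }

  fclose( $fp );

  return $count;
}

$path = tempnam( sys_get_temp_dir(), 'hits' );
increment_count_flock( $path );
echo increment_count_flock( $path ), "\n"; // 2
unlink( $path );
```

The read and the write happen on the same open handle while the lock is held, which is what keeps two requests from reading the same old value.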
Integration
With our download hit counter code complete, we’re set to integrate it into a static web site. Only a few issues remain:
- Apache’s HTTP daemon may compress responses.
- The download count is to be transparent to the user.
- We want a sum of all the separate downloaded files.
Since you’re still reading, we’ll solve those, too.
Plain old hyperlinks
Alongside the PHP script in the downloads directory, create a file named .htaccess that contains the following instructions:
SetEnv no-gzip dont-vary
<IfModule mod_rewrite.c>
RewriteEngine On
# Ensure the file exists before attempting to download it.
RewriteCond %{REQUEST_FILENAME} -f
# Rewrite requests for file extensions to track.
RewriteRule ^([^/]+\.(zip|app|bin|exe|jar))$ count.php?filename=$1 [L]
</IfModule>
That takes care of the first two problems. Notice that we only run the hit counter for specific file name extensions. If you’re serving up other types of files, you’ll need to modify the extensions accordingly.
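Before deploying, it can help to check which requests the rule’s pattern would capture. The sketch below tests the same regular expression with PHP’s preg_match, on the assumption that the pattern behaves identically there; the request paths are hypothetical, and the check ignores the RewriteCond that requires the file to exist.

```php
<?php
// The RewriteRule pattern, wrapped in PHP regular expression delimiters.
$pattern = '#^([^/]+\.(zip|app|bin|exe|jar))$#';

// Hypothetical request paths, relative to the downloads directory.
$requests = array(
  'keenwrite.bin',
  'keenwrite.exe',
  'notes.txt',
  'old/keenwrite.bin',
);

foreach( $requests as $request ) {
  $tracked = preg_match( $pattern, $request ) === 1;
  echo $request, ': ', $tracked ? 'counted' : 'not rewritten', "\n";
}
```

Only the first two paths match: the third has an untracked extension and the fourth contains a directory separator, which the pattern excludes.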
The HTML page for the downloads can be a plain old anchor link; the server-side counting is completely transparent:
<a href="downloads/filename.bin"
title="Download for 64-bit Linux (x86)"
aria-label="Download for Linux"><img
src="images/icons/download.svg"
alt="Download for Linux"></a>
Change the attributes to your paths, images, descriptions, and alt texts.
Display counts
Finally! We want to display the impressive total download count to end users so that they feel confident that they aren’t the only ones running the software they’ve downloaded. Using SSI in a page such as index.shtml, we run a script that performs the calculation:
Downloaded a whopping <!--#exec cmd="./count.sh" --> times!
Knowing that the running total files end with -count.txt, we can create a shell script named count.sh in the same directory as index.shtml to sum the totals. The script contains:
#!/usr/bin/env bash
awk '{s+=$1} END {print s}' downloads/*-count.txt 2> /dev/null || echo 0
This is reasonably performant for a limited number of downloadable files, because each tracked file adds only one small count file to read.
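Where running a shell script from SSI isn’t an option, the same total could be computed in PHP. This is a sketch only: the function name sum_counts is hypothetical, and the demonstration writes made-up counts to a throwaway directory rather than reading the real downloads directory.

```php
<?php
// Hypothetical helper: totals every "*-count.txt" file in a directory.
function sum_counts( string $directory ): int {
  $total = 0;
  $files = glob( $directory . '/*-count.txt' );

  // glob may return false on error; treat that as no files.
  foreach( $files ?: array() as $file ) {
    $total += (int)@file_get_contents( $file );
  }

  return $total;
}

// Demonstration using a throwaway directory and made-up counts.
$dir = sys_get_temp_dir() . '/counts-' . uniqid();
mkdir( $dir );
file_put_contents( "$dir/app.bin-count.txt", '41' );
file_put_contents( "$dir/app.exe-count.txt", "1\n" );

echo sum_counts( $dir ), "\n"; // 42

array_map( 'unlink', glob( "$dir/*" ) );
rmdir( $dir );
```

A directory without any count files sums to zero, which mirrors the fallback in the shell script.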
Robots
Q: Do we care about robots artificially inflating our hit counts?
A: Yes.
Summary
This isn’t a simple hit counter; rather, this is a fairly robust hit counter for the modern web that works well in practice to track the number of times various files have been downloaded. The approach avoids both JavaScript and a relational database, which can be considered an HTML-first design.
Download the fully commented source code.
About the Author
My career has spanned tele- and radio communications, enterprise-level e-commerce solutions, finance, transportation, modernization projects in both health and education, and much more.
Delighted to discuss opportunities to work with revolutionary companies combatting climate change.