PHP File Download Hit Counter

Describes technical hurdles while building a file download hit counter in PHP.

Introduction

My cross-platform desktop text editor, KeenWrite, bundles a Java Runtime Environment, JavaFX, and Renjin. While this eases cross-platform development and gives the editor the ability to produce living documents from external data, the self-extracting binaries are hefty. One downside to moving from GitHub to GitLab is that the latter limits the size of file uploads associated with releases, meaning KeenWrite is too big to host on GitLab.

As a consequence, there’s no longer a way to display a download count. This post, then, walks through creating a transparent download hit counter in PHP for files hosted on a static web server, without using a database or JavaScript.

To see it in action, visit KeenWrite’s homepage and download a version for your operating system, then refresh the homepage.

Audience

Readers will need to know PHP, HTML, regular expressions, bash, AWK, Server-Side Includes (SSI), and how to use .htaccess files. Whether you’re starting out with PHP, have used it for a few years, or simply want to track file downloads, this post is for you.

Requirements

Follow along using:

Set up and configure the web server to execute both .shtml and .php files.

Behaviour

Generally, I approach problems by writing down the big picture items. In this case, we have:

From such a small idea (count binary file downloads), a cornucopia of problems spill onto our plate, including:

There’s a lot to unpack, so let’s take the issues one by one.

Setup

Before we get into the nitty gritty details, there are a few PHP-centric items we’ll want to address.

Source file format

When using PHP, blank lines or spaces before or after the code blocks may interfere with sending headers. Take care when creating a file named count.php so that the contents have no extra whitespace:

<?php
?>

Place the file in a downloads directory on the web server.

We’ll continue adding to the source file as we go.

Diagnostics

Sometimes web sites will burp PHP errors, often due to database problems. Redirecting errors to a log file avoids exposing such details to end users:

<?php
  ini_set( 'log_errors', 1 );
  ini_set( 'error_log', '/tmp/php-errors.log' );

Monitor /tmp/php-errors.log every so often to make sure no unexpected behaviour is present in the code.

House cleaning

PHP has some quirks that arise from a stateless client-server model, such as addressing denial-of-service attacks by terminating long-running scripts after a period of time. To detect cancelled downloads, we’ll need to make sure the script can run to completion. This entails setting an infinite timeout and continuing to run when the client terminates the connection.

We also want to make sure PHP’s output buffer is closed and that we can maintain a session across separate download requests. That session information allows detecting when the same user makes multiple download requests within a certain time period. A session is typically maintained using a short cookie sent to the client from the server.

  set_time_limit( 0 );

  while( ob_get_level() > 0 ) {
    ob_end_flush();
  }   
    
  if( session_id() === "" ) {
    session_start();
  } 

  ignore_user_abort( true );

Algorithm

With setup out of the way, let’s return to the high-level algorithm, which is reasonably straightforward:

  1. Obtain the file name to download.
  2. Transmit the file to the client.
  3. Increment the counter for a unique hit.

We’ll codify this as follows:

  $filename = get_sanitized_filename();
  $valid_file = !empty( $filename );
  $expiry = 24 * 60 * 60;
        
  if( $valid_file && download( $filename ) && token_expired( $expiry ) ) {
    increment_count( "$filename-count.txt" );
  }

A key to making this work for multiple files is to use the requested file name with a -count.txt suffix. This means adding up all the download counts before we’re through.
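
The order of the conditions matters because && stops at the first false operand. Here is a minimal sketch of that behaviour, using hypothetical stand-in functions (stub_download, stub_token_expired, and should_count are not part of the counter script) that record when they are called:

```php
<?php
// Hypothetical stand-in for download(): records the call, returns a canned result.
function stub_download( array &$calls, bool $succeeds ): bool {
  $calls[] = 'download';
  return $succeeds;
}

// Hypothetical stand-in for token_expired(): records the call, always "expired".
function stub_token_expired( array &$calls ): bool {
  $calls[] = 'token_expired';
  return true;
}

// Mirrors the shape of the condition above. The && operator evaluates
// left to right and stops at the first operand that is false.
function should_count( array &$calls, string $filename, bool $download_succeeds ): bool {
  return !empty( $filename )
    && stub_download( $calls, $download_succeeds )
    && stub_token_expired( $calls );
}
```

When the download fails, the token check never runs, so a cancelled download does not start the 24-hour window.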

Let’s break each function down.

Sanitized file names

Sanitizing file names or paths provided by a user reduces an application’s attack surface. For our purposes, since we control the names of files being offered for download, we’ll make sure that any file name provided will match our own conventions. To simplify the problem, we’ll reference only files located in the same directory as the script.

Here’s a function that sanitizes the file names:

  function get_sanitized_filename() {
    $filepath = isset( $_GET[ 'filename' ] ) ? $_GET[ 'filename' ] : '';
    $fileinfo = pathinfo( $filepath );
    $basename = $fileinfo[ 'basename' ];

    if( isset( $_SERVER[ 'HTTP_USER_AGENT' ] ) ) {
      $periods = substr_count( $basename, '.' );

      // Encode every period except the last for Internet Explorer.
      $basename = strstr( $_SERVER[ 'HTTP_USER_AGENT' ], 'MSIE' )
        ? preg_replace( '/\./', '%2e', $basename, $periods - 1 )
        : $basename;
    }

    // Remove whitespace, then anything outside the allowed characters.
    $basename = preg_replace( '/\s+/', '', $basename );
    $basename = preg_replace( '/[^\w\-~,;\[\]().]/', '', $basename );

    return $basename;
  }

Even though multiple dots in file names are allowed on some operating systems, for IE users we percent-encode every dot except the last; the allowlist then strips the percent sign. Note that if the name of a file on the web server doesn’t pass through this function unscathed, then the file will have to be renamed before it can be downloaded. That does mean removing spaces and extra dots from file names.
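
To get a feel for what the allowlist does, here is a minimal sketch that applies the same idea to a plain string. The helper name sanitize_name is hypothetical, and it uses basename and preg_replace rather than the request-driven function above:

```php
<?php
// Hypothetical helper: the allowlist idea applied to an ordinary string,
// independent of $_GET and $_SERVER so it can be tried from the command line.
function sanitize_name( string $path ): string {
  // Keep only the final path component, which discards "../" prefixes.
  $name = basename( $path );

  // Strip everything except word characters and a few punctuation marks.
  return preg_replace( '/[^\w\-~,;\[\]().]/', '', $name );
}
```

Under these assumptions, sanitize_name( '../../etc/passwd' ) yields passwd, which won’t exist in the downloads directory, and sanitize_name( 'keen write.exe' ) yields keenwrite.exe.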

Download

We need to make sure that regardless of whatever else happens, if the file name exists and the user requested it, the transmission starts. That’s why we only check whether there have been multiple requests by the same user after sending the file.

Transferring a file has many problems to solve:

On that last point, we cannot know with certainty that the client received the full file. TCP/IP is reliable and robust, but not perfect. However, we’re only counting download hits, not creating life support machinery. Miscounts once in a blue moon are tolerable.

Thus the download function:

  function download( $filename ) {
    clearstatcache();

    $size = @filesize( $filename );
    $size = $size === false || empty( $size ) ? 0 : $size;
    $content_type = mime_content_type( $filename );
    list( $seek_start, $content_length ) = parse_range( $size );

    header_remove( 'x-powered-by' );
    header( 'Expires: 0' );
    header( 'Cache-Control: public, must-revalidate, post-check=0, pre-check=0' );
    header( 'Cache-Control: private', false );
    header( "Content-Disposition: attachment; filename=\"$filename\"" );
    header( 'Accept-Ranges: bytes' );
    header( "Content-Length: $content_length" );
    header( "Content-Type: $content_type" );

    $method = isset( $_SERVER[ 'REQUEST_METHOD' ] )
      ? $_SERVER[ 'REQUEST_METHOD' ]
      : 'GET';
    
    return $method === 'HEAD'
      ? false
      : transmit( $filename, $seek_start, $size );
  }

In software, functions that are around 20 lines long tend to be easier to understand than longer functions. Further, functions that are easier to understand tend to have fewer bugs.

The download function is responsible for parsing a range of bytes requested by the client and sending those bytes to the client. It will also short-circuit that logic to honour HTTP HEAD requests, which means sending only the HTTP headers (and not the file contents).

Resume downloads

To resume a paused or discontinued download, the client must request that retransmission begin at a certain offset into the file. The full specification for handling partial downloads allows requesting multiple ranges. We won’t implement the full specification because it bloats the code significantly and isn’t necessary in most situations.

If a range was given but it doesn’t conform to the standard, we’ll tell the client and terminate the script immediately. On the same note, if the client specifies an impossible range, the call to fseek later will fail and we’ll simply try to transmit the entire file instead.

Parsing the range entails extracting the offset into the downloaded file to begin transmitting and how much data to send. We’ll return both of these pieces as an array.

  function parse_range( $size ) {
    $seek_start = 0;
    $content_length = $size;

    if( isset( $_SERVER[ 'HTTP_RANGE' ] ) ) {
      // Capture the first range's start and end; ignore additional ranges.
      $range_format = '/^bytes=(\d*)-(\d*)(?:,\d*-\d*)*$/';
      $request_range = $_SERVER[ 'HTTP_RANGE' ];

      if( !preg_match( $range_format, $request_range, $matches ) ) {
        header( 'HTTP/1.1 416 Requested Range Not Satisfiable' );
        header( "Content-Range: bytes */$size" );

        exit;
      }

      // A missing start or end defaults to the file's first or last byte.
      $seek_start = isset( $matches[ 1 ] ) ? (int)$matches[ 1 ] : 0;
      $seek_end = isset( $matches[ 2 ] ) && $matches[ 2 ] !== ''
        ? (int)$matches[ 2 ]
        : $size - 1;
      $range_bytes = $seek_start . '-' . $seek_end . '/' . $size;
      $content_length = $seek_end - $seek_start + 1;

      header( 'HTTP/1.1 206 Partial Content' );
      header( "Content-Range: bytes $range_bytes" );
    }

    return array( $seek_start, $content_length );
  }

For anyone new to PHP, note the guard calls to isset prior to accessing an array element. This allows us to safely retrieve an array value, even if the array isn’t fully initialized or a value is missing. The ternary operator is a syntactically terse way to provide a default value.
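
Range parsing is easier to reason about when it is separated from $_SERVER and header calls. The following is a sketch of a single-range parser under that assumption; parse_range_header is a hypothetical name, and the suffix form (bytes=-N) and the clamping are additions for illustration rather than part of the script above:

```php
<?php
// Hypothetical pure helper: returns array( start, length ) for a
// single-range header, or null when the range cannot be satisfied.
function parse_range_header( string $header, int $size ): ?array {
  if( !preg_match( '/^bytes=(\d*)-(\d*)$/', $header, $m ) ) {
    return null;
  }

  // "bytes=-" specifies neither a start nor an end.
  if( $m[ 1 ] === '' && $m[ 2 ] === '' ) {
    return null;
  }

  // Suffix form "bytes=-N": the last N bytes of the file.
  if( $m[ 1 ] === '' ) {
    $length = min( (int)$m[ 2 ], $size );

    return $length > 0 ? array( $size - $length, $length ) : null;
  }

  $start = (int)$m[ 1 ];

  // An open-ended or oversized range stops at the last byte.
  $end = $m[ 2 ] === '' ? $size - 1 : min( (int)$m[ 2 ], $size - 1 );

  // Also rejects a start at or beyond the end of the file.
  if( $start > $end ) {
    return null;
  }

  return array( $start, $end - $start + 1 );
}
```

For a 1000-byte file, bytes=100- maps to an offset of 100 and a length of 900, which is what a client resuming a download after 100 bytes would ask for.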

Transmission

The most technically challenging part comes next: sending the file. There’s a lot that can go wrong here. Essentially, we want to do the following:

Here’s one way to accomplish this:

  function transmit( $filename, $seek_start, $size ) {
    if( ob_get_level() == 0 ) {
      ob_start();
    }

    $bytes_sent = -1;

    $fp = @fopen( $filename, 'rb' );

    if( $fp !== false ) {
      @fseek( $fp, $seek_start );

      $aborted = false;
      $bytes_sent = $seek_start;
      $chunk_size = 1024 * 16;

      while( !feof( $fp ) && !$aborted ) {
        $chunk = @fread( $fp, $chunk_size );

        // Stop on a read error rather than looping forever.
        if( $chunk === false ) {
          break;
        }

        print( $chunk );

        // Tally the bytes actually read; the final chunk is usually short.
        $bytes_sent += strlen( $chunk );

        if( ob_get_level() > 0 ) {
          ob_flush();
        }

        flush();

        $aborted = connection_aborted() || connection_status() != 0;
      }

      if( ob_get_level() > 0 ) {
        ob_end_flush();
      }

      fclose( $fp );
    }

    return $bytes_sent >= $size;
  }

The @ symbol allows for silent failures. We set $bytes_sent to -1 because if the file doesn’t exist, we don’t want to count it as a download. The 'rb' when opening means to open the file read-only as a binary file. Entering a loop allows us to check the connection status while transferring the file. When the loop terminates, we force sending any remaining bytes and close the file. We presume the download was successful if all the bytes were transmitted.

At the end of each iteration of the main transmit loop, we check whether the connection is still alive. Again, this doesn’t mean that the bytes will be received, only that we sent them while the client connection was active. Close enough.

One possible bottleneck with downloading is the chunk size. Here we trade speed for accuracy in cancellation detection. Generally speaking, a file download counter is probably going to be used to count hits against files greater than 16 kilobytes. If you know you’re serving smaller files, you may want to tweak the chunk size.
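
To experiment with chunk sizes away from a web server, here is a minimal sketch that copies between two streams and tallies the bytes actually written. The function name copy_in_chunks is hypothetical; it stands in for the read-and-print loop, with a second stream taking the place of the client connection:

```php
<?php
// Hypothetical helper: copies $source to $target in fixed-size chunks and
// returns the number of bytes written, starting at $source's current offset.
function copy_in_chunks( $source, $target, int $chunk_size = 16384 ): int {
  $bytes_copied = 0;

  while( !feof( $source ) ) {
    $chunk = fread( $source, $chunk_size );

    // Stop on a read error or when no data remains.
    if( $chunk === false || $chunk === '' ) {
      break;
    }

    $written = fwrite( $target, $chunk );

    if( $written === false ) {
      break;
    }

    // Count what was written, not the nominal chunk size.
    $bytes_copied += $written;
  }

  return $bytes_copied;
}
```

Seeking the source stream before calling the function mimics a resumed download: only the bytes after the offset are counted.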

Detect duplicate downloads

Let’s define a duplicate download as a request to retrieve one or more files more than once by the same client within a 24-hour period. This implies that if you have a multi-platform application, when a user completes a download for any one particular platform, downloads for other platforms aren’t counted.

One way to avoid double counting is to use a session variable. As stated earlier, a session variable uses a cookie on the client-side. If a user clears cookies, then any subsequent download request by that same user will result in a double count. Without tracking the IP addresses, there’s not much that can be done about users emptying their cookie jars.

The following function creates a download token that expires after a given lifetime. As we saw earlier, the lifetime is given as 24 * 60 * 60, which is the number of seconds in a day. The time() function returns seconds since the epoch, so all our units must be in seconds. The LAST_DOWNLOAD token tracks when the user last downloaded the file; the CREATED token helps avoid session fixation:

  function token_expired( $lifetime ) {
    $TOKEN_NAME = 'LAST_DOWNLOAD';
    $now = time();
    $expired = !isset( $_SESSION[ $TOKEN_NAME ] );

    if( !$expired && ($now - $_SESSION[ $TOKEN_NAME ] > $lifetime) ) {
      $expired = true;
      $_SESSION = array();

      session_destroy();
    }

    $_SESSION[ $TOKEN_NAME ] = $now;

    $TOKEN_CREATE = 'CREATED';

    if( !isset( $_SESSION[ $TOKEN_CREATE ] ) ) {
      $_SESSION[ $TOKEN_CREATE ] = $now;
    }
    else if( $now - $_SESSION[ $TOKEN_CREATE ] > $lifetime ) {
      session_regenerate_id( true );
      $_SESSION[ $TOKEN_CREATE ] = $now;
    }

    return $expired;
  }
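
The heart of the duplicate check is a comparison of two timestamps. Here is a sketch of that comparison as a pure function, assuming the session is passed in as an ordinary array; is_duplicate is a hypothetical name and isn’t used by the script above:

```php
<?php
// Hypothetical helper: true when the session array records a download
// that happened no more than $lifetime seconds before $now.
function is_duplicate( array $session, int $now, int $lifetime ): bool {
  return isset( $session[ 'LAST_DOWNLOAD' ] )
    && ( $now - $session[ 'LAST_DOWNLOAD' ] <= $lifetime );
}
```

A first-time visitor has no LAST_DOWNLOAD entry, so the request is never treated as a duplicate; a repeat request an hour later is, and one made more than a day later is not.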

Increment counter

Many other approaches to hit counters suggest that, minimally, all that’s necessary is to read a number from a file, increment it, and write it back. Setting everything that has been coded to this point aside, such a simple approach would work up until simultaneous download requests occur at a rate faster than one per millisecond or so. At that point, a race condition ensues and the counter won’t increment accurately.
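
The lost update can be reproduced deterministically by interleaving the steps of two requests by hand. This is only a simulation within a single process, and simulate_lost_update is a hypothetical name:

```php
<?php
// Hypothetical simulation: two "requests" read the counter before either
// writes, so one of the two increments is lost.
function simulate_lost_update( string $counter_file ): int {
  file_put_contents( $counter_file, '0' );

  // Request A and request B both read the same starting value.
  $seen_by_a = (int)file_get_contents( $counter_file );
  $seen_by_b = (int)file_get_contents( $counter_file );

  // Each writes its own idea of "one more"; B overwrites A's result.
  file_put_contents( $counter_file, (string)( $seen_by_a + 1 ) );
  file_put_contents( $counter_file, (string)( $seen_by_b + 1 ) );

  return (int)file_get_contents( $counter_file );
}
```

Two downloads happened, yet the file ends up holding 1.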

When separated into small functions, the solution nearly writes itself: Open a lock, read the count, write the new count, and close the lock. In code:

  function increment_count( $filename ) {
    try {
      lock_open( $filename );

      // Coerce the stored value to an integer; a missing file counts as 0.
      $count = (int)@file_get_contents( $filename );

      // Write the new counter value.
      file_put_contents( $filename, $count + 1 );
    }
    finally {
      lock_close( $filename );
    }
  }

Note how the implementation details of what it means to hold and release the lock aren’t a concern of the function that increments the counter. All this function cares about is that reading and writing the counter are mutually exclusive operations performed atomically.

Exclusive locks

Moving on, flock can be slow and, because its locks are advisory on most systems, it does not by itself make reads and writes mutually exclusive. Instead, we’ll use the mkdir function because creating a directory is an atomic operation. The locking code is not straightforward, and using flock would definitely be easier. Nevertheless, we have:

  function lock_open( $filename ) {
    $lockdir = create_lock_filename( $filename );

    $iterations = 0;

    do {
      if( @mkdir( $lockdir, 0777 ) ) {
        $iterations = 0;
      }
      else {
        $iterations++;
        $lifetime = time() - filemtime( $lockdir );

        if( $lifetime > 10 ) {
          @rmdir( $lockdir );
        }
        else {
          usleep( rand( 1000, 10000 ) );
        }
      }
    }
    while( $iterations > 0 && $iterations < 10 );

    return $iterations == 0;
  }

The implementation follows the principle that all loops must have a deterministic termination condition. In this case, acquiring the lock will make at most 10 attempts. Failing to acquire the lock is fine for a hit counter, not for a rocket engine system sending astronauts into orbit.
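
The property the lock relies on is that creating a directory either succeeds or fails because the directory already exists. A minimal sketch of that property, without the retry loop or stale-lock handling, where try_lock and unlock are hypothetical names:

```php
<?php
// Hypothetical helper: one attempt to take the lock by creating a directory.
function try_lock( string $lockdir ): bool {
  // Returns false when the directory already exists (the lock is held).
  return @mkdir( $lockdir, 0777 );
}

// Hypothetical helper: releases the lock by removing the directory.
function unlock( string $lockdir ): bool {
  return @rmdir( $lockdir );
}
```

A second try_lock on the same path fails until unlock removes the directory, which is the behaviour the retry loop depends on.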

That leaves us with releasing the lock and creating the directory name that’s unique for the files to be downloaded:

  function lock_close( $filename ) {
    @rmdir( create_lock_filename( $filename ) );
  }

  function create_lock_filename( $filename ) {
    return $filename .'.lock';
  }
?>

Integration

With our download hit counter code complete, we’re set to integrate it into a static web site. Only a few issues remain:

Since you’re still reading, we’ll solve those, too.

Alongside the PHP script in the downloads directory, create a file named .htaccess that contains the following instructions:

SetEnv no-gzip dont-vary

<IfModule mod_rewrite.c>
RewriteEngine On

# Ensure the file exists before attempting to download it.
RewriteCond %{REQUEST_FILENAME} -f

# Rewrite requests for file extensions to track.
RewriteRule ^([^/]+\.(zip|app|bin|exe|jar))$ count.php?filename=$1 [L]
</IfModule>

That takes care of the first two problems. Notice that we only run the hit counter for specific file name extensions. If you’re serving up other types of files, you’ll need to modify the extensions accordingly.
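
Before editing the extension list, it can help to try the pattern against a few names. The sketch below mirrors the rule’s regular expression in PHP; is_tracked is a hypothetical name, and unlike the rewrite condition it doesn’t check that the file exists:

```php
<?php
// Hypothetical helper: true when a request path, relative to the downloads
// directory, has the shape the RewriteRule pattern accepts.
function is_tracked( string $request_path ): bool {
  return preg_match( '/^([^\/]+\.(zip|app|bin|exe|jar))$/', $request_path ) === 1;
}
```

Names in subdirectories and names with an untracked final extension fall through to the web server’s normal handling.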

The HTML page for the downloads can be a plain old anchor link; the server-side counting is completely transparent:

<a href="downloads/filename.bin"
   title="Download for 64-bit Linux (x86)"
   aria-label="Download for Linux"><img
     src="images/icons/download.svg"
     alt="Download for Linux"></a>

Change the attributes to your paths, images, descriptions, and alt texts.

Display counts

Finally! We want to display the impressive total download count to end users so that they feel confident that they aren’t the only ones running the software they’ve downloaded. Using SSI from a page such as index.shtml, we run a script that performs the calculation:

Downloaded a whopping <!--#exec cmd="./count.sh" --> times!

Knowing that the running total files end with -count.txt, we can create a shell script named count.sh in the same directory as index.shtml to sum the totals. The script contains:

#!/usr/bin/env bash

awk '{s+=$1} END {print s}' downloads/*-count.txt 2> /dev/null || echo 0

This is reasonably performant for a limited number of downloads.
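
For comparison, the same sum can be sketched in PHP. This is an illustration rather than a replacement for the shell script; total_downloads is a hypothetical name and the directory is passed in as a parameter:

```php
<?php
// Hypothetical helper: sums the integer found in every *-count.txt file
// within $directory; unreadable or empty files contribute zero.
function total_downloads( string $directory ): int {
  $total = 0;
  $files = glob( $directory . '/*-count.txt' );

  // glob returns false on error, so fall back to an empty list.
  foreach( $files ?: array() as $file ) {
    $total += (int)@file_get_contents( $file );
  }

  return $total;
}
```

Files that don’t end in -count.txt are ignored, so other text files can share the directory.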

Robots

Q: Do we care about robots artificially inflating our hit counts?

A: Yes.

Summary

This isn’t a simple hit counter; rather, this is a fairly robust hit counter for the modern web that works well in practice to track the number of times various files have been downloaded. The approach avoids both JavaScript and a relational database, which can be considered an HTML-first design.

Download the fully commented source code.

Contact

About the Author

My career has spanned tele- and radio communications, enterprise-level e-commerce solutions, finance, transportation, modernization projects in both health and education, and much more.

Delighted to discuss opportunities to work with revolutionary companies combatting climate change.