Beginner's Guide To Data Sanitization And Validation In WordPress

One of the biggest differences between code written by novice developers and more advanced developers is that experienced developers tend to pay a lot more attention towards sanitizing and validating inputs, and late escaping outputs.

Novice developers, on the other hand, tend to overlook these practices. I know I did. I didn’t even know what sanitization, validation, or escaping were. And, while these steps may not seem exciting, they are essential for preventing errors and security exploits on the sites that are running your code.

In this article, I will discuss the basics of sanitizing and validating inputs in the context of WordPress plugin and theme development. My goal is not to provide an exhaustive guide on the topic. Instead, I want to teach you the basics, help you understand what you need to be concerned about, and show you resources to help you learn more. I will be following this article up with a similar article on properly escaping data in WordPress.

As I said before, this is not information I was aware of when I was just starting out as a WordPress developer. Incorporating these practices into your workflow is an important step towards becoming a more seasoned developer.

I’m aware that following these best practices adds time to your job. If you’re new to WordPress development, it’s likely that you’re not charging enough for your time. You can’t make “validating and sanitizing inputs” a line item on your invoice. You just have to either charge more for your project, or if you charge by the hour, estimate for more hours.

Trust No One And Nothing

Validation, generally speaking, is the process of ensuring that the data we are about to work with both exists and is what we expect it to be. Sanitization, in general, is the process of preparing data to be sent to the database and ensuring it is safe to be entered.

As a developer, you should never make assumptions about inputs. Assuming that the content of all requests to your site is properly formatted or does not have malicious intent is poor practice and should be avoided at all costs.

Trust no requests, trust no data already in the database, and definitely don’t trust yourself.

Client-Side Vs. Server-Side Validation

When it comes to validation, form-submission is a major part of the discussion. It’s one of the most common ways that data goes from a client — a browser, app, script or whatever else the end-user or bot is using to make requests to your site — into WordPress. Forms, which may be in the front-end or back-end, should have validation. The form should be designed to prevent the submission of incomplete or malformed data.

Form validation and other types of validation in the browser are no substitute for server-side validation. Every endpoint on your site that accepts input from an HTTP request needs validation, because even if it is supposed to be used only with a form, that is not the only way data can be submitted to that endpoint.

Client-side validation is an essential part of front-end design, but it is separate from your responsibilities in terms of the code running on the server before passing inputs to the database.

Validation Part One: Is The Request Authorized?

Validation can be thought of as a series of questions. The first question you need to ask yourself is whether or not the request is even allowed. Input from an unauthorized user, even if it is otherwise valid and safe, is still a major security issue if it’s allowed in.

For the most part, this is determined by whether or not the current user has a specific capability as defined by WordPress. In addition, most of the time, we want to prevent submissions that do not originate from our site. This is not always the case, but when it is, nonces and checks of the request referrer are good, though not perfect, ways to prevent cross-site request forgery attacks.

That is why the first step in processing all requests should be to check if a nonce is present and, if so, to verify it. Here is an example where we check that a POST request has a key called _wpnonce and, only if it does, proceed to check if the current user has a capability:

if( isset( $_POST[ '_wpnonce' ] ) && wp_verify_nonce( $_POST[ '_wpnonce' ] ) ){
    if( current_user_can( 'some_capability' ) ){
 
    }
}

Both auth cookies and nonces can be intercepted. Since nonces that are created for logged-in users are tied to the user ID they were created for, using one prevents mismatched cookies and nonces from being used. In addition, since nonces have a limited lifespan before they become invalid — 12 hours, by default — the usefulness of an intercepted nonce is minimal.
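The check shown earlier calls wp_verify_nonce() without an action name. As a hedged sketch, the same check can be tied to a named action so that a nonce issued for one form is not accepted by another. The helper name slug_request_is_authorized(), the action 'slug_save_things' and the capability 'edit_posts' are hypothetical choices for illustration, and the code assumes it runs inside WordPress:

```php
<?php
/**
 * Returns true only when the request carries a valid nonce for $action
 * and the current user has $capability.
 *
 * Hypothetical helper: the name and parameters are illustrative.
 * Assumes WordPress is loaded (wp_verify_nonce(), current_user_can()).
 */
function slug_request_is_authorized( array $request, $action, $capability ) {
    // The nonce must be present and must be a string before we verify it.
    if ( ! isset( $request['_wpnonce'] ) || ! is_string( $request['_wpnonce'] ) ) {
        return false;
    }

    // wp_verify_nonce() returns false for an invalid or expired nonce.
    if ( ! wp_verify_nonce( $request['_wpnonce'], $action ) ) {
        return false;
    }

    return (bool) current_user_can( $capability );
}
```

In a form, wp_nonce_field( 'slug_save_things' ) would output the matching _wpnonce field, and the handler could then call slug_request_is_authorized( $_POST, 'slug_save_things', 'edit_posts' ).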

Validation Part Two: Is The Correct Data Present?

Imagine a request that is designed to modify two meta fields of a post. That is only going to work if the fields we expect to pass are present. So before further processing a request, we want to make sure we have the right data.

In the context of a form submission or other data coming from a client, we need to make sure that the expected fields in the GET or POST superglobal are set. Here is a simple example, adding to the example from the last section, which checks that the data we need is set, using the function isset():

if( isset( $_POST[ '_wpnonce' ] ) && wp_verify_nonce( $_POST[ '_wpnonce' ] ) ){
    if( current_user_can( 'some_capability' ) ) {
        if( isset( $_POST[ 'number_of_thing' ] )  ){
 
        }
    }
}

This same principle applies to any function in the global scope that assumes any structure for the variables passed into it. For example, consider this function:

function slug_something( $post_id, $data ){
 
    update_post_meta( $post_id, 'foo', $data[ 'foo' ] );
    update_post_meta( $post_id, 'bar', $data[ 'bar' ] );
 
}

This works perfectly if $post_id is the ID of a post and the variable $data is an array with the keys ‘foo’ and ‘bar’. If not, however, the end result is not going to be as expected. The performance will be affected and PHP notices may be displayed.

Instead, we must ensure that our inputs are valid before using update_post_meta:

function slug_something( $post_id, $data ){
 
    if ( isset( $data[ 'bar' ], $data[ 'foo' ] ) ) {
        update_post_meta( $post_id, 'foo', $data[ 'foo' ] );
        update_post_meta( $post_id, 'bar', $data[ 'bar' ] );
    }
 
}

Notice that I chose not to check if $post_id is a valid post ID. This is because WordPress does it for me. Depending on how this function is used, however, it might make sense to check, since returning an error when a conditional like if ( is_object( get_post( $post_id ) ) ) fails would be useful.

Validation Part Three: Is The Input Correctly Formatted?

This third question has two parts. The first, which we’ve actually started to discuss in the last section, concerns the type and format of the input.

PHP is a dynamically typed language. In PHP, variables have types (object, string, integer, float, array, etc.) but those types do not have to be explicitly declared. Because it is dynamically-typed, a variable can change its type. In PHP, there is nothing wrong with this code:

$foo = 'bar';
$foo = array( 'bar', 'foo' );

The variable $foo started out as a string and then became an array. In addition, we are able to multiply a string containing a number by an integer. In other languages, the interpreter is not so flexible.
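As a small illustration of both points, here is a sketch in plain PHP. The variable names $type_before, $type_after and $product are only for this example:

```php
<?php
// A variable's type follows whatever value it currently holds.
$foo         = 'bar';
$type_before = gettype( $foo ); // "string"

$foo        = array( 'bar', 'foo' );
$type_after = gettype( $foo ); // "array"

// A numeric string is converted to a number for arithmetic.
$product = '3' * 4; // int(12)

echo $type_before, ' ', $type_after, ' ', $product, "\n"; // string array 12
```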

This kind of flexibility is very useful, but it can lead to errors caused by functions being passed variables of the wrong type. A function written to expect an array will likely cause errors when passed a string.

Consider this function:

function slug_display_status( $post ) {
    echo $post->post_status;
}

This works perfectly as long as $post is an object of the WP_Post class and the public property $post_status has not been unset or changed to a type that cannot be echoed.

Technically, any object with a public property of $post_status, or a magic __get() method that returns a type that can be echoed, will work. Yes, I’m getting extra pedantic here, but these are concerns we need to worry about in terms of validation.

On a more realistic note, it is highly likely that the function would be passed a post ID instead of an object. So here is a simple revision to validate our input:

function slug_display_status( $post ) {
    if ( is_numeric( $post ) ) {
        $post = get_post( $post );
    }
 
    if ( is_object( $post ) && isset( $post->post_status ) ) {
        echo $post->post_status;
    }
 
}

In this case, we are not just ensuring that the $post is an object with a public property of post_status, but we also provide an opportunity for the most likely type of invalid input to be converted to valid input.

Pulling It All Together

The same principles hold true when accepting data from an HTTP request. So now, I want to pull together everything we have discussed so far. We will use the various pieces I’ve shown so far to make a request that should have three parts — a post ID, an array of data to save, and a nonce. We will validate that the request is authorized, that the data we need is present, and that it is properly formatted. In addition, we will return information about the post that is affected or error data.

Before we start, keep the following two things in mind if you are a beginner. First, conditionals are evaluated left to right. That means that if you have two conditionals, and one can only be safely evaluated if the other passes, that’s OK if your order is right. For example:

if(  is_array( $bats ) && isset( $bats[ 'hats' ] ) ) {
 
}

If $bats is not an array, PHP will never get to the second conditional because it’s only valid if $bats is an array. On the other hand, if you reversed the order and $bats was not an array, then you would have a problem.
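To make the ordering concrete, here is a minimal sketch that wraps the conditional in a function. The name slug_has_hats() is hypothetical and the sample values are made up:

```php
<?php
/**
 * Hypothetical helper showing the safe ordering: the type check runs
 * first, so the key lookup only happens when $bats is an array.
 */
function slug_has_hats( $bats ) {
    return is_array( $bats ) && isset( $bats['hats'] );
}

var_dump( slug_has_hats( array( 'hats' => 'top' ) ) );   // bool(true)
var_dump( slug_has_hats( array( 'cats' => 'tabby' ) ) ); // bool(false)
var_dump( slug_has_hats( 'not an array' ) );             // bool(false)
```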

The other thing to keep in mind is that my syntax is guaranteed to cause errors on outdated versions of PHP that, statistically speaking, many of you are likely using. Do yourself a huge favor and stop.

First, we check if the request is authorized. This first bit of code is very similar to the code I showed earlier, except that if the validation fails, we will set a status header to indicate an error and exit.

if( isset( $_POST[ '_wpnonce' ] ) && wp_verify_nonce( $_POST[ '_wpnonce' ] ) ){
    if( current_user_can( 'some_capability' ) ) {
        //request is authorized?
    }else{
        status_header( '403' );
        die();
    }
}else{
    status_header( '401' );
    die();
}

Now that we have determined that the request is in fact authorized, we need to ensure that the request has valid input. Let’s assume that we need two keys in the POST variable: post_id, which must be a valid post ID, and data, which must be an array containing two strings, “foo” and “bar.” Here is our updated code:

if( isset( $_POST[ '_wpnonce' ] ) && wp_verify_nonce( $_POST[ '_wpnonce' ] ) ){
    if( current_user_can( 'some_capability' ) ) {
        if( ! isset( $_POST[ 'post_id' ] )  || ! is_numeric( $_POST[ 'post_id' ] ) || ! is_object( get_post( $_POST[ 'post_id' ] ) ) ) {
            status_header( '400' );
            echo "post_id must be set and represent a valid post";
            die();
        }elseif( ! isset( $_POST[ 'data' ] ) || ! is_array( $_POST[ 'data' ] ) || ! isset( $_POST[ 'data' ][ 'foo' ], $_POST[ 'data' ][ 'bar' ] ) ) {
            status_header( '400' );
            echo 'data must be set and be an array containing the keys "foo" and "bar"';
            die();
        }else{
            //request is valid
        }
    } else {
        status_header( '403' );
        die();
    }
} else {
    status_header( '401' );
    die();
}

Presuming that we will accept anything for our data we are about to save, we would be ready to pass the $_POST data to update_post_meta(). But that is unlikely to be a good idea.

We should not presume that our data is the right type and is a valid option. Instead, let’s ensure that these pieces of data are strings and are valid options. Since we are not yet discussing sanitization, the only way to ensure the data we are saving is ready to save is to check that it is one of a pre-defined set of options we know are OK.

if ( isset( $_POST[ '_wpnonce' ] ) && wp_verify_nonce( $_POST[ '_wpnonce' ] ) ) {
    if ( current_user_can( 'some_capability' ) ) {
        if ( ! isset( $_POST[ 'post_id' ] ) || ! is_numeric( $_POST[ 'post_id' ] ) || ! is_object( get_post( $_POST[ 'post_id' ] ) ) ) {
            status_header( '400' );
            echo "post_id must be set and represent a valid post";
            die();
        } elseif ( ! isset( $_POST[ 'data' ] ) || ! is_array( $_POST[ 'data' ] ) || ! isset( $_POST[ 'data' ][ 'foo' ], $_POST[ 'data' ][ 'bar' ] ) ) {
            status_header( '400' );
            echo 'data must be set and be an array containing the keys "foo" and "bar"';
            die();
        } else {
            $allowed = array( 'hats', 'shoes', 'socks' );

            // Validate both values before saving anything.
            foreach ( [ 'foo', 'bar' ] as $key ) {
                if ( ! is_string( $_POST[ 'data' ][ $key ] ) || ! in_array( $_POST[ 'data' ][ $key ], $allowed, true ) ) {
                    status_header( 400 );
                    echo 'foo and bar must be either hats, shoes or socks';
                    die();
                }
            }

            // Both values passed validation, so it is now safe to save them.
            foreach ( [ 'foo', 'bar' ] as $key ) {
                update_post_meta( $_POST[ 'post_id' ], $key, $_POST[ 'data' ][ $key ] );
            }
            $post = get_post( $_POST[ 'post_id' ] );
            if ( is_object( $post ) ) {
                status_header( 200 );
            } else {
                status_header( 500 );
            }
            die();
        }
    } else {
        status_header( '403' );
        die();
    }
} else {
    status_header( '401' );
    die();
}
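Deeply nested conditionals like these get hard to follow. As one possible alternative, the presence, type and allowed-value checks could be pulled into a single function that returns the first problem it finds. This is only a sketch: slug_find_request_error() is a hypothetical name, and because it is plain PHP it does not confirm that the post exists, which would still need get_post() inside WordPress.

```php
<?php
/**
 * Hypothetical helper: returns an error message for the first problem
 * found in $request, or null when post_id and data look valid.
 * It does not check that the post exists.
 */
function slug_find_request_error( array $request ) {
    $allowed = array( 'hats', 'shoes', 'socks' );

    if ( ! isset( $request['post_id'] ) || ! is_numeric( $request['post_id'] ) ) {
        return 'post_id must be set and numeric';
    }

    if ( ! isset( $request['data'] ) || ! is_array( $request['data'] ) ) {
        return 'data must be set and be an array';
    }

    foreach ( array( 'foo', 'bar' ) as $key ) {
        if ( ! isset( $request['data'][ $key ] ) || ! is_string( $request['data'][ $key ] ) ) {
            return $key . ' must be set and be a string';
        }
        // Strict comparison so that only the exact strings are accepted.
        if ( ! in_array( $request['data'][ $key ], $allowed, true ) ) {
            return $key . ' must be either hats, shoes or socks';
        }
    }

    return null;
}
```

A handler could call this with $_POST, send a 400 status with the returned message when the result is not null, and only save when it is null.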

Please notice that I’ve been keeping this discussion abstract and not validating that the right endpoint is being used. In many cases, for example, when using the WordPress REST API or admin AJAX, WordPress will route the request to the correct callback.

It is not uncommon to hook into init or admin_init and check if a certain GET or POST variable is set, and if so to process that data in a certain way. That is fine as long as the validation is correct.

For example, this would process a specific type of request or do nothing:

add_action( 'init', function() {
    if( isset( $_GET[ 'my-api-action' ] ) && in_array( $_GET[ 'my-api-action' ], array( 'read', 'delete' ) ) ) {
        echo function_that_proccessed_my_api();
        die();
    }
 
});

On the other hand, failing to check that the request is valid for this context is going to cause a lot of problems.

Validation Is Not Enough

So far, we’ve covered validation, which is an important step before saving data. But, just because data is formatted properly, does not make it safe. Similarly, just because data is in the right location and of the right type, does not mean it is safe to use.

Imagine we are saving a string. Do we want to accept any string? Or should we exclude a string that has JavaScript or MySQL in it, which might be malicious? It might also be a string that, through PHP serialization or JSON syntax, could be interpreted later as an array or object.
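To illustrate the last point, here is a short plain-PHP sketch. The input values are made up for this example. Both pass an is_string() check, yet both turn into arrays once something decodes them:

```php
<?php
// Both values below are plain strings when they arrive in a request.
$json_input       = '{"role":"administrator"}';
$serialized_input = 'a:1:{s:4:"role";s:13:"administrator";}';

var_dump( is_string( $json_input ) && is_string( $serialized_input ) ); // bool(true)

// Decoded later, the same strings become arrays.
$from_json       = json_decode( $json_input, true );
$from_serialized = unserialize( $serialized_input, array( 'allowed_classes' => false ) );

var_dump( is_array( $from_json ) );   // bool(true)
var_dump( $from_serialized['role'] ); // string(13) "administrator"
```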

Sanitization is all about context. In some contexts, you might only want to accept a string with no spaces or HTML tags. In other cases, you want to accept HTML.

Validation and sanitization can be easily confused. Validation functions return false when the input is invalid. This is great for use in a conditional, wrapped around a sanitization function and a save function.

Sanitization functions return a clean version of what is passed to them. Of course, there may be nothing left after removing the unsafe part of the input. In some cases, saving the empty result is just fine. In other cases, you may need to validate again after sanitization.

Consider these examples using the validation function is_email() and the sanitization function sanitize_email():

var_dump( is_email( '[email protected]' ) ); //true
var_dump( is_email( 'x' ) ); //false
var_dump( sanitize_email( '1@' ) ); // ''
var_dump( sanitize_email( '1@ha^ts.com' ) ); //1@hats.com
var_dump( sanitize_email( '[email protected]' ) ); //[email protected]

In the first two cases, we are validating that a string could be an email address. In the next three, we are converting an email address to something that is safe to put in the database, and we can expect to come out of the database as an email address.

We would put this together:

function slug_save_email( $email ) {
    if ( is_email( $email ) ) {
        $email = sanitize_email( $email );
        //you may now save $email;
    }
}

Or if we wanted to avoid saving empty strings, like this:

function slug_save_email( $email ) {
    if ( is_email( $email ) ) {
        $email = sanitize_email( $email );
        if ( '' !== $email ) {
            //you may now save $email;
        }
    }
}

WordPress is full of helpful sanitization functions for use by context — sanitize_email() is just one. There are a ton in wp-includes/formatting.php. If your IDE has auto-complete for WordPress core functions, you can generally find the correct function that way since they are named by usage.

If your IDE does not have this functionality, then you should get a better one. I recommend PhpStorm.

Just be sure to read at least the inline docs for the function to make sure that the function does what you think it does. While you are in the source code, it’s helpful to read what these functions are doing. It will help you learn more about which one is the most appropriate to use. It will also prepare you for the situation that is bound to happen where there is no pre-built function for the sanitization type you need.

I also want to mention wp_kses() and its related functions. This function is one of the most resource-intensive ways to sanitize inputs. It is also highly extensible and reliable.

The main function, wp_kses(), takes an array of allowed HTML tags as its second argument. There’s a variety of functions that wrap wp_kses() for use in specific contexts. For example, wp_kses_post() is designed to be used for sanitizing post content. What makes functions like wp_kses_post() either really useful or inappropriate is that they decide for you which HTML is allowed.

For example, wp_kses_post() removes <script> tags but keeps the HTML that WordPress allows in post content. Be aware that WordPress itself filters saved post content differently based on the current user: on a single site, content saved by an admin keeps its <script> tags, while content saved by a contributor does not. If either behavior is a problem, then you will need to use wp_kses() and specify your own array of allowed HTML, though wp_kses_allowed_html() may be useful in getting the right array.
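As a rough sketch of calling wp_kses() with your own list, the array below allows only links, emphasis and strong text. The function names slug_allowed_bio_html() and slug_sanitize_bio() and the choice of tags are hypothetical, and the second function assumes WordPress is loaded:

```php
<?php
/**
 * Hypothetical allowlist: only links, emphasis and strong text.
 * Each key is a tag name; each value lists the attributes allowed on it.
 */
function slug_allowed_bio_html() {
    return array(
        'a'      => array(
            'href'  => array(),
            'title' => array(),
        ),
        'em'     => array(),
        'strong' => array(),
    );
}

/**
 * Strips everything except the tags and attributes listed above.
 * Assumes WordPress is loaded so that wp_kses() is available.
 */
function slug_sanitize_bio( $html ) {
    return wp_kses( $html, slug_allowed_bio_html() );
}
```

Keeping the list in its own function makes it easy to reuse the same rules wherever that kind of content is saved.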

Security Is Your Job

You are probably not a security expert. And, if you are, this article isn’t for you, but please leave some comments on what else you think beginners need to learn.

I am not a security expert, but secure code shouldn’t be optional. Learning these basic best practices and taking the time to implement them, whether or not your clients or end-users notice, is a part of your job. Your clients or end-users are going to notice an exploited security vulnerability.

One awesome thing about WordPress is that the Internet is full of great example code on how to do pretty much anything with it. We can easily paste this code into our functions.php.

Do you check to make sure the code you copy and paste is secure first? Do you know how to evaluate and fix it? I hope that this article has helped you begin to evaluate for yourself and learn how to see a red flag coming at you.