Data Validation using Pydantic Models
Validate data, prevent script failures.
In the realm of automation, scripts often thrive on the variables they receive. These variables determine the actions the script will perform. However, if a script encounters a variable in a format or data type it doesn't expect, it might throw an error with a message that's about as clear as mud. This is where data validation comes into play.
Validating the data passed to a script is like giving it a road map to success. It ensures that the script knows what to expect and how to handle it. Whether the data is coming from another script or an end device, validation helps prevent those cryptic error messages and keeps your automation journey smooth sailing.
What is Data Validation?
Data validation is like the gatekeeper of your data world—it's all about ensuring that the data you're dealing with is accurate, reliable, and fits the requirements of whatever you're trying to do with it. Think of it as quality control for your data before you start using it in your programs or analyses. There are various ways to validate data depending on what you need it for and what rules it needs to follow. And that's where pydantic swoops in to save the day!
In this post, we'll dive into how pydantic can be your trusty sidekick in the world of data validation. We'll explore how it works and why it's such a handy tool to have in your toolkit.
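Before diving into the firewall example, here's the smallest possible taste of what that looks like in practice (the model and values below are purely illustrative):
from pydantic import BaseModel, ValidationError

class Interface(BaseModel):
    name: str
    mtu: int  # must be (coercible to) an integer

try:
    Interface(name="Ethernet1/1", mtu="jumbo")
except ValidationError as exc:
    # pydantic tells us exactly which field failed and why
    print(exc)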
Example
Imagine you're tasked with automating the process of adding network objects to a firewall, specifically a Palo Alto Networks firewall. These network objects could represent things like IP addresses, subnets, or ranges of addresses.
Here's a snippet of what that might look like:
from rich import print
import requests
import json
data = [
    {"name": "test1", "ip": "1.1.1.1/32", "type": "ip-netmask"},
    {"name": "test2", "ip": "google.com", "type": "fqdn"},
    {"name": "test3", "ip": "1.1.1.30-1.1.1.20", "type": "ip-range"},
]

for obj in data:
    # One REST call per network object
    url = f"https://192.168.1.41:443/restapi/v10.1/Objects/Addresses?location=vsys&vsys=vsys1&name={obj['name']}"
    payload = json.dumps({
        "entry": [
            {
                "@name": obj["name"],
                obj["type"]: obj["ip"],
            }
        ]
    })
    headers = {
        'Accept': 'application/json',
        'Content-Type': 'application/json',
        'X-PAN-KEY': 'LUFRPT05SDExaFNseXkwZDZtUk9kNmRxYnhhWFAySUk9Vm8yQThKYVdNYzhzdGNMTkxzZlQxSC85SDhEUEkwWVBrajdKTStYUGZrQ3hpYkUrRnFBN3JtT1BWdnRKQjhxMA==',
        'Cookie': 'PHPSESSID=db0278ee49c9ace2f10e9cdd667aaa36'
    }
    # verify=False skips TLS certificate verification -- acceptable for a lab firewall only
    response = requests.request("POST", url, headers=headers, data=payload, verify=False)
    print(response.text)
Executing the above script results in a partial failure. The network objects test1 and test2 are created successfully, but when it comes to test3, things take a turn for the worse. The firewall refuses to create the network object and throws an error message that looks something like this:
{"code":3,"message":"Invalid
Object","details":[{"@type":"CauseInfo","causes":[{"code":12,"module":"panui_mgmt","description":"Invalid Object: test3
-> ip-range 1.1.1.30-1.1.1.20 range start IP is higher than range end IP. test3 -> ip-range is invalid."}]}]}
This error message is the firewall's way of saying, "Hey, I can't work with this! The start of an IP range has to be lower than the end." It's a clear indication that the data being passed to the firewall doesn't meet its requirements.
So, while our script may have partially succeeded in creating some network objects, it ultimately falls short due to the invalid data.
The Solution
To steer clear of those pesky partial failures, it's crucial to validate our dataset before it even gets near our script—especially when we're handing it off to the Palo Alto Networks API to create network objects.
Enter pydantic, our trusty ally in the world of data validation. We can craft a pydantic model to ensure our data makes the grade before it ever interacts with the script. Pydantic isn't just limited to defining types; it's also adept at performing conditional checks on our data. This means we can set up rules and conditions that our data must meet in order to pass validation. It's like having a built-in guardrail to ensure our data stays on the right track. Let's explore how we can harness this powerful feature to further enhance our data validation process.
Let's delve into the process of defining a model and setting up checks for the same dataset we examined in the previous example. This hands-on approach will give us a clearer understanding of how we can leverage pydantic's capabilities to ensure our data meets our criteria.
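Here's a minimal sketch of what such a model might look like, assuming Pydantic v1 syntax (the FQDN regex, validator name, and exact error wording are illustrative choices rather than the only way to do it). Each Address entry checks that its ip value actually matches the type it claims to be, and a thin NetworkAddresses wrapper validates the whole list. If you're on Pydantic v2, model_validator is the equivalent hook and the error output will look slightly different.
import ipaddress
import re
from typing import List

from pydantic import BaseModel, ValidationError, root_validator


class Address(BaseModel):
    name: str
    ip: str
    type: str

    @root_validator
    def check_value_matches_type(cls, values):
        ip, kind = values.get("ip"), values.get("type")
        if kind == "ip-netmask":
            # Raises ValueError with a descriptive message for bad networks
            ipaddress.ip_network(ip, strict=False)
        elif kind == "fqdn":
            # Deliberately simple FQDN check: letters, digits, dots and hyphens only
            if not re.fullmatch(r"[A-Za-z0-9.-]+", ip):
                raise ValueError(f"{ip} Invalid FQDN")
        elif kind == "ip-range":
            start, end = ip.split("-")
            if ipaddress.ip_address(start) >= ipaddress.ip_address(end):
                raise ValueError(f"Start ip - {start} must be less than end ip - {end}.")
        return values


class NetworkAddresses(BaseModel):
    addresses: List[Address]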
To put our data through the validation wringer, we'll attempt to initialize the class we defined with our dataset. Brace yourself, though—I've purposely sprinkled a few more errors into the dataset below for demonstration purposes. Let's see how our pydantic model handles the challenge!
data = [
    {"name": "test1", "ip": "1.1.1.300/32", "type": "ip-netmask"},
    {"name": "test2", "ip": "*.paloaltonetworks.com", "type": "fqdn"},
    {"name": "test3", "ip": "1.1.1.30-1.1.1.20", "type": "ip-range"},
]

try:
    output = NetworkAddresses(addresses=data)
    print(output)
except ValidationError as e:
    print(e)
Running the validation now produces the error messages below.
ValidationError(
model='NetworkAddresses',
errors=[
{'loc': ('addresses', 0, '__root__'), 'msg': "'1.1.1.300/32' does not appear to be an IPv4 or IPv6 network", 'type': 'value_error'},
{'loc': ('addresses', 1, '__root__'), 'msg': ' *.paloaltonetworks.com Invalid FQDN', 'type': 'value_error'},
{'loc': ('addresses', 2, '__root__'), 'msg': 'Start ip - 1.1.1.30 must be less than end ip - 1.1.1.20.', 'type': 'value_error'}
]
)
Looking closely at the error messages, each one clearly indicates the location of the error (0 indicating the first object in our list of data) along with the same descriptive error message we defined in our class.
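To tie this back to the original script, one way to use the model is as a gate in front of the API calls: validate the whole dataset first, and only loop over the firewall requests if everything passed. The sketch below reuses the data, headers, json, and requests pieces from the first snippet, so those details are assumptions carried over from there:
# data, headers, json and requests are the same as in the first snippet
try:
    validated = NetworkAddresses(addresses=data)
except ValidationError as e:
    print(e)
    raise SystemExit("Dataset failed validation - nothing was sent to the firewall")

# Only reached when every entry passed validation
for obj in validated.addresses:
    url = f"https://192.168.1.41:443/restapi/v10.1/Objects/Addresses?location=vsys&vsys=vsys1&name={obj.name}"
    payload = json.dumps({"entry": [{"@name": obj.name, obj.type: obj.ip}]})
    response = requests.request("POST", url, headers=headers, data=payload, verify=False)
    print(response.status_code, response.text)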
Conclusion
The importance of data validation cannot be overstated. By ensuring our data is thoroughly validated, we greatly reduce the risk of encountering partial failures in our scripts. Furthermore, catching errors in our data early on allows us to address them proactively, preventing potential headaches down the line. So remember, when it comes to scripting success, thorough data validation is your best friend.