Querying JSON and XML with jq and xq

Ben Asher
Engineering

4 minute read

jq is a tool I use at least two to three times a week (maybe daily when it's my week to investigate Sentry issues in production), so I thought I'd write about it and its XML offshoot, xq (via yq).

A lot of our day-to-day work involves reviewing API request and response data in structured text formats, whether we're debugging, reviewing input or output formats for vendor APIs, or something similar. Often, I have a question about a JSON blob, and that blob is too large to inspect by just giving it a "once over." Enter jq. In its most basic form, jq lets you query JSON documents, but it's also useful for linting and formatting minified JSON.

Basic Usage

Let's say I have the following JSON blob:

{"firstName":"Ben","lastName":"Asher","email":"ben@example.com"}

Copy this to your clipboard and run the following command (after you install jq with, for example, brew install jq):

pbpaste | jq '.'

This prints the following:

{
  "firstName": "Ben",
  "lastName": "Asher",
  "email": "ben@example.com"
}

What we did was pipe the contents of our clipboard into jq and use the jq filter . to print it with better formatting (and colors!). If the JSON were invalid, jq would tell you where. I've found this immensely useful for quickly fixing issues in our public API documentation, since our API documentation vendor requires valid JSON in the sample responses we present to readers.
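If all you want is the validity check, without the pretty-printed output, one approach (a sketch, not something from the workflow above) is jq's built-in empty filter, which consumes its input and prints nothing, so only parse errors surface:

```shell
# `empty` discards the parsed input: valid JSON produces no output,
# invalid JSON produces a parse error and a nonzero exit code.
echo '{"firstName":"Ben"}' | jq empty && echo "valid"
echo '{"firstName":'       | jq empty || echo "invalid"
```

This makes jq usable as a quick linter in scripts, since you can branch on the exit code.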

Querying JSON

Things get even better when you need to query JSON. I often find myself investigating a Sentry error and debugging an issue that requires inspecting JSON involved in the error. Let's say I have a list of people in JSON format:

[
  {
    "firstName": "Ben",
    "lastName": "Asher",
    "email": "ben@example.com"
  },
  {
    "firstName": "Ronald",
    "lastName": "Martin",
    "email": "ronald@example.com"
  }
]

If I want to see just the emails, I can do jq '.[].email'. That filter tells jq to look at each array element (you can also do .[0] to select just the first one, for example) and select the email attribute from each. This prints:

"ben@example.com"
"ronald@example.com"
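To make the .[0] form mentioned above concrete, here's a small sketch using echo in place of the clipboard:

```shell
# Index into the array first, then select a field: prints only the first
# person's email instead of iterating over every element.
echo '[{"email":"ben@example.com"},{"email":"ronald@example.com"}]' \
  | jq '.[0].email'
# → "ben@example.com"
```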

For a recent issue, I had data like the JSON above, and the backend code threw an error if one of the object fields was missing. The amount of data was too large to inspect manually. With jq, I could query each attribute, and if jq couldn't find it, it printed null. So, to find which attribute was the culprit, I ran jq '.[].email' | grep 'null' and replaced email with each sibling attribute until I got a null.
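For what it's worth, jq can also do that null-hunting in a single step with its select operator; this is a sketch of an alternative, not what was actually run:

```shell
# Print only the objects where `email` is missing.
# In jq, accessing a key that doesn't exist yields null, so `select`
# keeps exactly the objects that would have tripped the backend.
echo '[{"firstName":"Ben","email":"ben@example.com"},{"firstName":"Ronald"}]' \
  | jq '.[] | select(.email == null)'
# → {"firstName":"Ronald"}
```

The advantage over grep is that you see the entire offending object, not just the null, which often identifies the bad record immediately.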

Querying XML

jq is super powerful, but recently I wanted to do something like the example above, but with XML. Our customer job feeds for vendors are in XML and often contain thousands of jobs. I found out a new vendor would not ingest jobs that didn't have a valid two-digit country code. Before we launched with the vendor, I needed to figure out how many jobs would be impacted. I figured there must be an equivalent for XML, right? There is: a tool called yq that wraps jq and supports several other formats. It ships with a tool called xq that you can use for XML.

Back to my task, I needed to figure out how many job postings in the feed would be dropped by the vendor. For simplicity, I'll use the snippet below as our "feed" and walk you through what I did. You can install yq with brew install python-yq.

Let's copy the feed to our clipboard:

<?xml version="1.0" encoding="utf-8"?>
<source>
  <publisher>Ashby</publisher>
  <job>
    <title><![CDATA[Product Engineer]]></title>
    <country><![CDATA[US]]></country>
  </job>
  <job>
    <title><![CDATA[Full-Stack Engineer]]></title>
    <city><![CDATA[Remote]]></city>
    <country><![CDATA[CA]]></country>
  </job>
  <job>
    <title><![CDATA[Pilot]]></title>
    <city><![CDATA[Remote]]></city>
    <country><![CDATA[PT]]></country>
  </job>
  <job>
    <title><![CDATA[Software Engineer]]></title>
    <city><![CDATA[Remote]]></city>
    <country><![CDATA[DE]]></country>
  </job>
  <job>
    <title><![CDATA[VP of Sales]]></title>
    <city><![CDATA[San Francisco]]></city>
    <country><![CDATA[United States or Canada]]></country>
  </job>
  <job>
    <title><![CDATA[Data Analyst]]></title>
  </job>
</source>

Then save the contents of our clipboard to a file:

pbpaste > feed.xml

From here, I want to query countries. Countries are at the path .source.job[].country. I can do the following to quickly check them out:

cat feed.xml | xq '.source.job[].country' | less

Looking at that output, I see some nulls and some country codes, so I know I'm in the right place.

Stepping back, our vendor only accepts two-digit country codes. The way our feed code works is that we look up the country code for the user-entered country (for example, via node-countries). If we find a code, we use it; if we can't, we fall back to the original value (people sometimes enter invalid countries like "Remote" or "United States or Canada"). Also, some jobs won't have a country at all, which is why we see some nulls.

Operators

To know what the vendor is going to drop, we can use the length of the country field (only values of length two will work), which xq (and jq) can calculate! To get the length of every country value, I can do this:

cat feed.xml | xq '.source.job[].country | length' | less

That's right! xq and jq have their own operators, which you can pipe to within the filter itself. The filter above outputs the length of each country, and looking at the output, we see 0s where we had nulls before and 2s for most things (phew!). To get our final answer, we can run the following command:

cat feed.xml | xq '.source.job[].country | length' | grep -v '^2$' | wc -l

This takes the output we were viewing in less and pipes it to grep. With -v, grep only gives us lines that don't match, and anchoring the pattern (^2$) makes sure we only drop lines that are exactly 2 — a plain grep -v '2' would also discard a length like 23, since that line contains the digit 2. What's left is all of the "bad" countries. Piping that to wc -l counts the lines and gives us our answer.
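Incidentally, the whole count can also be computed inside xq itself with select, so no grep or wc is needed; this is a variant sketch, equivalent for this feed:

```shell
# Collect the country values whose length isn't 2 (null has length 0),
# then take the length of that array to count the bad jobs.
cat feed.xml | xq '[.source.job[].country | select(length != 2)] | length'
# → 2  (the "United States or Canada" job and the job with no country)
```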

Conclusion

jq and its offshoots are super powerful, and I use them all the time, even if it's just the basic path-based searching of JSON files (or formatting to make smaller ones nicer to read). For everything else, I can usually find a useful operator (via Stack Overflow or jq's docs) that'll help me put together fancier queries.
