Follow GitHub Link headers with Bash

11 May 2020 in Tech

When working with the GitHub API, results may be split across multiple pages. The API signals this with a Link response header that contains a URL marked rel="next". There are libraries that parse this header for you, but if you’re writing a shell script you have to handle it yourself.

If you’re looking for an example script, here’s one that fetches multiple pages of pull requests and outputs them to the terminal:

bash
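PULLS=""
URL="https://api.github.com/repos/:owner/:repo/pulls?per_page=100"
# Keep fetching while the previous response pointed at another page
while [ "$URL" ]; do
  # -i includes the response headers in the output
  RESP=$(curl -i -Ss -H "Authorization: token $GITHUB_TOKEN" "$URL")
  # Everything up to the first blank (CR-only) line is the headers
  HEADERS=$(echo "$RESP" | sed '/^\r$/q')
  # Empty when there is no rel="next" link, which ends the loop
  URL=$(echo "$HEADERS" | sed -n -E 's/Link:.*<(.*?)>; rel="next".*/\1/p')
  # Everything after the blank line is the JSON body
  PULLS="$PULLS $(echo "$RESP" | sed '1,/^\r$/d')"
done
# Quote the variable so the shell does not split or glob the JSON
echo "$PULLS"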

Be careful! Each page is a JSON array of objects, so $PULLS holds a sequence of arrays rather than a single valid JSON document. Thankfully, jq can process this format just fine, because it reads its input as a stream of JSON values.
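
If a single array is easier to work with, one option is jq's slurp mode. The following is a minimal sketch that uses two made-up pages in place of real API output and assumes jq is installed; MERGED and COUNT are names chosen for this example.

```bash
# Two hypothetical pages of results, joined the same way the script builds $PULLS
PULLS='[{"number":1},{"number":2}] [{"number":3}]'

# -s (slurp) collects every JSON value into one outer array;
# add then concatenates the inner arrays into a single array
MERGED=$(printf '%s\n' "$PULLS" | jq -c -s 'add')
COUNT=$(printf '%s\n' "$MERGED" | jq 'length')

echo "$MERGED"
echo "$COUNT pull requests"
```

Merging like this loads every page into memory at once, which should be fine for a few hundred pull requests.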

Make an HTTP request:

bash
RESPONSE=$(curl -i -Ss -H "Authorization: token $GITHUB_TOKEN" "$URL")

Extract just the HTTP Headers:

bash
echo "$RESPONSE" | sed '/^\r$/q'

Extract the rel="next" link:

bash
echo "$RESPONSE" | sed -n -E 's/Link:.*<(.*?)>; rel="next".*/\1/p'

Extract just the response body:

bash
echo "$RESPONSE" | sed '1,/^\r$/d'

How it works

There are a lot of cool tricks in the script above. Let’s take them one at a time.

bash
while [ "$URL" ]; do

This script works because $URL will be empty if the response has no rel="next" link. We start with $URL set to the first page, and if all the results fit on a single page, the loop will only execute once.
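
The loop condition can be tried without touching the network. In this sketch, next_page is a hypothetical stand-in for the curl and sed steps, and the page names are placeholders rather than real URLs:

```bash
# next_page is a hypothetical stand-in for the curl + sed steps:
# it prints the name of the following page, or an empty line after page3.
next_page() {
  case "$1" in
    page1) echo "page2" ;;
    page2) echo "page3" ;;
    *)     echo "" ;;
  esac
}

URL="page1"
FETCHED=0
# [ "$URL" ] is true for any non-empty string and false for an empty one
while [ "$URL" ]; do
  FETCHED=$((FETCHED + 1))
  echo "fetching $URL"
  URL=$(next_page "$URL")
done
echo "fetched $FETCHED pages"
```

The loop stops by itself once next_page returns nothing, which is the same way the real script stops on the last page.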

bash
curl -i

We use curl to fetch the data from the API. The -i flag includes the response headers in the output, ahead of the JSON payload. The -Ss flags hide the progress meter while still reporting errors.

bash
sed '/^\r$/q'

This sed command prints its input until it reaches a line that matches the supplied pattern, then stops (q means quit). HTTP header lines end with a carriage return followed by a line feed, so the blank line that marks the end of the headers contains only a carriage return. That is the line ^\r$ matches.

This means that once you run HEADERS=$(echo "$RESP" | sed '/^\r$/q'), the variable $HEADERS will contain only the HTTP headers for the response.
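
To see the header split in isolation, the sketch below builds a made-up response with printf instead of calling the API. It relies on sed understanding \r, which GNU sed does; HEADER_LINES is a name chosen for this example.

```bash
# A made-up response in the shape curl -i prints: CRLF line endings,
# then a CR-only line, then the body.
RESP=$(printf 'HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n\r\n[{"number":1}]\n')

# Print lines up to and including the CR-only line, then quit
HEADERS=$(printf '%s\n' "$RESP" | sed '/^\r$/q')

# Count the lines that were kept: status line, one header, blank line
HEADER_LINES=$(printf '%s\n' "$HEADERS" | grep -c '')
echo "$HEADER_LINES header lines"
```

The JSON body never reaches $HEADERS because sed quits before reading it.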

bash
sed -n -E 's/Link:.*<(.*?)>; rel="next".*/\1/p'

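Now that we’ve got the headers, we can use sed once again to extract the rel="next" link from the $HEADERS string. The -n flag suppresses sed’s default output, and the p at the end of the substitution prints only the lines where a replacement happened. The pattern looks for a line containing Link:, then skips ahead to a URL wrapped in < and > that is immediately followed by ; rel="next", capturing the URL using parentheses. The whole line is then replaced with the captured group, \1. Note that sed’s extended regular expressions have no lazy quantifiers, so the ? in (.*?) does not make the match non-greedy. The pattern still finds the right URL because rel="next" appears only once in the header.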
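
The extraction can be checked against a sample header. This sketch uses a slightly stricter variant of the pattern, not the one from the script: [Ll]ink accepts a lowercase header name, which is how it arrives over HTTP/2, and [^>]* stops the capture at the first closing bracket. The URLs are placeholders.

```bash
# A made-up Link header in the comma-separated format GitHub uses
HEADERS=$(printf '%s\r\n' \
  'HTTP/2 200' \
  'link: <https://api.example.test/pulls?page=1>; rel="prev", <https://api.example.test/pulls?page=3>; rel="next", <https://api.example.test/pulls?page=5>; rel="last"')

# Replace the whole Link line with the URL that precedes rel="next"
NEXT=$(printf '%s\n' "$HEADERS" | sed -n -E 's/^[Ll]ink:.*<([^>]*)>; rel="next".*/\1/p')
echo "$NEXT"
```

If the header has no rel="next" entry, the substitution never matches, nothing is printed, and $NEXT stays empty.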

bash
URL=$(echo "$HEADERS" | sed -n -E 's/Link:.*<(.*?)>; rel="next".*/\1/p')

If there is a rel="next" link available, it’ll populate $URL and the loop will run again, fetching the next page. If not, it’ll be empty and the loop will stop executing.

bash
sed '1,/^\r$/d'

Finally, we need the JSON response without the headers. sed comes to the rescue again, this time using the d (delete) command. The address 1,/^\r$/ selects every line from line 1 through the first blank line, and d deletes them, so only the response body is printed.
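
As a quick check, the same made-up response style can be used to confirm that only the body survives. This is a sketch with sample data, and it again assumes a sed that understands \r, such as GNU sed:

```bash
# A made-up response: status line, one header, CR-only line, then the JSON body
RESP=$(printf 'HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n\r\n[{"number":1},{"number":2}]\n')

# Delete everything from line 1 through the CR-only line
BODY=$(printf '%s\n' "$RESP" | sed '1,/^\r$/d')
echo "$BODY"
```

Because the range ends at the first blank line, blank lines inside the body itself are left alone.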