# Follow GitHub Link headers with Bash
When working with the GitHub API, data may be returned across multiple pages of results. This is communicated using a `Link` header with `rel="next"`. There are libraries available to help work with this header, but if you're writing a shell script then it's not as easy as it could be.
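If you've never seen one, you can inspect the header directly with `curl`. Here's a quick sketch; the repository is just an example, and any paginated endpoint will do:

```bash
# Fetch only the response headers (-I) and show the Link header.
# grep -i covers both "Link:" (HTTP/1.1) and "link:" (HTTP/2).
curl -sI "https://api.github.com/repos/octocat/Hello-World/pulls?per_page=1&state=all" \
  | grep -i '^link:'
```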
If you’re looking for an example script, here’s one that fetches multiple pages of pull requests and outputs them to the terminal:
```bash
PULLS=""
URL="https://api.github.com/repos/:owner/:repo/pulls?per_page=100"

while [ "$URL" ]; do
  RESP=$(curl -i -Ss -H "Authorization: token $GITHUB_TOKEN" "$URL")
  HEADERS=$(echo "$RESP" | sed '/^\r$/q')
  URL=$(echo "$HEADERS" | sed -n -E 's/Link:.*<(.*?)>; rel="next".*/\1/p')
  PULLS="$PULLS $(echo "$RESP" | sed '1,/^\r$/d')"
done

echo "$PULLS"
```
Be careful! Each page is a list of objects, so `$PULLS` won't be valid JSON. Thankfully, `jq` can process this format just fine, as it works with streaming data.
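For example, `jq` will happily read the concatenated arrays one after another. The commands below are a sketch, assuming `jq` is installed:

```bash
# jq reads each top-level JSON value (one array per page) in turn,
# and .[] iterates over the objects inside each of them
echo "$PULLS" | jq -r '.[].title'

# Or slurp everything and concatenate it back into one valid JSON array
echo "$PULLS" | jq -s 'add'
```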
Make an HTTP request:

```bash
RESPONSE=$(curl -i -Ss -H "Authorization: token $GITHUB_TOKEN" "$URL")
```

Extract just the HTTP headers:

```bash
echo "$RESPONSE" | sed '/^\r$/q'
```

Extract the `rel="next"` link:

```bash
echo "$RESPONSE" | sed -n -E 's/Link:.*<(.*?)>; rel="next".*/\1/p'
```

Extract just the response body:

```bash
echo "$RESPONSE" | sed '1,/^\r$/d'
```
## How it works
There are a lot of cool tricks in the script above - let’s take them one at a time.
```bash
while [ "$URL" ]; do
```
This script works because `$URL` will be empty if there's no `rel="next"` link header. We set the default URL to the first page, and if all the results fit on a single page the loop will only execute once.
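Here's the pattern in isolation, with a placeholder in the spot where the real script extracts the next link:

```bash
# Toy version of the same pattern: keep looping while $URL is non-empty
URL="https://example.com/page-1"   # placeholder starting URL
while [ "$URL" ]; do
  echo "fetching $URL"
  URL=""   # the real script sets this from the rel="next" header
done
```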
```bash
curl -i
```
We use `curl` to fetch the data from the API. The `-i` flag includes the response headers in addition to the JSON payload returned.
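You can see the effect against any endpoint; this one returns a short plain-text body, so the boundary between headers and body is easy to spot:

```bash
# -i prints the status line and headers before the body;
# -Ss hides the progress meter but still reports errors
curl -i -Ss "https://api.github.com/zen"
```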
```bash
sed '/^\r$/q'
```
This `sed` command runs until it finds a line that matches the supplied pattern, then stops processing the input (`q` means quit). By specifying `^\r$` as the match pattern, it stops executing as soon as it finds an empty line, signifying the end of the HTTP headers.
This means that once you run `HEADERS=$(echo "$RESP" | sed '/^\r$/q')`, the variable `$HEADERS` will contain only the HTTP headers for the response.
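You can test this on a canned response without hitting the network at all. The response below is made up, but has the same shape (header lines ending in `\r\n`, then a blank line, then the body):

```bash
# A minimal, hypothetical HTTP response: headers, a blank CRLF line, a body
RESP=$(printf 'HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n\r\n[{"id": 1}]\n')

# Prints the two header lines (and the blank separator), then quits
echo "$RESP" | sed '/^\r$/q'
```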
```bash
sed -n -E 's/Link:.*<(.*?)>; rel="next".*/\1/p'
```
Now that we've got the headers, we can use `sed` once again to extract the `rel="next"` link from the `$HEADERS` string. It looks for a line containing `Link:`, then anything up to a string contained between `<` and `>`. It captures that string using parentheses, but only if the characters that follow are `; rel="next"`. Finally, it prints only the value of the capture group using `\1`.
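Here it is running against a made-up `Link` header (the URLs are hypothetical, but the shape matches what GitHub returns):

```bash
# Two links, comma separated, exactly as GitHub formats them
LINK='Link: <https://api.github.com/repositories/1/pulls?page=2>; rel="next", <https://api.github.com/repositories/1/pulls?page=5>; rel="last"'

# Prints https://api.github.com/repositories/1/pulls?page=2
echo "$LINK" | sed -n -E 's/Link:.*<(.*?)>; rel="next".*/\1/p'
```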
```bash
URL=$(echo "$HEADERS" | sed -n -E 's/Link:.*<(.*?)>; rel="next".*/\1/p')
```
If there is a `rel="next"` link available, it'll populate `$URL` and the loop will run again, fetching the next page. If not, `$URL` will be empty and the loop will stop executing.
```bash
sed '1,/^\r$/d'
```
Finally, we need the JSON response without the headers. `sed` comes to the rescue once more, this time with the `d` (delete) command. This one says: start at line `1`, search until you find an empty line, and delete everything in that range, returning the remaining content. This allows us to extract the response body without the HTTP headers.
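Running it against the same canned response from earlier shows only the body surviving:

```bash
# Reusing the hypothetical response from above
RESP=$(printf 'HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n\r\n[{"id": 1}]\n')

# Everything from line 1 through the blank line is deleted,
# leaving just the JSON body: [{"id": 1}]
echo "$RESP" | sed '1,/^\r$/d'
```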