Working with large files with sed
This week, I've been using `sed` quite a lot. `sed` is short for "stream editor", and is designed to work with large amounts of streaming text.
I've been working with Google BigQuery, inserting large amounts of data. Unfortunately, some of the data was malformed and was causing errors whilst I was trying to ingest it. Luckily, BigQuery tells you what line number + column the error occurred on.
I initially tried to open the file in `vim`, but I realised that the text file was probably a bit big. As `sed` is designed for streaming text, it's the perfect tool for the job.
`sed` allows you to specify a range of line numbers to apply your command to. You can specify a single line, or a range of lines.
If we wanted to replace "Dog" with "Cat" on the 3rd line only:
```bash
sed -i '3 s/Dog/Cat/' /path/to/file
```
If we wanted to replace "Dog" with "Cat" on lines 1-100:
```bash
sed -i '1,100 s/Dog/Cat/' /path/to/file
```
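One caveat worth knowing: the `-i` (in-place) flag behaves differently across implementations. GNU `sed` accepts a bare `-i`, but the BSD `sed` that ships with macOS requires a backup suffix argument, which may be empty:

```bash
# BSD/macOS sed: -i takes a backup suffix, so pass '' for "no backup"
sed -i '' '3 s/Dog/Cat/' /path/to/file
```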
By default, `sed` prints every line. This isn't what I wanted, so I ran all of my commands with the `-n` flag, meaning "don't print by default". Then, I used the `p` command to say "print these lines".
Imagine that my error was on line 38224. It'd be pretty hard to get to that line in a text editor, but it's really easy using `sed`.
```bash
sed -n '38224p' /path/to/file
```
That says "don't print anything by default, but print line 38224".
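If you want some context around the offending line, the same range syntax works with the `p` command. For example, to print a few lines either side of the error (the range here is just illustrative):

```bash
# Print lines 38220-38230, giving some context around line 38224
sed -n '38220,38230p' /path/to/file
```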
If you wanted to take lines 30129 to 33982 and work with them independently, you can use `sed` to write them out to another file. Note that the `w` command takes the output filename as part of the sed expression:

```bash
sed -n '30129,33982w /path/to/output' /path/to/file
```
Alternatively, you can use the `p` command and pipe the output of `sed` into `xclip` or `pbcopy` to get only the lines you're interested in copied onto your clipboard.
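On macOS, that might look something like this (using the same line range as above):

```bash
# Copy lines 30129-33982 straight to the clipboard
sed -n '30129,33982p' /path/to/file | pbcopy
```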
`sed` is an old tool, but it still has its uses. If you ever need to work with a subset of a large amount of text data, you could do a lot worse than `sed`.