Working with large files with sed
This week, I've been using
sed quite a lot.
sed is short for stream editor, and is designed to work with large amounts of streaming text.
I've been working with Google BigQuery, inserting large amounts of data. Unfortunately, some of the data was malformed and was causing errors whilst I was trying to ingest it. Luckily, BigQuery tells you what line number + column the error occurred on.
I initially tried to open the file in
vim, but I realised that the text file was probably a bit big. As
sed is designed for streaming text, it's the perfect tool for the job.
sed allows you to specify a range of line numbers to apply your command to. You can specify a certain line, or a range of lines.
If we wanted to replace "Dog" with "Cat" on the 3rd line only:
sed -i '3 s/Dog/Cat/' /path/to/file
If we wanted to replace "Dog" with "Cat" on lines 1-100:
sed -i '1,100 s/Dog/Cat/' /path/to/file
sed prints every line. This isn't what I wanted, so I ran all of my commands with the
-n flag, meaning "don't print by default". Then, I used the
p command to say "print these lines".
Imagine that my error was on line 38224. It'd be pretty hard to get to that line in a text editor, but it's really easy using
sed -n '38224p' /path/to/file
That says "don't print anything by default, but print line 32844'.
If you wanted to take lines 30129 to 33982 and work with them independently, you can use sed to write them out to another file:
sed -n '30129,33982w' /path/to/file
You can then pipe the output of
pbcopy to get only the lines you're interested in copied onto your clipboard.
sed is an old tool, but it still has it's uses. If you ever need to work with a subset of a large amount of text data, you could do a lot worse than