Working with large files with sed
This week, I've been using `sed` quite a lot. `sed` is short for "stream editor", and is designed to work with large amounts of streaming text.
I've been working with Google BigQuery, inserting large amounts of data. Unfortunately, some of the data was malformed and was causing errors whilst I was trying to ingest it. Luckily, BigQuery tells you what line number + column the error occurred on.
I initially tried to open the file in `vim`, but I realised that the text file was probably a bit big. As `sed` is designed for streaming text, it's the perfect tool for the job.
`sed` allows you to specify a range of line numbers to apply your command to. You can specify a single line, or a range of lines.
If we wanted to replace "Dog" with "Cat" on the 3rd line only:
```bash
sed -i '3 s/Dog/Cat/' /path/to/file
```
If we wanted to replace "Dog" with "Cat" on lines 1-100:
```bash
sed -i '1,100 s/Dog/Cat/' /path/to/file
```
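One caveat worth knowing: the `-i` (in-place) flag behaves differently across implementations. GNU `sed` accepts a bare `-i`, but the BSD `sed` that ships with macOS requires a backup suffix argument, which may be empty:

```bash
# BSD/macOS sed: -i takes a backup suffix, so pass '' for "no backup"
sed -i '' '3 s/Dog/Cat/' /path/to/file
```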
By default, `sed` prints every line. This isn't what I wanted, so I ran all of my commands with the `-n` flag, meaning "don't print by default". Then, I used the `p` command to say "print these lines".
Imagine that my error was on line 38224. It'd be pretty hard to get to that line in a text editor, but it's really easy using `sed`.
```bash
sed -n '38224p' /path/to/file
```
That says "don't print anything by default, but print line 38224".
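If you want some context around the offending line, the same range syntax works with the `p` command. For example, to print a few lines either side of the error (the range here is just illustrative):

```bash
# Print lines 38220-38230, giving some context around line 38224
sed -n '38220,38230p' /path/to/file
```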
If you wanted to take lines 30129 to 33982 and work with them independently, you can use `sed` to write them out to another file. Note that the `w` command takes the output filename as part of the sed expression:

```bash
sed -n '30129,33982w /path/to/output' /path/to/file
```
Alternatively, you can use the `p` command and pipe the output of `sed` into `xclip` or `pbcopy` to get only the lines you're interested in copied onto your clipboard.
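On macOS, that might look something like this (using the same line range as above):

```bash
# Copy lines 30129-33982 straight to the clipboard
sed -n '30129,33982p' /path/to/file | pbcopy
```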
`sed` is an old tool, but it still has its uses. If you ever need to work with a subset of a large amount of text data, you could do a lot worse than `sed`.