A bit of Linux play. How to turn an active unstructured log file into structured data.
image source: nightcafe
Like many programs, my The Things Network Gateway generates a verbose log file. A lot of information. Only every so many lines (more than 1000) is there data that I want to track: when a Thing sends data to my Gateway. I want to use that data while the log is actively being written to by the Gateway.
Line of interest:
I want to capture lines with that structure. The rest of the log (the vast majority) is not what I want to retrieve and process.
This is what I want to get out of that active log. Both on the console and in a .csv file:
rules
- I only want to track Things that contact my gateway.
I can filter the relevant lines because they all contain this unique text: ": INFO: Received pkt from mote:"
Jan 20 19:54:22 raspberryttn lora_pkt_fwd[23528]: INFO: Received pkt from mote: 260BCF0F (fcnt=0)
From such a line I am interested in three parts: date and time, Thing identifier and message counter (the example after this list shows the record that should come out of it).
- These values should be written to a .csv file, with ';' as separator. And to the console.
- Everything else should be ignored
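For the sample line above, the structured record that should come out (with ';' as the separator) looks like this:

Jan;20;19:54:22;260BCF0F;0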
Here is a Linux command line that will do this. I will break it apart in this post.
Where you see BAR, it is actually var. The e14 forum does not accept slash var slash.
This is one single command. The \ characters allow me to break it over six lines and make it fit on screen.
stdbuf -o0 tail -f /BAR/log/lora_pkt_fwd.log 2> /dev/null | \
stdbuf -i0 -o0 awk '/: INFO: Received pkt from mote:/ {print $1, $2, $3, $11, $12}' | \
stdbuf -i0 -o0 cut -d ' ' -f 1,2,3,4,5 --output-delimiter=';' | \
stdbuf -i0 -o0 sed 's/(fcnt=//g' | \
stdbuf -i0 -o0 sed 's/)//g' | \
stdbuf -i0 -o0 tee output.csv
It's a pipeline:
- tail -f keeps reading the log file as it is written. It streams every new line that's added to the log. The 2> /dev/null part makes sure that only log file content is forwarded: warnings and errors (e.g. when the log is rotated by Linux every week) are ignored.
- awk finds the relevant lines and retrieves the fields that we need (date, time, identifier, counter). Any line that doesn't match the pattern is ignored. A worked example of these steps follows after this list.
- cut changes the separator from a space to a semicolon
- sed is called twice, to strip the fixed text that I don't want from the counter field (first the "(fcnt=" prefix, then the closing ")")
- tee sends the stream both to standard out (the console) and to a csv file.
- stdbuf -i0 -o0 (or stdbuf -o0) disables buffering. It is used before every command in the pipeline, because I want to extract events in real time.
If I were processing a megabyte log file in one go, I would not need these. But I want to get every event as it happens.
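To make this concrete, here is the sample log line from above traced through each stage (a sketch of what I expect each command to emit, following from the field positions in my log format):

after awk (fields 1, 2, 3, 11, 12):  Jan 20 19:54:22 260BCF0F (fcnt=0)
after cut (';' as output delimiter): Jan;20;19:54:22;260BCF0F;(fcnt=0)
after the first sed (drop "(fcnt="): Jan;20;19:54:22;260BCF0F;0)
after the second sed (drop ")"):     Jan;20;19:54:22;260BCF0F;0

tee then prints that last line on the console and appends it to output.csv.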
The output is a structured (character separated values) file.
Jan;20;21:34:47;D3FB2208;4938
Jan;20;21:35:52;0030D318;475
Jan;20;21:45:52;0030D318;476
Jan;20;21:46:52;A6BC7955;29136
Jan;20;21:54:52;260BCF0F;1
Jan;20;21:54:52;260BCF0F;2
Jan;20;21:56:22;0030D318;477
Jan;20;21:57:22;7D42C57D;20388
Jan;20;21:58:48;0048BAB4;802
Jan;20;21:58:48;2601643B;53122
Jan;20;22:05:34;0030D318;478
Jan;20;22:16:44;0030D318;479
Jan;20;22:16:44;FF007FF8;84
Jan;20;22:21:53;179BEEF2;50279
Jan;20;22:26:23;0030D318;480
Jan;20;22:35:53;0030D318;481
This output can be piped into a processor script running as a different process. That could be a database writer, an MQTT publisher, or a program that flashes an LED when an upload is received.
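As a minimal sketch of such a processor (the script name, the field names and the echo action are placeholders I made up; they are not part of my setup), the records can be read from standard input like this:

# processor.sh - hypothetical consumer, reads one structured record per line from stdin
while IFS=';' read -r month day time mote fcnt; do
  # replace this echo with a database insert, an mqtt publish, or a GPIO toggle
  echo "upload from $mote (message $fcnt) at $month $day $time"
done

It would be hooked on by appending one more stage to the pipeline above: ... | stdbuf -i0 -o0 tee output.csv | bash processor.sh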
This is performant. I'm running it on a Pi 3. And it uses virtually no resources.
If I replace the tail command (which reads only what the Gateway appends to the log) with a cat command (which dumps the full 40 MB log into the pipeline), it finishes in a second.
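For reference, that batch variant is the same pipeline with cat instead of tail, and without the stdbuf prefixes (a sketch; BAR again stands for var):

cat /BAR/log/lora_pkt_fwd.log | \
awk '/: INFO: Received pkt from mote:/ {print $1, $2, $3, $11, $12}' | \
cut -d ' ' -f 1,2,3,4,5 --output-delimiter=';' | \
sed 's/(fcnt=//g' | \
sed 's/)//g' | \
tee output.csv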
why this way?
- non-intrusive: every command is standard Linux.
- non-intrusive: the command accepts the log file of the Gateway as-is. No software changes.
- non-intrusive: it does not require additional installs.
- non-intrusive: this isn't a program or a shell script. It's just a command.
- non-intrusive: it can be run by a generic user. No sudo needed.
- the output file is also streamable by other processes. It can be consumed (e.g. by another tail command) while this command is writing to it.
image: one process parses the raw Gateway log. A second process monitors the structured csv file in parallel.
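A minimal sketch of that second, parallel process (the per-Thing message counter is just an illustration I added, not something from my setup): it tails the csv file while the pipeline above keeps appending to it:

tail -f output.csv | awk -F';' '{count[$4]++; print $4 " sent fcnt " $5 ", " count[$4] " uploads seen so far"}'

Because output.csv only ever grows, tail -f picks up each new record the moment tee writes it.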