Humanize

I scratched an itch last week and actually wrote some code (these days my primary IDE is Google Docs). Check out humanize, which replaces any large numbers in text piped into it with shortened versions using SI prefix symbols (K, M, G, T, etc.). So “File size: 972917401236” becomes “File size: 973G”.

This came about because I was debugging an issue with a Parquet file using parquet-tools, whose output looks like:

% parquet-tools meta part-01373-r-01373.gz.parquet | grep "row group"
row group 1:           RC:100 TS:121513011 OFFSET:4 
row group 2:           RC:100 TS:44877183 OFFSET:14293218 
row group 3:           RC:111 TS:28645460 OFFSET:19704281 
row group 4:           RC:168 TS:31794536 OFFSET:23387243 
row group 5:           RC:363 TS:28078938 OFFSET:27531686 
row group 6:           RC:15547 TS:899825873 OFFSET:31242035 
row group 7:           RC:9963 TS:47930700 OFFSET:142353955 
row group 8:           RC:20100 TS:911417774 OFFSET:155302131 
row group 9:           RC:20100 TS:889810662 OFFSET:269749697 
row group 10:          RC:10100 TS:855908709 OFFSET:382090935 
row group 11:          RC:7282 TS:1532744480 OFFSET:486019001 

Tired of squinting and counting digits to see how big some of these numbers were, I figured there’d be some easy bash/sed way to convert these numbers into something more readable. It is pretty easy to do this for isolated numbers using numfmt:

% echo "1234567890" | numfmt --to=si --round=nearest
1.2G

But that doesn’t work on numbers embedded in text:

% echo "This is a big number: 1234567890" | numfmt --to=si --round=nearest
numfmt: invalid number: ‘This’

GNU sed can do a replacement with the result of a function, but it’s rather ugly:

% echo "This is a big number: 1234567890" | sed -re 's/[0-9]+/`numfmt --to=si --round=nearest &`/g;s/.*/echo &/;e'
This is a big number: 1.2G

But I’m doing this work on a Mac, which does not come with GNU sed (or numfmt, for that matter, although you can brew install coreutils to get gnumfmt). I started working on a perl one-liner to do it, but decided this was something that needed to exist on its own, and should be easy for anyone to install.

I chose to write it in Go because it has an easy way for people to install binaries with go get. (Rust has something similar with cargo install, but Go is more common in my workplace.) The problem with go get is that you have to run it again to upgrade, so for my fellow Mac users I added a Homebrew tap.

Long story short:

% parquet-tools meta part-01373-r-01373.gz.parquet | grep "row group" | humanize -binary
row group 1:           RC:100 TS:116Mi OFFSET:4 
row group 2:           RC:100 TS:43Mi OFFSET:14Mi 
row group 3:           RC:111 TS:27Mi OFFSET:19Mi 
row group 4:           RC:168 TS:30Mi OFFSET:22Mi 
row group 5:           RC:363 TS:27Mi OFFSET:26Mi 
row group 6:           RC:15Ki TS:858Mi OFFSET:30Mi 
row group 7:           RC:10Ki TS:46Mi OFFSET:136Mi 
row group 8:           RC:20Ki TS:869Mi OFFSET:148Mi 
row group 9:           RC:20Ki TS:849Mi OFFSET:257Mi 
row group 10:          RC:10Ki TS:816Mi OFFSET:364Mi 
row group 11:          RC:7Ki TS:1Gi OFFSET:464Mi