Multigrep

Please keep in mind that this post is about 3 years old.
Technology may have changed in the meantime.

I am an author. And even though the actual books are well organized, the writing process isn’t always. For a single book I have many megabytes of PDFs, ODTs and text files full of notes, drafts, documentation, etc.
So I needed a simple tool to find that one note I once wrote, in that huge pile of data.
Now, if those files were all text-based, I could use grep. But they aren’t. So I wrote a wrapper around grep that also lets me search PDFs and OpenDocument files.

This script searches all PDFs, OpenDocument files and text-based files in the current directory for a given term. The search is case-insensitive and does not recurse into subdirectories; directories and files of any other type are reported and skipped.

Example:

$ multigrep 'SSH key'
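
As a rough sketch of what that command does to the text of each file: the arguments are joined into one pattern, matched case-insensitively, and every hit is printed with the file name in front. The helper name search_text, the label notes.txt and the sample lines are made up for illustration; they are not part of multigrep.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of multigrep: mimics how the text of one
# file is filtered and labelled. "label" stands in for the file name.
search_text() {
	local label="$1"
	shift
	local pattern="$*"   # remaining arguments, joined by single spaces
	# Case-insensitive match, then swap leading spaces for the label
	# and strip trailing spaces.
	grep -i -- "${pattern}" | sed -e "s/^ */${label}: /;s/ *$//"
}

# The pattern is given unquoted here; it is still searched as "SSH key".
printf '%s\n' 'Intro' '   Generate an ssh KEY first.  ' 'Unrelated line' |
	search_text notes.txt SSH key
# prints: notes.txt: Generate an ssh KEY first.
```

Because the arguments are joined with single spaces, quoting the term only matters when it contains characters the shell would otherwise interpret, or more than one space in a row.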

Here’s the script:

#!/usr/bin/env bash

################################################################################
#                                 multigrep
#                                 by Rob
#
# rob@ohreally.nl
# https://www.ohreally.nl
################################################################################

# This script searches all PDF, OpenDocument and text-based files
# in the current directory for the given string.

# For more info, see
# https://www.ohreally.nl/2020/12/11/multigrep/

################################################################################

# Copyright (c) 2020 Rob La Lau 

# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

################################################################################

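# Print the full path of a required tool, or complain on stderr and fail.
require() {
	command -v "$1" || { echo "multigrep: '$1' not found in PATH" >&2; return 1; }
}

CAT=$(require cat)             || exit 1
FILE=$(require file)           || exit 1
GREP=$(require grep)           || exit 1
PDF2TXT=$(require pdf_to_text) || exit 1
SED=$(require sed)             || exit 1
TIDY=$(require tidyp)          || exit 1
UNZIP=$(require unzip)         || exit 1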

# Join all arguments into a single search pattern.
pattern="$*"

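# Search stdin for the pattern and prefix every hit with the file name.
# grep reads the whole stream itself; wrapping it in a "while read" loop
# would swallow the first line of every file before grep ever saw it.
process() {
	"${GREP}" -i -- "${pattern}" | "${SED}" -e "s/^ */${file}: /;s/ *$//"
}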

for file in *; do
	[ -d "${file}" ] && {
		echo "  Directory : ${file}"
		continue
	}
	mimetype=$("${FILE}" --brief --mime-type "${file}")
	major=${mimetype%%/*}
	minor=${mimetype##*/}
	case "${major}" in
		application)
			case "${minor}" in
				pdf)
					# PDF
					"${PDF2TXT}" --file "${file}" 2> /dev/null | process
					;;
				vnd.oasis.opendocument.*)
					# OpenDocument (all)
					"${UNZIP}" -p "${file}" | "${TIDY}" -quiet -xml -bare -utf8 2> /dev/null | process
					;;
				*)
					echo "  Unknown file type (${mimetype}) : ${file}"
					;;
			esac
			;;
		text)
			# text/*
			"${CAT}" "${file}" | process
			;;
		*)
			echo "  Unknown file type (${mimetype}) : ${file}"
			;;
	esac
done
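
The way the script picks a handler from the MIME type can be tried in isolation. The sketch below is only an illustration: classify_mimetype is a hypothetical helper, it flattens the two nested case statements into one, and it prints a label instead of running a converter.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of multigrep: names the branch that the
# script's case statements would take for a given MIME type string.
classify_mimetype() {
	local mimetype="$1"
	local major="${mimetype%%/*}"   # text before the first "/"
	local minor="${mimetype##*/}"   # text after the last "/"
	case "${major}/${minor}" in
		application/pdf)                      echo pdf ;;
		application/vnd.oasis.opendocument.*) echo opendocument ;;
		text/*)                               echo text ;;
		*)                                    echo unknown ;;
	esac
}

classify_mimetype application/pdf                           # prints: pdf
classify_mimetype application/vnd.oasis.opendocument.text   # prints: opendocument
classify_mimetype text/x-shellscript                        # prints: text
classify_mimetype image/png                                 # prints: unknown
```

Matching on the MIME type reported by file, rather than on the file name extension, means a misnamed or extensionless file is still handled according to its actual content.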

This script can also be downloaded from my GitHub account.

Changelog

  • 2020-12-19
    Added message “Unknown file” for application/* mime-type.
  • 2020-12-11
    Initial publication.
  • 2020-??-??
    Conception.

REPUBLISHING TERMS

You may republish this article online or in print under our Creative Commons license. You may not edit or shorten the text, you must attribute the article to OhReally.nl and you must include the author’s name in your republication.

If you have any questions, please email rob@ohreally.nl

License

Creative Commons Attribution