Okay, so I didn’t make a new language from scratch so much as a superset of an existing one. Here’s the problem: shell commands do some things incredibly well, while Python does a lot of other things well. I mainly code in Python, but I often want shell functionality. Not that I can’t do it in Python, just that I’m too lazy to. As a specific example, I needed to get all lines in a file beginning with “iteration”. In the shell, this is trivial using the grep command. In Python, however, you’d have to open the file, read all lines, and then use a list comprehension or filter to get the lines you need: too much work. So I designed a very basic extension to Python that would work as follows:
lines = list`grep ^iteration {filename}`
Boom, easy. We create a syntax that uses backticks to call shell commands, allow Python expressions inside them (in curly braces), and, for added convenience, create formatters (in this case, list) that convert the output to the type we want. Now for the hard part: implementing this.
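For comparison, here’s roughly what the plain-Python version of that one-liner looks like. This is only a sketch: the helper name grep_prefix and the sample file contents are made up for illustration.

```python
import os
import tempfile


def grep_prefix(filename, prefix):
    """Return the lines of `filename` that start with `prefix` (like grep ^prefix)."""
    with open(filename) as f:
        return [line.rstrip("\n") for line in f if line.startswith(prefix)]


# Build a small throwaway file so the example runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("iteration 1: loss 0.9\nsetup done\niteration 2: loss 0.7\n")
    sample_path = tmp.name

matches = grep_prefix(sample_path, "iteration")
os.remove(sample_path)
print(matches)
```

Not a lot of code, but it’s still a helper function and a file handle for something the shell does in three words.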
The core idea
We’d like to transpile this to valid Python and then run it as Python code, which is the simplest implementation. Fundamentally, we need to do three things: substitute the expressions inside curly braces, run the shell command, and then convert the output to the type specified. For the first, we’ll simply take advantage of Python’s f-strings. Done. For the second, we’ll use the subprocess module. Specifically, we’ll use subprocess.Popen, since I couldn’t get subprocess.run to work with pipes (guess who learned that the hard way).

We’ll need a host language to implement the transpiler. We could do this in Python, but that’s easy and boring. Let’s use C++! C++ does string manipulation pretty well. Plus, with C-style loops, there’s little confusion about things like what the end index is (looking at you, slices). And I was getting rusty in C++ anyway; it’s about time I played with it a bit. Within C++, we need to choose a standard. We’ll use C++20, not just because it’s the newest, but more importantly because it adds the starts_with member function for strings, which we’ll need later. It also has all the goodness of C++11 through C++17. For C++, I prefer CLion over VS Code, because it’s really good at showing me best practices.
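To make the target concrete, here’s a hand-written sketch of what the one-liner from earlier could transpile to. The Popen call follows the three steps described above; the final splitlines() line is an assumption about what a list formatter would emit, not the actual generated code, and the sample file is made up. It also assumes a POSIX shell with grep available.

```python
import os
import subprocess
import tempfile

# Throwaway input file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("iteration 1\nwarmup\niteration 2\n")
    filename = tmp.name

# Steps 1 and 2: the f-string fills in {filename}, then the shell runs the command.
_ = subprocess.Popen(
    f'grep ^iteration {filename}',
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
).communicate()[0].decode('utf-8')

# Step 3 (assumed formatter behavior): turn the raw output into a list of lines.
lines = _.splitlines()

os.remove(filename)
print(lines)
```

The temporary variable _ is what lets the original assignment survive almost untouched: the backtick expression is simply swapped for _ once the output has been captured.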
The main code
Our main code starts off trivial enough. We’ll expect one command-line argument, the source filename, and we’ll write to out.py. Let’s parse our input file line by line:
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <file>\n";
        return 1;
    }

    std::ifstream fin(argv[1]);
    std::ofstream fout("out.py");
    // Every transpiled file needs subprocess for the shell calls.
    fout << "import subprocess\n\n";

    std::string line;
    while (std::getline(fin, line)) {
        process_line(line, fout);
    }
}
We’ll then run the generated code, appending the current directory to the PATH environment variable along the way:
const char* path = std::getenv("PATH");
std::filesystem::path cur_path = std::filesystem::current_path();
// getenv returns nullptr if PATH is unset, so guard before building the string.
std::string new_path = (path ? std::string(path) + ":" : std::string()) + cur_path.string();
if (setenv("PATH", new_path.c_str(), 1) != 0)
    throw std::runtime_error("Failed to set PATH");
std::system("python3 out.py");
I use macOS and occasionally Linux, so I’m not targeting Windows users here. It’s easy enough to make it work for them; I just don’t really want to. Next, we need to implement the process_line function.
Transpiling a line
We’ll first need to deal with indents at the beginning of the line. That means keeping track of indents from previous lines. Thank God we chose a language that has the static keyword. We could use global variables, but that’s getting into dubious practices.
std::ostream& process_line(std::string& line, std::ostream& out)
{
    static int indent_level = 0;
    bool spaces_to_indent = true;

    std::smatch match;
    std::regex regex("^\\s+");

    // Add in indents
    if (std::regex_search(line, match, regex)) {
        // Count the number of indents, starting fresh for this line
        // so that tab counts don't accumulate across lines.
        indent_level = 0;
        int spaces = 0;
        for (char c : match[0].str()) {
            if (c == '\t') {
                spaces_to_indent = false;
                indent_level++;
            }
            else if (c == ' ')
                spaces++;
        }

        if (spaces_to_indent)
            indent_level = spaces / 4;

        // Now replace the indents
        line.replace(match.position(), match.length(), match[0].str());
    } else {
        indent_level = 0;
    }
    ...
}
Similar to an overloaded operator<<, we take an ostream by reference and return it after writing to it. Most of this code is pretty trivial. With Python in particular, we need to keep track of whether indents are spaces or tabs, since Python 3 rejects indentation that mixes the two inconsistently. I’m sure the actual counting could be done in a simpler way, but Copilot filled this in and I wasn’t complaining.
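If you’d rather see the indent bookkeeping without the regex machinery, here’s the same idea as a small Python sketch. The function name measure_indent is hypothetical, and it assumes four spaces per level, just like the C++ above.

```python
def measure_indent(line, spaces_per_level=4):
    """Return (indent_level, uses_spaces) for the leading whitespace of `line`."""
    leading = line[: len(line) - len(line.lstrip(" \t"))]
    tabs = leading.count("\t")
    if tabs:
        # Any tab means the line is tab-indented: one level per tab.
        return tabs, False
    # Otherwise count spaces and divide by the assumed width of one level.
    return len(leading) // spaces_per_level, True


print(measure_indent("        x = 1"))
print(measure_indent("\t\tx = 1"))
print(measure_indent("x = 1"))
```

Returning both values together mirrors the pair the C++ code tracks (indent_level and spaces_to_indent), which is all the later stages need to re-emit the indentation.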
Now for the fun part: we look for backticks in the line. Pretty much the rest of this function lives inside this if-block.
if (line.find('`') != std::string::npos) {
    ...
} else {
    out << line << "\n";
}

return out;
We’ll start by finding the indices of all backticks. But wait: what if you need to, say, print a backtick within Python? We need an escaping mechanism. Sigh. We’ll start by getting some code from Stack Overflow to do string replacement:
// Replace all occurrences of a substring within a string
// from https://stackoverflow.com/a/28766792/2713263
std::string string_replace( const std::string & s, const std::string & findS, const std::string & replaceS )
{
    std::string result = s;
    auto pos = s.find( findS );
    if ( pos == std::string::npos ) {
        return result;
    }
    result.replace( pos, findS.length(), replaceS );
    return string_replace( result, findS, replaceS );
}
and then use this in the condition:
// Parse template args in the string.
if (line.find('`') != std::string::npos) {
    // Find all indices.
    std::vector<size_t> cmd_idx;
    size_t cur_tick_idx = 0;
    size_t next_tick_idx;

    // Find all backticks
    while ((next_tick_idx = line.find('`', cur_tick_idx)) != std::string::npos) {
        // First, check that it is not escaped.
        static std::vector<std::pair<std::string, std::string>> patterns = {
            { "\\\\" , "\\" },
            { "\\`", "`" },
        };
        for ( const auto & p : patterns ) {
            line = string_replace( line, p.first, p.second );
        }
        // A backtick at index 0 has no preceding character, so it cannot be escaped.
        if (next_tick_idx == 0 || line[next_tick_idx - 1] != '\\')
            cmd_idx.push_back(next_tick_idx);
        cur_tick_idx = next_tick_idx + 1;
    }
...
The replacement helper is from Stack Overflow and is just a simple recursive call. Other than that, we’re just collecting indices here. (Note: at the time of writing, this part isn’t well tested, so it could have bugs. I’ll get to it soon. C’est la vie.)
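Since this part is the shakiest, here’s one way the index collection could work, sketched in Python so it’s easy to poke at. It’s an alternative take rather than a port of the C++ above: it walks the line once, treats a backslash as escaping the next character, and records only unescaped backticks. The name find_command_ticks is made up.

```python
def find_command_ticks(line):
    """Return the indices of backticks in `line` that are not backslash-escaped."""
    indices = []
    i = 0
    while i < len(line):
        if line[i] == "\\":
            # Skip the escaped character, whatever it is (including a backtick).
            i += 2
            continue
        if line[i] == "`":
            indices.append(i)
        i += 1
    return indices


print(find_command_ticks("lines = list`grep ^iteration {filename}`"))
print(find_command_ticks(r"print('\`') or `ls`"))
```

Doing the escape check during the scan, instead of rewriting the line while searching it, keeps every recorded index valid for the original string.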
Now it’s time for the formatter logic. We need to replace the stuff inside backticks with a subprocess call and then implement formatting. Let’s start with the injection:
// Begin substitution using formatters.
// Backticks come in pairs: i is the opening tick, j the closing one.
for (size_t i{}, j{1}; j < cmd_idx.size(); i += 2, j += 2) {
    std::string substr = line.substr(cmd_idx[i] + 1, cmd_idx[j] - cmd_idx[i] - 1);

    // Add in indents
    if (spaces_to_indent)
        out << std::string(indent_level * 4, ' ');
    else
        out << std::string(indent_level, '\t');

    // Inject subprocess call
    out << "_ = subprocess.Popen(f'" << substr << "', shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT).communicate()[0].decode('utf-8')\n";
This is really just basic string manipulation. With some pencil and paper, it’s easy to see why this works. Note the use of f-strings here: the command goes inside an f-string so that Python fills in the curly-brace expressions for us. Next, let’s check for formatters. Formatter names may contain alphanumeric characters, dots, and underscores.
    // Check for formatters
    if (cmd_idx[i] > 0 && (std::isalnum(line[cmd_idx[i] - 1]) || line[cmd_idx[i] - 1] == '_')) {
        int k;
        for (k = cmd_idx[i] - 2; k >= 0; --k) {
            if (!std::isalnum(line[k]) && line[k] != '_' && line[k] != '.')
                break;
        }

        std::string format = line.substr(k + 1, cmd_idx[i] - k - 1);

        // Now that we have the format, remove it from the line.
        line.erase(k + 1, cmd_idx[i] - k - 1);
        cmd_idx[i] -= format.size();
        cmd_idx[j] -= format.size();
Just more string manipulation. Once we get the format, we need to remove it from the line so that it doesn’t linger in the final code. Now that we have the format, we just need to apply it:
        // Apply formatter
        // If "str", do nothing.
        if (format != "str") {
            type_formatter formatter{format, indent_level, spaces_to_indent};
            std::string formatted = formatter.format();

            // Add indents
            if (spaces_to_indent)
                out << std::string(indent_level * 4, ' ');
            else
                out << std::string(indent_level, '\t');

            // Output formatted string
            out << formatted;
        }
    }

    // Now, replace the part in backticks with our variable
    out << line.replace(cmd_idx[i], cmd_idx[j] - cmd_idx[i] + 1, "_") << "\n";
We’ve delegated the work to a type_formatter class, which we’ll define in a minute. But that’s our full main file, in fewer than 200 lines of code! Let’s look at the header file first:
class basic_formatter
{
public:
    basic_formatter() = default;
    virtual ~basic_formatter() = default;

    [[nodiscard]] virtual std::string format() const = 0;
};

class type_formatter : public basic_formatter
{
    std::string fmt;
    int indent_level;
    bool spaces_to_indent;

    [[nodiscard]] std::string get_indent_string(bool) const;
    [[nodiscard]] std::string get_safe_formatter() const;

public:
    explicit type_formatter(std::string, int, bool);
    [[nodiscard]] std::string format() const override;
};
We have an abstract base class and a class implementing the logic. Note the use of the C++17 [[nodiscard]] attribute. This was a CLion suggestion that I absolutely agree with (before this, I didn’t know it existed). The implementations of the functions are pretty simple, so I won’t bother pasting that code here. You can see the full code on GitHub.
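To tie the pieces together, here’s a compact Python sketch of the rewrite that process_line performs for a line with a single backtick command: find the command, peel off the formatter name in front of it, emit the subprocess call, and substitute _ back into the line. The names transpile_line, FORMATTERS, and COMMAND are hypothetical, escaping is ignored, and since the formatter implementations aren’t shown in this post, the list entry below is an assumption rather than the real output of type_formatter.

```python
import re

# Assumed formatter table: maps a formatter name to the Python line it emits.
# The real type_formatter output may differ.
FORMATTERS = {"list": "_ = _.splitlines()"}

# An optional formatter name (letters, digits, dots, underscores, not ending
# in a dot) glued to a backtick-delimited command.
COMMAND = re.compile(r"(?P<fmt>(?:[A-Za-z0-9_.]*[A-Za-z0-9_])?)`(?P<cmd>[^`]*)`")


def transpile_line(line):
    """Transpile one line holding at most one backtick command into Python lines."""
    match = COMMAND.search(line)
    if match is None:
        return [line]
    indent = line[: len(line) - len(line.lstrip())]
    out = [
        f"{indent}_ = subprocess.Popen(f'{match['cmd']}', shell=True, "
        "stdout=subprocess.PIPE, stderr=subprocess.STDOUT)"
        ".communicate()[0].decode('utf-8')"
    ]
    fmt = match["fmt"]
    if fmt and fmt != "str":
        # Unknown formatters fall back to calling the name on the output.
        out.append(indent + FORMATTERS.get(fmt, f"_ = {fmt}(_)"))
    # Replace formatter + command with the temporary variable.
    out.append(line[: match.start()] + "_" + line[match.end():])
    return out


for generated in transpile_line("lines = list`grep ^iteration {filename}`"):
    print(generated)
```

A regex does the job of the backward character scan here; the C++ version walks indices by hand, which is more code but makes it easier to support several commands on one line.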
What’s next?
It’s not enough to have a transpiler. We need IDE support. In Part 2, I’ll talk about creating an extension for VS Code to add syntax highlighting for our brand-new language.