what 262,715 regex questions on stack overflow haven't answered
complement, intersection, and why the most common regex workarounds are hundreds of times slower than they need to be
i'm a phd candidate at taltech in estonia, working at the high-assurance software laboratory. my research is mostly about automata theory and string algorithms, with a focus on high-performance regex engines, but also anything else to do with text and streams. if you need to search through your email, process tons of documentation, or find that one file from your computer, i'm your guy (!). i've been looking for more industrial applications to my research post phd, so feel free to reach out if you are looking for consulting or collaboration on anything related to my work.
i work with my advisors Juhan Ernits (taltech) and Margus Veanes (microsoft research) on Derivative-Based Symbolic Automata, an incremental approach to hard problems by pushing heavy computation into lazy step-by-step unwinding of automata - i.e., solving only as much as needed, when needed. we've applied this to regex matching, model checking, and string constraint solving, with promising results.
the main project i've been working on for the past few years is RE#, a high-performance regex engine in F# and Rust. RE# supports extended regex features (lookarounds, intersection, complement) while being fast and memory efficient thanks to derivative-based algorithms and optimizations. some discoveries we made with RE# helped improve the performance of the .NET NonBacktracking regex engine, originally developed by my advisor Margus Veanes, as well as inspire other projects, like the LLM tokenizer for the AICI project at microsoft (the thing that helps constrain LLM outputs).
i'm a big fan of F# and functional programming in practice, and i'm constantly working on various smaller projects and libraries in my spare time, which you can find on my github. some highlights include fflat, a tiny native compiler for F#, built on top of bflat, a plugin for obsidian, a note taking app i use daily and r-toml which is just a little demo of how automata based techniques can process .toml files vastly faster than traditional parsers. i enjoy microoptimizing things and do a lot of benchmarking both in research and other projects, so if you like reading assembly or want 'the fastest possible' implementations of something, you might find something interesting through my repos.
complement, intersection, and why the most common regex workarounds are hundreds of times slower than they need to be
every regex engine that promises linear time breaks that promise the moment you ask for all matches. the problem has been there since the 70s, hiding in the iteration loop.