Hi,
I wasn't sure which list would be appropriate for this question, but since there are many knowledgeable members on this one, I thought I'd ask here.
I'm trying to write a regular expression that matches function and class definitions in C/C++ and defuns in lisp code. I intend to use it with sed and `git blame'. My first attempt relies on indentation. That obviously breaks rather often.
So I was wondering if there was a way to "count" the braces and parentheses with regular expressions. If that is possible, I could easily count and find the matching brace.
Here is an example use I was thinking of.
$ git blame -L "/[a-zA-Z0-9_*]+ +${method_name}/,/^}$/" filename.cxx
Thanks for any hints.
On Sun, Mar 18, 2012 at 23:46, suvayu ali fatkasuvayu+linux@gmail.com wrote:
So I was wondering if there was a way to "count" the braces and parentheses with regular expressions. If that is possible, I could easily count and find the matching brace.
I forgot to mention, the hints I found with some searching don't quite work with multiline matches.
On Sun, Mar 18, 2012 at 23:46:17 +0100, suvayu ali fatkasuvayu+linux@gmail.com wrote:
I'm trying to write a regular expression that matches function and class definitions in C/C++ and defuns in lisp code. I intend to use it with sed and `git blame'. My first attempt relies on indentation. That obviously breaks rather often.
Mathematically, regular expressions can't match braces like this to an unbounded depth. You might be able to use extensions to common regular expression implementations that aren't strictly regular expressions to do this. It might be easier to write your own parser. There are tools, like flex and bison to help with this.
On 18Mar2012 20:19, Bruno Wolff III bruno@wolff.to wrote:
| On Sun, Mar 18, 2012 at 23:46:17 +0100,
|   suvayu ali fatkasuvayu+linux@gmail.com wrote:
| >
| >I'm trying to write a regular expression that matches function and class
| >definitions in C/C++ and defuns in lisp code. I intend to use it with
| >sed and `git blame'. My first attempt relies on indentation. That
| >obviously breaks rather often.
|
| Mathematically, regular expressions can't match braces like this to an
| unbounded depth. You might be able to use extensions to common regular
| expression implementations that aren't strictly regular expressions to do this.
In particular, regular expressions are not recursive, and hence are not capable of matching arbitrarily nested constructs. But you _can_ construct one to match up to a certain depth. For example, four or five deep probably covers most things in reasonable code (lisp excluded; that is naturally very bracket intensive).
If you're using sed you probably want the "extended regular expressions mode", turned on by -E in GNU sed IIRC.
Personally, for ease of debugging, I would construct the regexp from smaller pieces in shell. Untested example:
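    no_br='[^()]*'                             # a run of text with no brackets
    upto1="${no_br}(\(${no_br}\)${no_br})*"    # brackets nested at most 1 deep
    upto2="${no_br}(\(${upto1}\)${no_br})*"    # at most 2 deep, and so on
    upto3="${no_br}(\(${upto2}\)${no_br})*"
    upto4="${no_br}(\(${upto3}\)${no_br})*"
    # print lines whose brackets are unbalanced or nested more than 4 deep
    sed -E -n "/^(${upto4})$/!p" <blame-data >blame-bad-bracketing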
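To make the depth limit concrete, here is a minimal Python sketch of the same layered construction. The helper name `balanced_upto` is made up for this illustration, and it only looks at round brackets:

```python
import re

def balanced_upto(depth):
    """Build a regex matching text whose () nest at most `depth` deep."""
    no_br = r"[^()]*"          # a run of text with no brackets
    pattern = no_br            # depth 0: no brackets at all
    for _ in range(depth):
        # one more level: plain text interleaved with bracketed groups
        pattern = rf"{no_br}(?:\({pattern}\){no_br})*"
    return re.compile(pattern)

upto2 = balanced_upto(2)
print(bool(upto2.fullmatch("f(a, g(b)) + h(c)")))   # True: nesting depth 2
print(bool(upto2.fullmatch("f(g(h(x)))")))          # False: nesting depth 3
print(bool(upto2.fullmatch("f(a, g(b)")))           # False: unbalanced
```

Each extra level has to be spelled out in the pattern, which is why this only works up to a depth chosen in advance.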
| It might be easier to write your own parser. There are tools, like flex
| and bison to help with this.
Indeed. Or you could write a hand-rolled recursive descent parser in whatever language you like, provided it makes character-by-character access fairly easy (C, python, etc; not awk or sed).
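As a rough sketch of the hand-rolled approach, a character-by-character brace counter in Python might look like the following. The function name `find_block_end` is hypothetical, and the sketch assumes no braces occur inside strings, character literals or comments, which real C/C++ code will violate now and then:

```python
def find_block_end(lines, start):
    """Return the 1-based line number of the brace that closes the first
    block opened at or after line `start` (1-based), or None if the block
    never closes.
    """
    depth = 0
    opened = False
    for lineno in range(start, len(lines) + 1):
        for ch in lines[lineno - 1]:
            if ch == "{":
                depth += 1
                opened = True
            elif ch == "}" and opened:
                depth -= 1
                if depth == 0:
                    return lineno
    return None

source = """\
int sum(int n)
{
    int s = 0;
    for (int i = 0; i < n; i++) {
        if (i % 2) { s += i; }
    }
    return s;
}
int other(void) { return 0; }
""".splitlines()

end = find_block_end(source, 1)
print(f"git blame -L 1,{end} filename.cxx")   # prints: git blame -L 1,8 filename.cxx
```

The start line still has to be found some other way (a regexp on the function name, say); the counter only replaces the fragile `/^}$/` end pattern with an explicit line number for `git blame -L`.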
Cheers,
Hi Bruno and Cameron,
On Mon, Mar 19, 2012 at 07:04, Cameron Simpson cs@zip.com.au wrote:
On 18Mar2012 20:19, Bruno Wolff III bruno@wolff.to wrote:
| On Sun, Mar 18, 2012 at 23:46:17 +0100,
|   suvayu ali fatkasuvayu+linux@gmail.com wrote:
| >
| >I'm trying to write a regular expression that matches function and class
| >definitions in C/C++ and defuns in lisp code. I intend to use it with
| >sed and `git blame'. My first attempt relies on indentation. That
| >obviously breaks rather often.
|
| Mathematically, regular expressions can't match braces like this to an
| unbounded depth. You might be able to use extensions to common regular
| expression implementations that aren't strictly regular expressions to do this.
In particular, regular expressions are not recursive, and hence are not capable of matching arbitrarily nested constructs. But you _can_ construct one to match up to a certain depth. For example, four or five deep probably covers most things in reasonable code (lisp excluded; that is naturally very bracket intensive).
If you're using sed you probably want the "extended regular expressions mode", turned on by -E in GNU sed IIRC.
Personally, for ease of debugging, I would construct the regexp from smaller pieces in shell. Untested example:
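    no_br='[^()]*'                             # a run of text with no brackets
    upto1="${no_br}(\(${no_br}\)${no_br})*"    # brackets nested at most 1 deep
    upto2="${no_br}(\(${upto1}\)${no_br})*"    # at most 2 deep, and so on
    upto3="${no_br}(\(${upto2}\)${no_br})*"
    upto4="${no_br}(\(${upto3}\)${no_br})*"
    # print lines whose brackets are unbalanced or nested more than 4 deep
    sed -E -n "/^(${upto4})$/!p" <blame-data >blame-bad-bracketing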
| It might be easier to write your own parser. There are tools, like flex
| and bison to help with this.
Indeed. Or you could write a hand-rolled recursive descent parser in whatever language you like, provided it makes character-by-character access fairly easy (C, python, etc; not awk or sed).
I was expecting this to be the case; after all, using regular expressions is not the same as using a programming language. I'll think about how I can adapt my use case as per your suggestions. Thanks a lot for the different pointers. :)
Cheers,