Grammar

Queries are parsed with Lark. A LALR(1) parser is generated from the following grammar, which can be found at sci_watch/assets/grammar.lark.

query                           : factor
                                | query AND factor -> and_clause
                                | query OR factor -> or_clause
                                | query NOT factor -> not_clause

!factor                         : token -> token
                                | INTITLE_KW COLON scoped_token -> in_title_clause
                                | INCONTENT_KW COLON scoped_token -> in_content_clause
                                | BEGIN_KW COLON scoped_token -> begin_clause

token                           : word_with_wildcard
                                | expr
                                | LEFT_PAR query RIGHT_PAR -> parenthesis_clause

scoped_token                    : word_with_wildcard
                                | expr
                                | LEFT_PAR scoped_query RIGHT_PAR -> parenthesis_clause

scoped_query                    : scoped_token
                                | scoped_query AND scoped_token -> and_clause
                                | scoped_query OR scoped_token -> or_clause
                                | scoped_query NOT scoped_token -> not_clause

expr                            : QUOTE /[A-Za-z0-9\-]+/ ( /[A-Za-z0-9\-]+/)* QUOTE -> expression
                                | QUOTE /[A-Za-z0-9\-]+/ ( /[A-Za-z0-9\-]+/)* QUOTE TILDE distance -> proximity

word_with_wildcard              : /[A-Za-z0-9\-\*\?]+/
                                | /([A-Za-z0-9\-]*\??[A-Za-z0-9\-]*)+\*/
distance                        : /[1-9][0-9]*/

QUOTE                           : "\""
AND                             : "AND"
OR                              : "OR"
NOT                             : "NOT"
LEFT_PAR                        : "("
RIGHT_PAR                       : ")"
TILDE                           : "~"
COLON                           : ":"
SPACE                           : /\s+/

INTITLE_KW                      : "intitle"i
INCONTENT_KW                    : "incontent"i
BEGIN_KW                        : "begin"i

%ignore SPACE
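
The regex terminals in the grammar use the same syntax as Python's `re` module, so they can be tried out on their own. The following is a minimal sketch using only the standard library; the constant names and the `matches` helper are hypothetical and are not part of sci_watch. It illustrates that wildcards (`*`, `?`) are accepted in a bare word but not inside a quoted expression, and that a proximity distance must be a positive integer without a leading zero.

```python
import re

# Patterns copied verbatim from the grammar's regex terminals.
# The constant names below are illustrative only.
QUOTED_WORD = re.compile(r"[A-Za-z0-9\-]+")        # words inside a quoted expr
WILDCARD_WORD = re.compile(r"[A-Za-z0-9\-\*\?]+")  # first alternative of word_with_wildcard
DISTANCE = re.compile(r"[1-9][0-9]*")              # distance after "~" in a proximity expr


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True when the whole string is matched by the terminal's pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for word in ("transformer", "transform*", "wom?n", "state-of-the-art"):
        print(word, matches(WILDCARD_WORD, word), matches(QUOTED_WORD, word))
    for dist in ("1", "10", "0", "05"):
        print(dist, matches(DISTANCE, dist))
```

This only checks the terminals in isolation. How the lexer chooses between overlapping terminals (for example a keyword such as AND versus a plain word) is decided by Lark and is not modelled here.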

The entry point (start rule) of the grammar is query.
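
As a rough illustration, a parser could be built from this grammar along the following lines. This is a sketch, not the project's actual loading code: it assumes the `lark` package is installed and that the script runs from the repository root. `build_parser` and the sample query are hypothetical names chosen for the example.

```python
from pathlib import Path

# Path taken from the text above; adjust it to where the repository is checked out.
GRAMMAR_PATH = Path("sci_watch/assets/grammar.lark")


def build_parser(grammar_path: Path = GRAMMAR_PATH):
    """Build a LALR(1) parser whose start rule is `query` (hypothetical helper)."""
    # Imported lazily so this module can be loaded even when Lark is not installed.
    from lark import Lark

    grammar_text = grammar_path.read_text(encoding="utf-8")
    return Lark(grammar_text, parser="lalr", start="query")


if __name__ == "__main__":
    parser = build_parser()
    # Hypothetical query: a title-scoped word combined with a quoted expression.
    tree = parser.parse('intitle:transformer AND "language model"')
    # Print the parse tree; node names follow the rule aliases (and_clause, in_title_clause, ...).
    print(tree.pretty())
```

Passing `start="query"` makes the entry point explicit, since Lark otherwise looks for a rule named `start`.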