add notes to lex.lhs
parent 91e7012c47
commit f17053c7d9
@@ -4,9 +4,37 @@ The parser uses a separate lexer for two reasons:
1. sql syntax is very awkward to parse, the separate lexer makes it
easier to handle this in most places (in some places it makes it
harder or impossible, the fix is to switch to something better than
-parsec
+parsec)

2. using a separate lexer gives a huge speed boost because it reduces
backtracking. (We could get this by making the parsing code a lot more
complex also.)

= Lexing and dialects

The main dialect differences:

symbols follow different rules in different dialects

e.g. postgresql has a flexible extensible-ready syntax for operators
which are parsed here as symbols

sql server uses [] for quoting identifiers, and so they don't parse
as symbols here (in other dialects including ansi, these are used for
array operations)

quoting of identifiers is different in different dialects

there are various other identifier differences:
ansi has :host_param
there are variants on these like in @sql_server and in #oracle

string quoting follows different rules in different dialects,
e.g. postgresql has $$ quoting

todo: public documentation on dialect definition - and dialect flags
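
The dialect differences listed above could be captured as a small set of flags that the lexer consults. The following is only a sketch: the `LexDialect` record, its field names and the three sample values are hypothetical and are not taken from the actual lexer.

```haskell
-- Hypothetical sketch: lexical dialect flags (all names are illustrative).
data LexDialect = LexDialect
    { bracketQuotedIdens  :: Bool    -- sql server style [identifier]
    , dollarStrings       :: Bool    -- postgresql $$ quoting
    , extensibleOperators :: Bool    -- postgresql style operator symbols
    , identifierPrefixes  :: [Char]  -- e.g. ':' for :host_param
    } deriving (Eq, Show)

ansiLex, postgresLex, sqlServerLex :: LexDialect
ansiLex = LexDialect False False False ":"
postgresLex = ansiLex { dollarStrings = True, extensibleOperators = True }
-- Assumption: only the '@' variant mentioned in the notes is modelled here.
sqlServerLex = ansiLex { bracketQuotedIdens = True, identifierPrefixes = "@" }

-- | Should '[' start a quoted identifier (True) or lex as a symbol (False)?
bracketStartsIdentifier :: LexDialect -> Bool
bracketStartsIdentifier = bracketQuotedIdens
```

Keeping the flags in one record would also give the public dialect documentation a single place to describe them.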

-2. using a separate lexer gives a huge speed boost

> -- | This is the module contains a Lexer for SQL.
> {-# LANGUAGE TupleSections #-}
@@ -355,3 +383,184 @@ Some helper combinators
> peekSatisfy :: (Char -> Bool) -> Parser ()
> peekSatisfy p = do
>     void $ lookAhead (satisfy p)


postgresql notes:
u&
SELECT 'foo'
'bar';
is equivalent to:
SELECT 'foobar';

SELECT 'foo' 'bar';
is invalid

(this should be in ansi also)
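
The continuation rule in the example above (string literals separated by whitespace containing a newline are joined, string literals separated only by same-line whitespace are rejected) can be sketched as a pass over a token list. The `Tok` type and `joinContinuations` are hypothetical names for this sketch, not part of the lexer.

```haskell
-- Hypothetical token type, just enough for this sketch.
data Tok = StrLit String | Space String | Other String
    deriving (Eq, Show)

-- | Join string literals separated by whitespace that contains a newline;
-- reject string literals separated only by same-line whitespace.
joinContinuations :: [Tok] -> Either String [Tok]
joinContinuations (StrLit a : Space ws : StrLit b : rest)
    | '\n' `elem` ws = joinContinuations (StrLit (a ++ b) : rest)
    | otherwise = Left "string literals on the same line cannot be joined"
joinContinuations (t : rest) = (t :) <$> joinContinuations rest
joinContinuations [] = Right []
```

Whether this belongs in the lexer or in a later pass is an open design question in these notes; the sketch only pins down the rule itself.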

definitely do major review and docs:
when can escapes and prefixes be used with syntactic string literals
when can they be combined
when can e.g. dollar quoting be used
what escaping should there be, including unicode escapes


E'string'

with a range of escapes which should appear in the dialect data type

dollar quoted strings
never with prefixes/escapes
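
A dollar quoted string can be lexed without any prefix or escape handling, which the following sketch tries to show. It is a simplification under stated assumptions: `dollarString` is a hypothetical helper, and the tag rule used here (letters, digits and underscore, not starting with a digit) is an approximation that should be checked against the postgresql documentation.

```haskell
import Data.Char (isAlphaNum, isDigit)
import Data.List (isPrefixOf)

-- | Try to lex a dollar quoted string at the start of the input.
-- Returns (tag, body, remaining input). No escapes are processed.
dollarString :: String -> Maybe (String, String, String)
dollarString ('$' : s0) =
    case span isTagChar s0 of
        (tag, '$' : s1)
            | not (any isDigit (take 1 tag)) ->
                go ('$' : tag ++ "$") tag "" s1
        _ -> Nothing
  where
    isTagChar c = isAlphaNum c || c == '_'
    -- scan forward until the closing delimiter is found
    go _ _ _ [] = Nothing -- unterminated string
    go close tag acc s@(c : cs)
        | close `isPrefixOf` s = Just (tag, reverse acc, drop (length close) s)
        | otherwise = go close tag (c : acc) cs
dollarString _ = Nothing
```

Because a tag may not start with a digit in this sketch, `$1` is left alone for the positional parameter rule.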

B''
X''

numbers

:: cast


type 'string' - literals only, not array types
ansi allows for some specific types
'string'::type
cast('string' as type)

can use dollar quoting here
typename('string') (not all types)
check these in the parser for keyword issues

extended operator rules

$1 positional parameter
()
[]
,
;
: array slices and variable names/hostparam
*
.

some operator precedence notes

SELECT 3 OPERATOR(pg_catalog.+) 4;

diff from ansi:

all the same symbols + more + different rules about parsing multi char
symbols (ansi is trivial here, postgresql is not trivial and
extensible)

identifiers: same, double-check the u&
hostparam: same, but with implementation issues because : is also a
symbol in postgresql. this might be a little tricky to deal with

string literals:
u&?, does pg support n?

numbers: same
whitespace, comments: same

make sure there is a list of lexical syntax which is valid in postgres
and not in ansi, and vice versa, and have explicit tests for
these. There might also be situations here where a string is valid in
both, but lexes differently. There are definitely cases like this in
the main syntax.


action plan:
no abstract syntax changes are needed
write down a spec for ansi and for postgresql lexical syntax
create a list of tests for postgresql
include everything from ansi which is the same: maybe refactor the
tests to make this maintainable

design for escaping issues
(affects ansi also)
design for string literal-like syntax and for continuation strings
(affects ansi also)

the test approach in general is first to parse basic examples of each
kind of token, then to manually come up with some edge cases to test,
and then to generate a good representative set of tokens (probably the
same set as the previous two categories), and create the cross product
of pairs of these tokens, eliminate the pairs which don't parse as the
two separate tokens when they are next to each other, using
manually written rules (want to be super accurate here - no false
positives or negatives), then test that these all parse correctly as
well. Separating out the lexing in this way and taking this approach, I
think, gives a very good chance of minimising bugs in the basic
parsing, especially in the hairy bits.
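
The pairwise part of the test approach above can be sketched as a list comprehension plus a manually written elimination rule. Everything here is hypothetical: the `Token` type is a cut-down stand-in, and `merges` is deliberately coarse and incomplete, whereas the real rules would need to be exact.

```haskell
-- Hypothetical, cut-down token type for this sketch.
data Token = Identifier String | Number String | Symbol String
    deriving (Eq, Show)

pretty :: Token -> String
pretty (Identifier s) = s
pretty (Number s) = s
pretty (Symbol s) = s

-- | Manually written rule: True when the two tokens, written next to each
-- other with no whitespace, would not lex back as the same two tokens.
-- Illustrative only: it is coarser than a real rule set should be.
merges :: Token -> Token -> Bool
merges (Symbol _) (Symbol _) = True -- e.g. < then = lexes as <=
merges (Symbol _) _ = False
merges _ (Symbol _) = False
merges _ _ = True -- e.g. a then 1 lexes as a1

-- | Source text and expected tokens for every ordered pair that survives.
pairTests :: [Token] -> [(String, [Token])]
pairTests ts =
    [ (pretty a ++ pretty b, [a, b])
    | a <- ts, b <- ts, not (merges a b) ]
```

Each generated source string would then be run through the lexer and compared with its expected token list.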


= lexical syntax

One possible gotcha: there isn't a one-to-one correspondence between e.g.
identifiers and string literals in the lexical syntax, and identifiers
and string literals in the main syntax.

== ansi

=== symbol
=== identifier
+ escaping
=== quoted identifier
+ escaping, prefixes
=== host param

=== string literal-like
+ escaping, prefixes

=== number literals

=== whitespace

=== comments

== postgresql

=== symbol
extended set of symbols + extensibility + special cases
: is a symbol and also part of host param

=== identifier

same as ansi? is the character set the same?

=== quoted identifier

same as ansi?

=== host param

same as ansi (check char set)

=== string literal-like

dollar quoting
E quoting
missing n'?


=== number literals

same as ansi, i think

=== whitespace

same as ansi

=== comments

same as ansi

=== additions
$1 positional parameter

---- find what else is in hssqlppp to support mysql, oracle, sql
server