more docs in Parser.lhs

2014-04-19 15:10:45 +03:00
commit 59826ecce2
parent ddfac442ab
1 changed files with 215 additions and 97 deletions
--- a/Language/SQL/SimpleSQL/Parser.lhs
+++ b/Language/SQL/SimpleSQL/Parser.lhs
@ -1,44 +1,8 @@
-Notes about the parser:
+= TOC:
 The lexers appear at the bottom of the file. The aim is a clear
 separation between the lexers and the other parsers, which should only
 use the lexers; this isn't 100% complete at the moment and needs fixing.
 Left factoring:
 The parsing code is aggressively left factored, and try is avoided as
 much as possible. Use of try often makes the code hard to follow, so
 this has helped the readability of the code a bit. More importantly,
 debugging the parser and generating good parse error messages is aided
 greatly by left factoring. Apparently it can also help the speed but
 this hasn't been looked into.
 Error messages:
 A lot of care has been given to generating good error messages. There
 are a few utils below which partially help in this area. There is also
 a plan to write a really simple expression parser which doesn't do
 precedence and associativity, and then fix these with a pass over the
 ast. I don't think there is any other way to sanely handle the common
 prefixes between many infix and postfix multiple keyword operators,
 and some other ambiguities also. This should help a lot in generating
 good error messages also.
 There is a set of crafted bad expressions in ErrorMessages.lhs. These
 are used to gauge the quality of the error messages and monitor
 regressions by hand. The use of <?> is limited as much as possible,
 since unthinking liberal sprinkling of it seems to make the error
 messages much worse, and also has a similar problem to gratuitous use
 of try - you can't easily tell which appearances are important and
 which aren't.
 Both the left factoring and error message work are greatly complicated
 by the large number of shared prefixes of the various elements in SQL
 syntax.
 TOC:
 notes
 Public api
 Names - parsing identifiers
 Typenames
 Value expressions
@ -60,8 +24,156 @@ query expressions
  common table expressions
  query expression
  set operations
 lexers
 utilities
 = Notes about the code
 The lexers appear at the bottom of the file. The aim is a clear
 separation between the lexers and the other parsers, which should only
 use the lexers; this isn't 100% complete at the moment and needs fixing.
 == Left factoring
 The parsing code is aggressively left factored, and try is avoided as
 much as possible. Try is avoided because:
 * when it is overused it makes the code hard to follow
 * when it is overused it makes the parsing code harder to debug
 * it makes the parser error messages much worse
 The code could be made a bit simpler with a few extra 'trys', but this
 isn't done because of the impact on the parser error
 messages. Apparently left factoring can also help the speed, but this
 hasn't been looked into.
 == Parser error messages
 A lot of care has been given to generating good parser error messages
 for invalid syntax. There are a few utils below which partially help
 in this area.
 There is a set of crafted bad expressions in ErrorMessages.lhs. These
 are used to gauge the quality of the error messages and monitor
 regressions by hand. The use of <?> is limited as much as possible:
 each instance should justify itself by improving an actual error
 message.
 There is also a plan to write a really simple expression parser which
 doesn't do precedence and associativity, and then fix these with a
 pass over the ast. I don't think there is any other way to sanely
 handle the common prefixes between many infix and postfix multiple
 keyword operators, and some other ambiguities also. This should also
 help a lot in generating good error messages.
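As a rough illustration of the "parse flat, fix up later" plan, here is a minimal sketch of the fix-up pass on its own. Everything in it is an assumption made for the example: the Expr and Flat types, the prec table and the fixup function are hypothetical, all operators are treated as binary and left associative, and it says nothing about how multiple keyword operators would be handled.

```haskell
-- Hypothetical miniature AST, not the one used in Parser.lhs.
data Expr = Num Int
          | BinOp String Expr Expr
            deriving (Eq, Show)

-- What a precedence-unaware parser could produce: the first operand
-- followed by (operator, operand) pairs in source order.
data Flat = Flat Expr [(String, Expr)]

-- Assumed precedence table: higher binds tighter.
prec :: String -> Int
prec "*" = 2
prec "/" = 2
prec _   = 1   -- "+", "-" and anything else

-- Rebuild the tree with precedence climbing; every operator is
-- treated as left associative.
fixup :: Flat -> Expr
fixup (Flat e0 rest0) = fst (go 0 e0 rest0)
  where
    go :: Int -> Expr -> [(String, Expr)] -> (Expr, [(String, Expr)])
    go minP lhs ((op, rhs) : more)
        | prec op >= minP =
            -- collect everything binding tighter than op into the rhs
            let (rhs', more') = go (prec op + 1) rhs more
            in go minP (BinOp op lhs rhs') more'
    go _ lhs more = (lhs, more)
```

With this sketch, the flat form of 1 + 2 * 3 - 4 is rebuilt as (1 + (2 * 3)) - 4. The parser itself then only needs to recognise operands and operators in sequence.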
 Both the left factoring and error message work are greatly complicated
 by the large number of shared prefixes of the various elements in SQL
 syntax.
 == Main left factoring issues
 There are three big areas which are tricky to left factor:
 * typenames
 * value expressions which can start with an identifier
 * infix and suffix operators
 === typenames
 There are a number of variations of typename syntax. The standard
 deals with this by switching on the name of the type which is parsed
 first. This code doesn't do this currently, but might in the
 future. Taking the approach in the standard grammar will limit the
 extensibility of the parser and might affect the ease of adapting to
 support other sql dialects.
 === identifier value expressions
 There are a lot of value expression nodes which start with
 identifiers, and can't be distinguished until the tokens after the
 initial identifier are parsed. Using try to implement these variations
 is very simple but makes the code much harder to debug and makes the
 parser error messages really bad.
 Here is a list of these nodes:
 * identifiers
 * function application
 * aggregate application
 * window application
 * typed literal: typename 'literal string'
 * interval literal which is like the typed literal with some extras
 There is further ambiguity e.g. with typed literals with precision,
 functions, aggregates, etc. - these are an identifier, followed by
 parens comma separated value expressions or something similar, and it
 is only later that we can find a token which tells us which flavour it
 is.
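To make the trade-off concrete, here is a minimal sketch comparing a try based parser with a left factored one for two of the nodes above (plain identifiers and function application). It uses Parsec, but the Expr type, the small lexer helpers and the names exprTry, exprLF, parseAll and errorColumn are all hypothetical and much simpler than the real code in this file.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical miniature AST, not the one used in Parser.lhs.
data Expr = Iden String
          | App String [Expr]
            deriving (Eq, Show)

-- Tiny stand-in lexers: parse a token, then skip trailing whitespace.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

identifier :: Parser String
identifier = lexeme (many1 letter)

symbol :: Char -> Parser Char
symbol = lexeme . char

args :: Parser Expr -> Parser [Expr]
args e = between (symbol '(') (symbol ')') (e `sepBy` symbol ',')

-- try version: attempt a whole function application and, if anything
-- inside it fails, backtrack and reparse as a plain identifier.
exprTry :: Parser Expr
exprTry = try (App <$> identifier <*> args exprTry)
          <|> (Iden <$> identifier)

-- Left factored version: parse the shared identifier prefix once,
-- then decide based on what follows.
exprLF :: Parser Expr
exprLF = do
    i <- identifier
    option (Iden i) (App i <$> args exprLF)

parseAll :: Parser Expr -> String -> Either ParseError Expr
parseAll p = parse (p <* eof) ""

-- Column at which a failed parse reports its error, if it fails.
errorColumn :: Parser Expr -> String -> Maybe Int
errorColumn p s =
    either (Just . sourceColumn . errorPos) (const Nothing) (parseAll p s)
```

Both parsers accept the same valid input. For the invalid input f(a,) the left factored parser reports the error at the closing paren, where an argument is missing. The try version backtracks, accepts f as a plain identifier, and then reports the error at the opening paren, which hides the real problem.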
 There is also a set of nodes which start with an identifier/keyword
 but can commit since no other syntax can start the same way:
 * case
 * cast
 * exists, unique subquery
 * array constructor
 * multiset constructor
 * all the special syntax functions: extract, position, substring,
  convert, translate, overlay, trim, etc.
 The interval literal mentioned above is treated in this group at the
 moment: if we see 'interval' we parse it either as a full interval
 literal or a typed literal only.
 Some items in this list might have to be fixed in the future, e.g. to
 support standard 'substring(a from 3 for 5)' as well as regular
 function substring syntax 'substring(a,3,5)' at the same time.
 The work in left factoring all this is mostly done, but there is still
 a substantial bit to complete and this is by far the most difficult
 bit. At the moment, the workaround is to use try, the downside of
 which is poor parser error messages.
 === infix and suffix operators
 == permissiveness
 The parser is very permissive in many ways. This departs from the
 standard, whose grammar rules out a number of possibilities which
 this parser allows. This is done for a number of reasons:
 * it makes the parser simpler - fewer variations
 * it should allow for dialects and extensibility more easily in the
  future (e.g. new infix binary operators with custom precedence)
 * many things which are effectively checked in the grammar in the
  standard, can be checked using a typechecker or other simple static
  analysis
 To use this code as a front end for a sql engine, or as a sql validity
 checker, you will need to do a lot of checks on the ast. A
 typechecker/static checker plus annotation to support being a compiler
 front end is planned but not likely to happen too soon.
 Some of the areas this affects:
 typenames: the variation of the type name should switch on the actual
 name given according to the standard, but this code only does this for
 the special case of interval type names. E.g. you can write 'int
 collate C' or 'int(15,2)' and this will parse as a character type name
 or a precision scale type name instead of being rejected.
 value expressions: every variation on value expressions uses the same
 parser/syntax. This means we don't try to stop non boolean valued
 expressions in boolean valued contexts in the parser. Another area
 this affects is that we allow general value expressions in group by,
 whereas the standard only allows column names with optional collation.
 These are all areas which are specified (roughly speaking) in the
 syntax rather than the semantics in the standard, and we are not
 fixing them in the syntax but leaving them till the semantic checking
 (which doesn't exist in this code at this time).
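As an example of the kind of check that is left to a later pass, here is a minimal sketch of a group by check written against a hypothetical miniature AST. The Expr type and the nonStandardGroupBy function are assumptions made for illustration only; they are not part of this code, and the sketch ignores the optional collation which the standard allows.

```haskell
-- Hypothetical miniature AST, not the ValueExpr type from this library.
data Expr = Iden String
          | NumLit Int
          | BinOp String Expr Expr
            deriving (Eq, Show)

-- Return the group by items which are not plain column names, i.e. the
-- ones a stricter, standard-style grammar would have rejected.
nonStandardGroupBy :: [Expr] -> [Expr]
nonStandardGroupBy = filter (not . isColumn)
  where
    isColumn (Iden _) = True
    isColumn _        = False
```

A validity checker could report every item returned by such a function, while the parser itself stays permissive.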
 > {-# LANGUAGE TupleSections #-}
 > -- | This is the module with the parser functions.
 > module Language.SQL.SimpleSQL.Parser
@ -454,9 +566,8 @@ See the stringToken lexer below for notes on string literal syntax.
 === star
 used in select *, select x.*, and agg(*) variations, and some other
-places as well. Because it is quite general, the parser doesn't
+places as well. The parser doesn't attempt to check that the star is
-attempt to check that the star is in a valid context, it parses it OK
+in a valid context, it parses it OK in any value expression context.
 in any value expression context.
 > star :: Parser ValueExpr
 > star = Star <$ symbol "*"
@ -488,7 +599,8 @@ value expression parens, row ctor and scalar subquery
 == case, cast, exists, unique, array/multiset constructor, interval
-All of these start with a fixed keyword which is reserved.
+All of these start with a fixed keyword which is reserved, so no other
 syntax can start with the same keyword.
 === case expression
@ -1037,6 +1149,8 @@ expose the b expression for window frame clause range between
 == helper parsers
 This is used in interval literals and in interval type names.
 > intervalQualifier :: Parser (IntervalTypeField,Maybe IntervalTypeField)
 > intervalQualifier =
 >     (,) <$> intervalField
@ -1049,7 +1163,7 @@ expose the b expression for window frame clause range between
 >             (parens ((,) <$> unsignedInteger
 >                          <*> optionMaybe (comma *> unsignedInteger)))
-TODO: use this in extract
+TODO: use datetime field in extract also
 use a data type for the datetime field?
 > datetimeField :: Parser String
@ -1057,6 +1171,9 @@ use a data type for the datetime field?
 >                                     ,"hour","minute","second"])
 >                 <?> "datetime field"
 This is used in multiset operations (value expr), selects (query expr)
 and set operations (query expr).
 > duplicates :: Parser (Maybe SetQuantifier)
 > duplicates = optionMaybe $
 >     choice [All <$ keyword_ "all"
@ -1100,7 +1217,7 @@ tref
 >                  <$> parens (commaSep valueExpr)
 >                 ,return $ TRSimple n]]
 >         >>= optionSuffix aliasSuffix
->     aliasSuffix j = option j (TRAlias j <$> alias)
+>     aliasSuffix j = option j (TRAlias j <$> fromAlias)
 >     joinTrefSuffix t =
 >         (TRJoin t <$> option False (True <$ keyword_ "natural")
 >                   <*> joinType
@ -1108,7 +1225,8 @@ tref
 >                   <*> optionMaybe joinCondition)
 >         >>= optionSuffix joinTrefSuffix
-TODO: factor the join stuff to produce better error messages
+TODO: factor the join stuff to produce better error messages (and make
 it more readable)
 > joinType :: Parser JoinType
 > joinType = choice
@ -1126,13 +1244,12 @@ TODO: factor the join stuff to produce better error messages
 >     ,JInner <$ keyword_ "join"]
 > joinCondition :: Parser JoinCondition
-> joinCondition =
+> joinCondition = choice
->     choice [keyword_ "on" >> JoinOn <$> valueExpr
+>     [keyword_ "on" >> JoinOn <$> valueExpr
->            ,keyword_ "using" >> JoinUsing <$> parens (commaSep1 name)
+>     ,keyword_ "using" >> JoinUsing <$> parens (commaSep1 name)]
 >            ]
-> alias :: Parser Alias
+> fromAlias :: Parser Alias
-> alias = Alias <$> tableAlias <*> columnAliases
+> fromAlias = Alias <$> tableAlias <*> columnAliases
 >   where
 >     tableAlias = optional (keyword_ "as") *> name
 >     columnAliases = optionMaybe $ parens $ commaSep1 name
@ -1146,11 +1263,9 @@ pretty trivial.
 > whereClause = keyword_ "where" *> valueExpr
 > groupByClause :: Parser [GroupingExpr]
-> groupByClause = keywords_ ["group","by"]
+> groupByClause = keywords_ ["group","by"] *> commaSep1 groupingExpression
 >            *> commaSep1 groupingExpression
 >   where
->     groupingExpression =
+>     groupingExpression = choice
 >       choice
 >       [keyword_ "cube" >>
 >        Cube <$> parens (commaSep groupingExpression)
 >       ,keyword_ "rollup" >>
@ -1204,9 +1319,8 @@ allows offset and fetch in either order
 >     With <$> option False (True <$ keyword_ "recursive")
 >          <*> commaSep1 withQuery <*> queryExpr
 >   where
->     withQuery =
+>     withQuery = (,) <$> (fromAlias <* keyword_ "as")
->         (,) <$> (alias <* keyword_ "as")
+>                     <*> parens queryExpr
 >             <*> parens queryExpr
 == query expression
@ -1214,10 +1328,9 @@ This parser parses any query expression variant: normal select, cte,
 and union, etc..
 > queryExpr :: Parser QueryExpr
-> queryExpr =
+> queryExpr = choice
->   choice [with
+>     [with
->          ,choice [values,table, select]
+>     ,choice [values,table, select] >>= optionSuffix queryExprSuffix]
 >           >>= optionSuffix queryExprSuffix]
 >   where
 >     select = keyword_ "select" >>
 >         mkSelect
@ -1247,45 +1360,44 @@ be in the public syntax?
 >       ,_teFetchFirst :: Maybe ValueExpr}
 > tableExpression :: Parser TableExpression
-> tableExpression =
+> tableExpression = mkTe <$> from
->    mkTe <$> from
+>                        <*> optionMaybe whereClause
->         <*> optionMaybe whereClause
+>                        <*> option [] groupByClause
->         <*> option [] groupByClause
+>                        <*> optionMaybe having
->         <*> optionMaybe having
+>                        <*> option [] orderBy
->         <*> option [] orderBy
+>                        <*> offsetFetch
 >         <*> offsetFetch
 >  where
 >     mkTe f w g h od (ofs,fe) =
 >         TableExpression f w g h od ofs fe
 > queryExprSuffix :: QueryExpr -> Parser QueryExpr
-> queryExprSuffix qe =
+> queryExprSuffix qe = cqSuffix >>= optionSuffix queryExprSuffix
->     (CombineQueryExpr qe
+>   where
->      <$> (choice
+>     cqSuffix = CombineQueryExpr qe
->          [Union <$ keyword_ "union"
+>                <$> setOp
->          ,Intersect <$ keyword_ "intersect"
+>                <*> (fromMaybe SQDefault <$> duplicates)
->          ,Except <$ keyword_ "except"] <?> "set operator")
+>                <*> corr
->      <*> (fromMaybe SQDefault <$> duplicates)
+>                <*> queryExpr
->      <*> option Respectively
+>     setOp = choice [Union <$ keyword_ "union"
->                 (Corresponding <$ keyword_ "corresponding")
+>                    ,Intersect <$ keyword_ "intersect"
->      <*> queryExpr)
+>                    ,Except <$ keyword_ "except"]
->     >>= optionSuffix queryExprSuffix
+>             <?> "set operator"
 >     corr = option Respectively (Corresponding <$ keyword_ "corresponding")
 wrapper for query expr which ignores optional trailing semicolon.
 > topLevelQueryExpr :: Parser QueryExpr
-> topLevelQueryExpr =
+> topLevelQueryExpr = queryExpr >>= optionSuffix ((semi *>) . return)
 >      queryExpr >>= optionSuffix ((semi *>) . return)
 wrapper to parse a series of query exprs from a single source. They
 must be separated by semicolon, but for the last expression, the
 trailing semicolon is optional.
 > queryExprs :: Parser [QueryExpr]
-> queryExprs =
+> queryExprs = (:[]) <$> queryExpr
->     (:[]) <$> queryExpr
+>              >>= optionSuffix ((semi *>) . return)
->     >>= optionSuffix ((semi *>) . return)
+>              >>= optionSuffix (\p -> (p++) <$> queryExprs)
 >     >>= optionSuffix (\p -> (p++) <$> queryExprs)
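The optionSuffix chaining used in queryExprs can be tried out in isolation with the following sketch. The definition of optionSuffix shown here is an assumption (its real definition is not part of this diff), and item and items are hypothetical stand-ins for queryExpr and queryExprs.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Assumed definition: run the suffix parser on the value parsed so
-- far, and fall back to that value if the suffix doesn't match.
optionSuffix :: (a -> Parser a) -> a -> Parser a
optionSuffix p a = option a (p a)

semi :: Parser Char
semi = char ';' <* spaces

-- Hypothetical stand-in for queryExpr: a single word.
item :: Parser String
item = many1 letter <* spaces

-- Same shape as queryExprs: one item, an optional semicolon, then
-- optionally more items.
items :: Parser [String]
items = (:[]) <$> item
        >>= optionSuffix ((semi *>) . return)
        >>= optionSuffix (\p -> (p++) <$> items)

parseItems :: String -> Either ParseError [String]
parseItems = parse (items <* eof) ""
```

In this sketch the trailing semicolon is optional, as intended, but the semicolon between two items is optional as well: "a b" is accepted as two items. Whether the real queryExprs behaves the same way depends on the actual optionSuffix and queryExpr definitions.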
 ----------------------------------------------
@ -1373,15 +1485,15 @@ making a decision on how to represent numbers, the client code can
 make this choice.
 > numberLiteral :: Parser String
-> numberLiteral = lexeme (
+> numberLiteral =
->     (choice [int
+>     lexeme (numToken <* notFollowedBy (alphaNum <|> char '.'))
 >             >>= optionSuffix dot
 >             >>= optionSuffix fracts
 >             >>= optionSuffix expon
 >            ,fract "" >>= optionSuffix expon])
 >     <* notFollowedBy (alphaNum <|> char '.'))
 >     <?> "number literal"
 >   where
 >     numToken = choice [int
 >                        >>= optionSuffix dot
 >                        >>= optionSuffix fracts
 >                        >>= optionSuffix expon
 >                       ,fract "" >>= optionSuffix expon]
 >     int = many1 digit
 >     fract p = dot p >>= fracts
 >     dot p = (p++) <$> string "."
@ -1480,6 +1592,10 @@ todo: work out the symbol parsing better
 >         ,-- handle string in separate parts
 >          -- e.g. 'part 1' 'part 2'
 >          do --can this whitespace be factored out?
 >             -- since it will be parsed twice when there is no more literal
 >             -- yes: split the adjacent quote and multiline literal
 >             -- into two different suffixes
 >             -- won't need to call lexeme at the top level anymore after this
 >          try (whitespace <* nlquote)
 >          s <- manyTill anyChar nlquote
 >          optionSuffix moreString (s0 ++ s)
@ -1552,7 +1668,10 @@ an optional alias, e.g. select a a from t. If we write select a from
 t, we have to make sure the from isn't parsed as an alias. I'm not
 sure what other places strictly need the blacklist, and in theory it
 could be tuned differently for each place the identifierString/
-identifier parsers are used to only blacklist the bare minimum.
+identifier parsers are used to only blacklist the bare
 minimum. Something like this might be needed for dialect support, even
 if it is pretty silly to use a keyword as an unquoted identifier when
 there is an effing quoting syntax as well.
 The standard has a weird mix of reserved keywords and unreserved
 keywords (I'm not sure what exactly being an unreserved keyword
@ -2082,8 +2201,7 @@ means).
 >     {peErrorString = show e
 >     ,peFilename = sourceName p
 >     ,pePosition = (sourceLine p, sourceColumn p)
->     ,peFormattedError = formatError src e
+>     ,peFormattedError = formatError src e}
 >     }
 >   where
 >     p = errorPos e