2015-07-31 23:04:18 +02:00
2015-08-01 12:22:07 +02:00
The parser uses a separate lexer for two reasons:
2015-07-31 23:04:18 +02:00
2015-08-01 12:22:07 +02:00
1. sql syntax is very awkward to parse, the separate lexer makes it
easier to handle this in most places (in some places it makes it
harder or impossible, the fix is to switch to something better than
2016-02-13 10:40:49 +01:00
parsec)
2. using a separate lexer gives a huge speed boost because it reduces
backtracking. (We could get this by making the parsing code a lot more
complex also.)
= Lexing and dialects
The main dialect differences:
symbols follow different rules in different dialects
e.g. postgresql has a flexible extensible-ready syntax for operators
which are parsed here as symbols
sql server using [] for quoting identifiers, and so they don't parse
as symbols here (in other dialects including ansi, these are used for
array operations)
quoting of identifiers is different in different dialects
there are various other identifier differences:
ansi has :host_param
there are variants on these like in @sql_server adn in #oracle
string quoting follows different rules in different dialects,
e.g. postgresql has $$ quoting
todo: public documentation on dialect definition - and dialect flags
2015-07-31 23:04:18 +02:00
2015-08-01 11:13:53 +02:00
2015-07-31 23:04:18 +02:00
> -- | This is the module contains a Lexer for SQL.
> {-# LANGUAGE TupleSections #-}
2016-02-12 11:22:19 +01:00
> module Language.SQL.SimpleSQL.Lex
2015-08-01 11:13:53 +02:00
> (Token(..)
> ,lexSQL
2015-07-31 23:04:18 +02:00
> ,prettyToken
> ,prettyTokens
> ,ParseError(..)
2016-02-15 19:20:24 +01:00
> ,Dialect(..)
> ,tokensWillPrintAndLex
> ,tokenListWillPrintAndLex
> ) where
2015-07-31 23:04:18 +02:00
2016-02-12 11:51:06 +01:00
> import Language.SQL.SimpleSQL.Dialect
2015-07-31 23:04:18 +02:00
> import Text.Parsec (option,string,manyTill,anyChar
> ,try,string,many1,oneOf,digit,(<|>),choice,char,eof
2015-08-01 11:13:53 +02:00
> ,many,runParser,lookAhead,satisfy
> ,setPosition,getPosition
> ,setSourceColumn,setSourceLine
2015-08-02 14:29:35 +02:00
> ,sourceName, setSourceName
2016-02-13 19:28:12 +01:00
> ,sourceLine, sourceColumn
> ,notFollowedBy)
2015-07-31 23:04:18 +02:00
> import Language.SQL.SimpleSQL.Combinators
> import Language.SQL.SimpleSQL.Errors
> import Control.Applicative hiding ((<|>), many)
> import Data.Char
> import Control.Monad
> import Prelude hiding (takeWhile)
> import Text.Parsec.String (Parser)
2015-08-01 11:13:53 +02:00
> import Data.Maybe
2015-07-31 23:04:18 +02:00
> -- | Represents a lexed token
> data Token
2016-02-15 19:20:24 +01:00
> -- | A symbol (in ansi dialect) is one of the following
2015-08-01 11:13:53 +02:00
> --
2015-07-31 23:04:18 +02:00
> -- * multi char symbols <> <= >= != ||
> -- * single char symbols: * + - < > ^ / % ~ & | ? ( ) [ ] , ; ( )
> --
> = Symbol String
>
2016-02-12 13:13:47 +01:00
> -- | This is an identifier or keyword. The first field is
> -- the quotes used, or nothing if no quotes were used. The quotes
> -- can be " or u& or something dialect specific like []
> | Identifier (Maybe (String,String)) String
2016-02-15 19:32:26 +01:00
>
2015-07-31 23:04:18 +02:00
> -- | This is a host param symbol, e.g. :param
> | HostParam String
2016-02-15 19:32:26 +01:00
>
> -- | This is a prefixed variable symbol, such as @var or #var (not used in ansi dialect)
> | PrefixedVariable Char String
>
2016-02-13 15:31:20 +01:00
> -- | This is a positional arg identifier e.g. $1
> | PositionalArg Int
2015-07-31 23:04:18 +02:00
>
2016-02-12 12:09:58 +01:00
> -- | This is a string literal. The first two fields are the --
> -- start and end quotes, which are usually both ', but can be
> -- the character set (one of nNbBxX, or u&, U&), or a dialect
> -- specific string quoting (such as $$ in postgres)
> | SqlString String String String
2016-02-15 19:32:26 +01:00
>
2015-08-01 11:13:53 +02:00
> -- | A number literal (integral or otherwise), stored in original format
2015-07-31 23:04:18 +02:00
> -- unchanged
> | SqlNumber String
>
2015-08-01 11:13:53 +02:00
> -- | Whitespace, one or more of space, tab or newline.
2015-07-31 23:04:18 +02:00
> | Whitespace String
>
2015-08-01 11:13:53 +02:00
> -- | A commented line using --, contains every character starting with the
2015-07-31 23:04:18 +02:00
> -- \'--\' and including the terminating newline character if there is one
> -- - this will be missing if the last line in the source is a line comment
> -- with no trailing newline
> | LineComment String
>
2015-08-01 11:13:53 +02:00
> -- | A block comment, \/* stuff *\/, includes the comment delimiters
2015-07-31 23:04:18 +02:00
> | BlockComment String
>
> deriving (Eq,Show)
2015-08-01 12:22:07 +02:00
> -- | Pretty printing, if you lex a bunch of tokens, then pretty
> -- print them, should should get back exactly the same string
2015-07-31 23:04:18 +02:00
> prettyToken :: Dialect -> Token -> String
> prettyToken _ (Symbol s) = s
2016-02-12 13:13:47 +01:00
> prettyToken _ (Identifier Nothing t) = t
2016-02-13 14:54:40 +01:00
> prettyToken _ (Identifier (Just (q1,q2)) t) = q1 ++ t ++ q2
2015-07-31 23:04:18 +02:00
> prettyToken _ (HostParam p) = ':':p
2016-02-15 19:32:26 +01:00
> prettyToken _ (PrefixedVariable c p) = c:p
2016-02-13 15:31:20 +01:00
> prettyToken _ (PositionalArg p) = '$':show p
2016-02-13 14:54:40 +01:00
> prettyToken _ (SqlString s e t) = s ++ t ++ e
2015-07-31 23:04:18 +02:00
> prettyToken _ (SqlNumber r) = r
> prettyToken _ (Whitespace t) = t
> prettyToken _ (LineComment l) = l
> prettyToken _ (BlockComment c) = c
> prettyTokens :: Dialect -> [Token] -> String
> prettyTokens d ts = concat $ map (prettyToken d) ts
TODO: try to make all parsers applicative only
2015-08-01 12:22:07 +02:00
> -- | Lex some SQL to a list of tokens.
2015-08-01 11:13:53 +02:00
> lexSQL :: Dialect
> -- ^ dialect of SQL to use
> -> FilePath
> -- ^ filename to use in error messages
> -> Maybe (Int,Int)
> -- ^ line number and column number of the first character
> -- in the source to use in error messages
> -> String
> -- ^ the SQL source to lex
> -> Either ParseError [((String,Int,Int),Token)]
2015-08-02 14:29:35 +02:00
> lexSQL dialect fn' p src =
> let (l',c') = fromMaybe (1,1) p
2015-08-01 11:13:53 +02:00
> in either (Left . convParseError src) Right
2015-08-02 14:29:35 +02:00
> $ runParser (setPos (fn',l',c') *> many (sqlToken dialect) <* eof) () fn' src
2015-07-31 23:04:18 +02:00
> where
2015-08-02 14:29:35 +02:00
> setPos (fn,l,c) = do
> fmap (flip setSourceName fn
> . flip setSourceLine l
> . flip setSourceColumn c) getPosition
> >>= setPosition
2015-07-31 23:04:18 +02:00
> -- | parser for a sql token
2015-08-01 11:13:53 +02:00
> sqlToken :: Dialect -> Parser ((String,Int,Int),Token)
> sqlToken d = do
> p' <- getPosition
2015-08-02 14:29:35 +02:00
> let p = (sourceName p',sourceLine p', sourceColumn p')
2015-08-01 17:08:54 +02:00
The order of parsers is important: strings and quoted identifiers can
start out looking like normal identifiers, so we try to parse these
first and use a little bit of try. Line and block comments start like
symbols, so we try these before symbol. Numbers can start with a . so
this is also tried before symbol (a .1 will be parsed as a number, but
. otherwise will be parsed as a symbol).
2015-07-31 23:04:18 +02:00
> (p,) <$> choice [sqlString d
> ,identifier d
> ,hostParam d
> ,lineComment d
> ,blockComment d
> ,sqlNumber d
2016-02-13 15:31:20 +01:00
> ,positionalArg d
2016-02-15 19:31:06 +01:00
> ,dontParseEndBlockComment d
2016-02-15 19:32:26 +01:00
> ,prefixedVariable d
2015-07-31 23:04:18 +02:00
> ,symbol d
> ,sqlWhitespace d]
2015-08-01 17:08:54 +02:00
Parses identifiers:
simple_identifier_23
u&"unicode quoted identifier"
"quoted identifier"
"quoted identifier "" with double quote char"
`mysql quoted identifier`
2015-07-31 23:04:18 +02:00
> identifier :: Dialect -> Parser Token
> identifier d =
> choice
2016-02-12 13:13:47 +01:00
> [Identifier (Just ("\"","\"")) <$> qiden
2015-08-01 17:08:54 +02:00
> -- try is used here to avoid a conflict with identifiers
> -- and quoted strings which also start with a 'u'
2016-02-12 13:13:47 +01:00
> ,Identifier (Just ("u&\"","\"")) <$> (try (string "u&") *> qiden)
> ,Identifier (Just ("U&\"","\"")) <$> (try (string "U&") *> qiden)
> ,Identifier Nothing <$> identifierString
2016-02-12 12:09:58 +01:00
> -- todo: dialect protection
2016-02-15 19:33:11 +01:00
> ,guard (diSyntaxFlavour d == MySQL) >>
> Identifier (Just ("`","`"))
> <$> (char '`' *> takeWhile1 (/='`') <* char '`')
> ,guard (diSyntaxFlavour d == SQLServer) >>
> Identifier (Just ("[","]"))
> <$> (char '[' *> takeWhile1 (`notElem` "[]") <* char ']')
2015-07-31 23:04:18 +02:00
> ]
> where
> qiden = char '"' *> qidenSuffix ""
> qidenSuffix t = do
> s <- takeTill (=='"')
> void $ char '"'
> -- deal with "" as literal double quote character
> choice [do
> void $ char '"'
2016-02-13 14:54:40 +01:00
> qidenSuffix $ concat [t,s,"\"\""]
2015-07-31 23:04:18 +02:00
> ,return $ concat [t,s]]
2016-02-15 19:33:11 +01:00
2015-07-31 23:04:18 +02:00
2015-08-01 17:08:54 +02:00
This parses a valid identifier without quotes.
2015-07-31 23:04:18 +02:00
> identifierString :: Parser String
> identifierString =
2016-02-15 19:34:04 +01:00
> startsWith (\c -> c == '_' || isAlpha c) isIdentifierChar
2015-07-31 23:04:18 +02:00
2016-02-15 19:34:04 +01:00
this can be moved to the dialect at some point
> isIdentifierChar :: Char -> Bool
> isIdentifierChar c = c == '_' || isAlphaNum c
2015-07-31 23:04:18 +02:00
2015-08-01 17:08:54 +02:00
Parse a SQL string. Examples:
'basic string'
'string with '' a quote'
n'international text'
b'binary string'
x'hexidecimal string'
2015-07-31 23:04:18 +02:00
> sqlString :: Dialect -> Parser Token
2016-02-13 16:07:27 +01:00
> sqlString d = dollarString <|> csString <|> normalString
2015-07-31 23:04:18 +02:00
> where
2016-02-13 16:07:27 +01:00
> dollarString = do
> guard $ diSyntaxFlavour d == Postgres
> -- use try because of ambiguity with symbols and with
> -- positional arg
> s <- choice
> [do
> i <- try (char '$' *> identifierString <* char '$')
> return $ "$" ++ i ++ "$"
> ,try (string "$$")
> ]
> str <- manyTill anyChar (try $ string s)
> return $ SqlString s s str
> normalString = SqlString "'" "'" <$> (char '\'' *> normalStringSuffix False "")
> normalStringSuffix allowBackslash t = do
> s <- takeTill $ if allowBackslash
> then (`elem` "'\\")
> else (== '\'')
2016-02-13 15:31:20 +01:00
> -- deal with '' or \' as literal quote character
> choice [do
> ctu <- choice ["''" <$ try (string "''")
> ,"\\'" <$ string "\\'"
> ,"\\" <$ char '\\']
2016-02-13 16:07:27 +01:00
> normalStringSuffix allowBackslash $ concat [t,s,ctu]
2016-02-13 15:31:20 +01:00
> ,concat [t,s] <$ char '\'']
2015-08-01 17:08:54 +02:00
> -- try is used to to avoid conflicts with
> -- identifiers which can start with n,b,x,u
> -- once we read the quote type and the starting '
> -- then we commit to a string
2016-02-13 14:54:40 +01:00
> -- it's possible that this will reject some valid syntax
> -- but only pathalogical stuff, and I think the improved
> -- error messages and user predictability make it a good
> -- pragmatic choice
2016-02-13 16:07:27 +01:00
> csString
> | diSyntaxFlavour d == Postgres =
> choice [SqlString <$> try (string "e'" <|> string "E'")
> <*> return "'" <*> normalStringSuffix True ""
> ,csString']
> | otherwise = csString'
> csString' = SqlString
> <$> try cs
> <*> return "'"
> <*> normalStringSuffix False ""
> csPrefixes = "nNbBxX"
2016-02-13 15:31:20 +01:00
> cs = choice $ (map (\x -> string ([x] ++ "'")) csPrefixes)
2016-02-12 12:09:58 +01:00
> ++ [string "u&'"
> ,string "U&'"]
2015-07-31 23:04:18 +02:00
> hostParam :: Dialect -> Parser Token
2016-02-13 15:31:20 +01:00
use try for postgres because we also support : and :: as symbols
There might be a problem with parsing e.g. a[1:b]
> hostParam d | diSyntaxFlavour d == Postgres =
> HostParam <$> try (char ':' *> identifierString)
2015-07-31 23:04:18 +02:00
> hostParam _ = HostParam <$> (char ':' *> identifierString)
2016-02-15 19:32:26 +01:00
> prefixedVariable :: Dialect -> Parser Token
> prefixedVariable d | diSyntaxFlavour d == SQLServer =
> PrefixedVariable <$> char '@' <*> identifierString
> prefixedVariable d | diSyntaxFlavour d == Oracle =
> PrefixedVariable <$> char '#' <*> identifierString
> prefixedVariable _ = guard False *> fail "unpossible"
2016-02-13 15:31:20 +01:00
> positionalArg :: Dialect -> Parser Token
> positionalArg d | diSyntaxFlavour d == Postgres =
> -- use try to avoid ambiguities with other syntax which starts with dollar
> PositionalArg <$> try (char '$' *> (read <$> many1 digit))
2016-02-15 19:32:26 +01:00
> positionalArg _ = guard False *> fail "unpossible"
2015-07-31 23:04:18 +02:00
digits
digits.[digits][e[+-]digits]
[digits].digits[e[+-]digits]
digitse[+-]digits
where digits is one or more decimal digits (0 through 9). At least one
digit must be before or after the decimal point, if one is used. At
least one digit must follow the exponent marker (e), if one is
present. There cannot be any spaces or other characters embedded in
the constant. Note that any leading plus or minus sign is not actually
considered part of the constant; it is an operator applied to the
constant.
2015-08-01 17:08:54 +02:00
> sqlNumber :: Dialect -> Parser Token
2016-02-15 19:31:06 +01:00
> sqlNumber _ =
> SqlNumber <$> completeNumber
> -- this is for definitely avoiding possibly ambiguous source
> <* notFollowedBy (oneOf "eE.")
2015-08-01 17:08:54 +02:00
> where
2016-02-15 19:31:06 +01:00
> completeNumber =
> (int <??> (pp dot <??.> pp int)
> -- try is used in case we read a dot
> -- and it isn't part of a number
> -- if there are any following digits, then we commit
> -- to it being a number and not something else
> <|> try ((++) <$> dot <*> int))
> <??> pp expon
2015-08-01 17:08:54 +02:00
> int = many1 digit
> dot = string "."
> expon = (:) <$> oneOf "eE" <*> sInt
> sInt = (++) <$> option "" (string "+" <|> string "-") <*> int
> pp = (<$$> (++))
2015-07-31 23:04:18 +02:00
A symbol is one of the two character symbols, or one of the single
character symbols in the two lists below.
> symbol :: Dialect -> Parser Token
2016-02-13 15:31:20 +01:00
> symbol d | diSyntaxFlavour d == Postgres =
2016-02-13 19:28:12 +01:00
> Symbol <$> choice (otherSymbol ++ [singlePlusMinus,opMoreChars])
rules
An operator name is a sequence of up to NAMEDATALEN-1 (63 by default) characters from the following list:
+ - * / < > = ~ ! @ # % ^ & | ` ?
There are a few restrictions on operator names, however:
-- and /* cannot appear anywhere in an operator name, since they will be taken as the start of a comment.
A multiple-character operator name cannot end in + or -, unless the name also contains at least one of these characters:
~ ! @ # % ^ & | ` ?
> where
> -- other symbols are all the tokens which parse as symbols in
> -- this lexer which aren't considered operators in postgresql
> -- a single ? is parsed as a operator here instead of an other
> -- symbol because this is the least complex way to do it
> otherSymbol = many1 (char '.') :
> (map (try . string) ["::", ":="]
2016-02-15 19:33:37 +01:00
> ++ map (string . (:[])) "[],;():"
> ++ if allowOdbc d
> then [string "{", string "}"]
> else []
> )
2016-02-13 19:28:12 +01:00
exception char is one of:
~ ! @ # % ^ & | ` ?
which allows the last character of a multi character symbol to be + or
-
> allOpSymbols = "+-*/<>=~!@#%^&|`?"
> -- these are the symbols when if part of a multi character
> -- operator permit the operator to end with a + or - symbol
> exceptionOpSymbols = "~!@#%^&|`?"
2016-02-13 19:34:50 +01:00
2016-02-13 19:28:12 +01:00
> -- special case for parsing a single + or - symbol
> singlePlusMinus = try $ do
2016-02-13 19:34:50 +01:00
> c <- oneOf "+-"
> notFollowedBy $ oneOf allOpSymbols
2016-02-13 19:28:12 +01:00
> return [c]
> -- this is used when we are parsing a potentially multi symbol
> -- operator and we have alread seen one of the 'exception chars'
> -- and so we can end with a + or -
> moreOpCharsException = do
2016-02-15 19:31:06 +01:00
> c <- oneOf (filter (`notElem` "-/*") allOpSymbols)
2016-02-13 19:34:50 +01:00
> -- make sure we don't parse a comment starting token
> -- as part of an operator
> <|> try (char '/' <* notFollowedBy (char '*'))
> <|> try (char '-' <* notFollowedBy (char '-'))
2016-02-15 19:31:06 +01:00
> -- and make sure we don't parse a block comment end
> -- as part of another symbol
> <|> try (char '*' <* notFollowedBy (char '/'))
2016-02-13 19:28:12 +01:00
> (c:) <$> option [] moreOpCharsException
> opMoreChars = choice
2016-02-13 19:34:50 +01:00
> [-- parse an exception char, now we can finish with a + -
> (:)
> <$> oneOf exceptionOpSymbols
> <*> option [] moreOpCharsException
> ,(:)
> <$> (-- parse +, make sure it isn't the last symbol
> try (char '+' <* lookAhead (oneOf allOpSymbols))
> <|> -- parse -, make sure it isn't the last symbol
> -- or the start of a -- comment
> try (char '-'
2016-02-13 19:28:12 +01:00
> <* notFollowedBy (char '-')
2016-02-13 19:34:50 +01:00
> <* lookAhead (oneOf allOpSymbols))
> <|> -- parse / check it isn't the start of a /* comment
> try (char '/' <* notFollowedBy (char '*'))
2016-02-15 19:31:06 +01:00
> <|> -- make sure we don't parse */ as part of a symbol
> try (char '*' <* notFollowedBy (char '/'))
2016-02-13 19:34:50 +01:00
> <|> -- any other ansi operator symbol
2016-02-15 19:31:06 +01:00
> oneOf "<>=")
2016-02-13 19:34:50 +01:00
> <*> option [] opMoreChars
2016-02-13 19:28:12 +01:00
> ]
2016-02-15 19:20:24 +01:00
> symbol d | diSyntaxFlavour d == SQLServer =
> Symbol <$> choice (otherSymbol ++ regularOp)
> where
> otherSymbol = many1 (char '.') :
2016-02-15 19:33:37 +01:00
> (map (string . (:[])) ",;():?"
> ++ if allowOdbc d
> then [string "{", string "}"]
> else [])
2016-02-15 19:20:24 +01:00
try is used because most of the first characters of the two character
symbols can also be part of a single character symbol
2016-02-15 19:31:06 +01:00
> regularOp = map (try . string) [">=","<=","!=","<>"]
> ++ map (string . (:[])) "+-^*/%~&<>="
> ++ [char '|' *>
> choice ["||" <$ char '|' <* notFollowedBy (char '|')
> ,return "|"]]
2016-02-13 19:28:12 +01:00
2016-02-15 19:33:37 +01:00
> symbol d =
2016-02-13 19:28:12 +01:00
> Symbol <$> choice (otherSymbol ++ regularOp)
> where
> otherSymbol = many1 (char '.') :
2016-02-15 19:33:37 +01:00
> (map (string . (:[])) "[],;():?"
> ++ if allowOdbc d
> then [string "{", string "}"]
> else [])
2016-02-13 19:28:12 +01:00
try is used because most of the first characters of the two character
symbols can also be part of a single character symbol
2016-02-15 19:31:06 +01:00
> regularOp = map (try . string) [">=","<=","!=","<>"]
> ++ map (string . (:[])) "+-^*/%~&<>=[]"
> ++ [char '|' *>
> choice ["||" <$ char '|' <* notFollowedBy (char '|')
> ,return "|"]]
2016-02-13 19:28:12 +01:00
2015-07-31 23:04:18 +02:00
> sqlWhitespace :: Dialect -> Parser Token
> sqlWhitespace _ = Whitespace <$> many1 (satisfy isSpace)
> lineComment :: Dialect -> Parser Token
> lineComment _ =
> (\s -> LineComment $ concat ["--",s]) <$>
2015-08-01 17:08:54 +02:00
> -- try is used here in case we see a - symbol
> -- once we read two -- then we commit to the comment token
2015-08-02 14:29:35 +02:00
> (try (string "--") *> (
> -- todo: there must be a better way to do this
> conc <$> manyTill anyChar (lookAhead lineCommentEnd) <*> lineCommentEnd))
> where
> conc a Nothing = a
> conc a (Just b) = a ++ b
> lineCommentEnd =
> Just "\n" <$ char '\n'
> <|> Nothing <$ eof
2015-07-31 23:04:18 +02:00
2015-08-01 17:08:54 +02:00
Try is used in the block comment for the two symbol bits because we
want to backtrack if we read the first symbol but the second symbol
isn't there.
2015-07-31 23:04:18 +02:00
> blockComment :: Dialect -> Parser Token
> blockComment _ =
> (\s -> BlockComment $ concat ["/*",s]) <$>
> (try (string "/*") *> commentSuffix 0)
> where
> commentSuffix :: Int -> Parser String
> commentSuffix n = do
> -- read until a possible end comment or nested comment
> x <- takeWhile (\e -> e /= '/' && e /= '*')
> choice [-- close comment: if the nesting is 0, done
> -- otherwise recurse on commentSuffix
> try (string "*/") *> let t = concat [x,"*/"]
> in if n == 0
> then return t
> else (\s -> concat [t,s]) <$> commentSuffix (n - 1)
> -- nested comment, recurse
> ,try (string "/*") *> ((\s -> concat [x,"/*",s]) <$> commentSuffix (n + 1))
> -- not an end comment or nested comment, continue
2015-08-01 12:22:07 +02:00
> ,(\c s -> x ++ [c] ++ s) <$> anyChar <*> commentSuffix n]
2015-07-31 23:04:18 +02:00
2016-02-15 19:31:06 +01:00
This is to improve user experience: provide an error if we see */
outside a comment. This could potentially break postgres ops with */
in (which is a stupid thing to do). In other cases, the user should
write * / instead (I can't think of any cases when this would be valid
syntax though).
> dontParseEndBlockComment :: Dialect -> Parser Token
> dontParseEndBlockComment _ =
> -- don't use try, then it should commit to the error
> try (string "*/") *> fail "comment end without comment start"
2015-08-01 17:08:54 +02:00
Some helper combinators
2015-07-31 23:04:18 +02:00
> startsWith :: (Char -> Bool) -> (Char -> Bool) -> Parser String
> startsWith p ps = do
> c <- satisfy p
> choice [(:) c <$> (takeWhile1 ps)
> ,return [c]]
2015-08-01 12:22:07 +02:00
> takeWhile1 :: (Char -> Bool) -> Parser String
> takeWhile1 p = many1 (satisfy p)
> takeWhile :: (Char -> Bool) -> Parser String
> takeWhile p = many (satisfy p)
> takeTill :: (Char -> Bool) -> Parser String
2016-02-15 19:20:24 +01:00
> takeTill p = manyTill anyChar (peekSatisfy p)
2015-08-01 12:22:07 +02:00
> peekSatisfy :: (Char -> Bool) -> Parser ()
2016-02-15 19:20:24 +01:00
> peekSatisfy p = void $ lookAhead (satisfy p)
This utility function will accurately report if the two tokens are
pretty printed, if they should lex back to the same two tokens. This
function is used in testing (and can be used in other places), and
must not be implemented by actually trying to print and then lex
(because then we would have the risk of thinking two tokens cannot be
together when there is bug in the lexer and it should be possible to
put them together.
2016-02-13 10:40:49 +01:00
2016-02-15 19:20:24 +01:00
question: maybe pretty printing the tokens separately and then
analysing the concrete syntax without concatting the two printed
tokens together is a better way of doing this?
2016-02-13 10:40:49 +01:00
2016-02-15 19:20:24 +01:00
maybe do some quick checking to make sure this function only gives
true negatives: check pairs which return false actually fail to lex or
give different symbols in return
2016-02-13 10:40:49 +01:00
2016-02-15 19:31:06 +01:00
a good sanity test for this function is to change it to always return
true, then check that the automated tests return the same number of
successes.
2016-02-15 19:20:24 +01:00
> tokenListWillPrintAndLex :: Dialect -> [Token] -> Bool
> tokenListWillPrintAndLex _ [] = True
> tokenListWillPrintAndLex _ [_] = True
> tokenListWillPrintAndLex d (a:b:xs) =
> tokensWillPrintAndLex d a b && tokenListWillPrintAndLex d (b:xs)
> tokensWillPrintAndLex :: Dialect -> Token -> Token -> Bool
2016-02-15 19:34:04 +01:00
TODO: add more memoization, e.g. create a wrapper which pretty prints
both tokens so the pretty printed token can be reused in multiple
cases.
a : followed by an identifier character will look like a host param
followed by = or : makes a different symbol
> tokensWillPrintAndLex d (Symbol ":") b
> | (b':_) <- prettyToken d b
> , isIdentifierChar b' || b' `elem` ":=" = False
2016-02-15 19:20:24 +01:00
two symbols next to eachother will fail if the symbols can combine and
(possibly just the prefix) look like a different symbol, or if they
combine to look like comment markers
2016-02-15 19:34:04 +01:00
> tokensWillPrintAndLex (Dialect {diSyntaxFlavour = Postgres}) (Symbol a) (Symbol x)
> | x `notElem` ["+", "-"] = False
> | or (map (`elem` a) "~!@#%^&|`?") = False
> tokensWillPrintAndLex _ (Symbol s1) (Symbol s2)
> | (s1,s2) `elem`
> [("<",">")
> ,("<","=")
> ,(">","=")
> ,("!","=")
> ,("|","|")
> ,("||","|")
> ,("|","||")
> ,("||","||")
> ,("<",">=")
> ] = False
List explicitly all the cases which should fail
2016-02-15 19:20:24 +01:00
two whitespaces will be combined
> tokensWillPrintAndLex _ Whitespace {} Whitespace {} = False
line comment without a newline at the end will eat the next token
> tokensWillPrintAndLex _ (LineComment s@(_:_)) _ = last s == '\n'
this should never happen, but the case satisfies the haskell compiler
and isn't exactly wrong
> tokensWillPrintAndLex _ (LineComment []) _ = False
2016-02-15 19:34:04 +01:00
a token which ends with - followed by another token which starts with
- will turn into a line comment
> tokensWillPrintAndLex d a b
> | (a'@(_:_),('-':_)) <- (prettyToken d a, prettyToken d b)
> , last a' == '-' = False
a token which ends with * followed by a / at the start of the next
token will cause a problem
> tokensWillPrintAndLex d a b
> | (a'@(_:_),('/':_)) <- (prettyToken d a, prettyToken d b)
> , last a' == '*' = False
The reverse is a problem also: ending with / then the next one
starting with * will create the start of a block comment
todo: write a helper function for a predicate on the last char of the first token and the first char of the second token since this appears quite a few times
> tokensWillPrintAndLex d a b
> | (a'@(_:_),('*':_)) <- (prettyToken d a, prettyToken d b)
> , last a' == '/' = False
a symbol will absorb a following .
TODO: not 100% on this
> tokensWillPrintAndLex d Symbol {} b
> | ('.':_) <- prettyToken d b = False
unquoted identifier followed by an identifier letter
> tokensWillPrintAndLex d (Identifier Nothing _) b
> | (b':_) <- prettyToken d b
> , isIdentifierChar b' = False
two quoted identifiers with the same quote next to each other will
parse back as one identifier with the quote symbol in the middle
> tokensWillPrintAndLex _ (Identifier (Just (_,[a])) _) (Identifier (Just ([b],_)) _)
> | a == b = False
host param followed by an identifier char will be absorbed
> tokensWillPrintAndLex d HostParam {} b
> | (b':_) <- prettyToken d b
> , isIdentifierChar b' = False
prefixed variable same:
> tokensWillPrintAndLex d PrefixedVariable {} b
> | (b':_) <- prettyToken d b
> , isIdentifierChar b' = False
a positional arg will absorb a following digit
> tokensWillPrintAndLex d PositionalArg {} b
> | (b':_) <- prettyToken d b
> , isDigit b' = False
a string ending with ' followed by a token starting with ' will be absorbed
> tokensWillPrintAndLex d (SqlString _q00 "'" _s0) b
> | ('\'':_) <- prettyToken d b = False
a number followed by a . will fail or be absorbed
> tokensWillPrintAndLex d SqlNumber {} b
> | ('.':_) <- prettyToken d b = False
a number followed by an e or E will fail or be absorbed
> tokensWillPrintAndLex d SqlNumber {} b
> | ('e':_) <- prettyToken d b = False
> | ('E':_) <- prettyToken d b = False
two numbers next to eachother will fail or be absorbed
> tokensWillPrintAndLex _ SqlNumber {} SqlNumber {} = False
> tokensWillPrintAndLex _ _ _ = True
2016-02-15 19:20:24 +01:00
todo: special case lexer so a second ., and . and e are not
allowed after exponent when there is no whitespace, even if there
is an unambiguous parse
TODO:
2016-02-15 19:31:06 +01:00
refactor the tokenswillprintlex to be based on pretty printing the
2016-02-15 19:33:11 +01:00
individual tokens
make the tokenswill print more dialect accurate. Maybe add symbol
chars and identifier chars to the dialect definition and use them from
here
2016-02-15 19:31:06 +01:00
start adding negative / different parse dialect tests
2016-02-15 19:20:24 +01:00
add token tables and tests for oracle, sql server
2016-02-15 19:31:06 +01:00
review existing tables
2016-02-15 19:32:26 +01:00
look for refactoring opportunities, especially the token
generation tables in the tests
2016-02-15 19:20:24 +01:00
add odbc as a dialect flag and include {} as symbols when enabled
do some user documentation on lexing, and lexing/dialects
2016-02-13 10:40:49 +01:00
2016-02-15 19:20:24 +01:00
start thinking about a more separated design for the dialect handling
2016-02-15 19:34:04 +01:00
make sure other symbols repeated are protected like | || where neccessary
such as :