AFL operators and macros as R functions
A SciDB database connection object returned by scidb::scidbconnect()
contains a list of available AFL operators and macros presented as R functions. The functions return scidb objects, and those objects may be used as function arguments to compose complex AFL query expressions. RStudio users can use code completion to find available operations (via tab completion on db$
) and inside the function body to see suggested AFL operator argument completions. (via tab completion on, e.g., db$build()
).
x <- db$build("<v:double>[i=1:2,2,0, j=1:3,1,0]", i * j)
y <- db$apply(x, a, 2 * v, b, "'a character string'")
print(y)
SciDB expression apply(build(<v:double>[i=1:2,2,0, j...
SciDB schema <v:double,a:double,b:string NOT NULL> [i=1:2:0:2; j=1:3:0:1]
variable dimension type nullable start end chunk
1 i TRUE int64 FALSE 1 2 2
2 j TRUE int64 FALSE 1 3 1
3 v FALSE double TRUE
4 a FALSE double TRUE
5 b FALSE string FALSE
Note that in the above expression i
and j
are treated verbatim as SciDB dimension values, not R values. The function argument evaluation procedure in SciDB AFL methods is, in order of precedence:
- Arguments that evaluate to SciDB arrays are replaced by their corresponding SciDB expression.
- Character strings are replaced by their unquoted value. (This means that you usually need to supply quote marks (’) in values you wish SciDB to treat as strings.)
- All remaining arguments are passed verbatim into the SciDB query expression, unless escaped by the special
R()
function (see next section).
Certain SciDB schema literal values represent one important exception to the above process. SciDB schemas contain characters like <
, :
, [
that correspond to R infix operators. It’s difficult to always overload their evaluation as R operators. Thus, SciDB schemas must simply always be supplied as R character strings instead of verbatim text as shown in the following example (also used above).
x <- db$build("<v:double>[i=1:2,2,0, j=1:3,1,0]", i * j)
Using scalar-valued R expressions within AFL statements
Use the special R()
function syntax within any portion of an AFL statement to replace the enclosed R expression with its scalar value. The R expression is evaluated in the calling environment (the parent frame of the AFL function). The evaluated R expression must be a scalar character, numeric, or logical value. The next example uses two R values, one for the upper limit of the j
dimension and another in the build statement.
jlim <- 3
z <- pi
x <- db$build("<v:double>[i=1:2,2,0, j=1:R(jlim),1,0]", i * j + R(z))
Use this syntax to mix scalar R values into SciDB AFL expressions.
Getting help for AFL expressions
Use the TAB key for argument lists within an AFL function (RStudio only). The standard R help()
function shows SciDB doxygen help for AFL function arguments, for instance
scidb::help(db$aggregate)
aggregate( srcArray {, AGGREGATE_CALL}+ {, groupbyDim}* {, chunkSize}* )
Calculates aggregates over groups of values in an array, given the
aggregate types and attributes to aggregate on.
Input
- srcArray: a source array with srcAttrs and srcDims.
- 1 or more aggregate calls.
Each aggregate call has an AGGREGATE_FUNC, an inputAttr and a
resultName. The default resultName is inputAttr followed by '_'
and then AGGREGATE_FUNC. For instance, the default resultName
for sum(sales) is 'sales_sum'. The count aggregate may take *
as the input attribute, meaning to count all the items in the
group including null items. The default resultName for count(*)
is 'count'.
- 0 or more dimensions that together determines the grouping
criteria.
- 0 or numGroupbyDims chunk sizes.
If no chunk size is given, the groupby dims will inherit chunk
sizes from the input array. If at least one chunk size is given,
the number of chunk sizes must be equal to the number of groupby
dimensions, and the groupby dimensions will use the specified
chunk sizes.
Output array
<
The aggregate calls' resultNames.
>
[
The list of groupbyDims if provided (with the specified chunk sizes
if provided),
or
'i' if no groupbyDim is provided.
]
Examples
- Given array A <quantity: uint64, sales:double> [year, item] =
year, item, quantity, sales
2011, 2, 7, 31.64
2011, 3, 6, 19.98
2012, 1, 5, 41.65
2012, 2, 9, 40.68
2012, 3, 8, 26.64
- aggregate(A, count(*), max(quantity), sum(sales), year) <count:
uint64, quantity_max: uint64, sales_sum: double> [year] =
year, count, quantity_max, sales_sum
2011, 2, 7, 51.62
2012, 3, 9, 108.97
Errors
n/a
Notes
- All the aggregate functions ignore null values, except count(*).
/
Alternatively, supply a character string argument naming a SciDB operator to the aflhelp()
function to show SciDB help for the operator.
scidb::aflhelp(db, "aggregate")
SciDB aliasing in AFL expressions
Use the special %as%
R infix operator in place of the usual AFL word “as” in SciDB expressions using aliasing, either to rename a SciDB array, expression, or certain AFL expression arguments often in aggregations.
Warning in .PreprocessDfTypes(X, types, use_aio_input): Attribute names have been changed
instance_id value_no Species avg
1 0 0 setosa 1.462
2 0 1 virginica 5.552
3 2 0 versicolor 4.260
Mapping R expressions to AFL expressions
Creating filters for SciDB queries typically involves writing boolean expressions, whose AFL syntax can differ from R syntax in many ways. To ease this transition, the experimental scidb::expression_parser()
, scidb::filter()
, and scidb::filter_to_afl()
methods offer some support for mapping R syntax to AFL syntax, as well as enabling other convenient methods for creating filter expressions.
Synopsis
The scidb::filter
utility allows users to create symbolic boolean expressions referring to attributes present in SciDB arrays as well as local variables in the R environment. These can be evaluated in the SciDB database and used to filter the data before returning them to the user, for improved performance.
The output of the function scidb::filter
is a list of objects in an R6 class hierarchy inheriting from scidb::scidb.expression
, and some helpful methods are available as described in its documentation. Alternatively, the method scidb::filter_to_afl
returns a scalar AFL strings, equivalent to the boolean conjunction of all inputs expressions.
Syntax
The scidb::filter
and scidb::filter_to_afl
functions support a syntax similar to regular R syntax, and allow the use of various numeric and string operators, as outlined in its documentation page. There are three basic formats for a filter expression: * an expression made up of symbols corresponding to properties in SciDB tables, literal and interpolated values, and various operators * a symbol name – predicate pair in the form symbol = predicate
, where the predicate is an anonymous unary predicate or composition of such objects * a symbol name – value pair in the form symbol = value
, where the value is a scalar or vector quantity and is interpreted as requiring that the value of the corresponding property be equal to one of the listed options.
Regular logical operators are supported. A number of allowed binary comparison operators acting on SciDB types are supported as well. Constant values may be used inserted into a filter expression either as R literals or user variables prefixed with !!
; see the section on value substitution for detail on the latter. Finally, some arithmetic and string functions can be used to build more complicated expressions.
Symbols present in the expression, besides those reserved for various operators, are assumed to correspond to attributes or dimensions of the table being queried, but are not validated until the query is executed by the database.
Logical operators
A scidb::filter
expression may contain a comma-delimited list of expressions, which is interpreted as the logical conjunction of the separate expressions – i.e., requiring all the expressions to be evaluated as true. This can also be done via the logical AND operator &&
, demonstrated in the below, equivalent queries:
scidb::filter_to_afl(A, B, C)
[1] "(A and B) and C"
scidb::filter_to_afl(A && B && C)
[1] "(A and B) and C"
The logical OR operator ||
can be used to require at least one of multiple expressions to be true:
scidb::filter_to_afl(A || B || C)
[1] "(A or B) or C"
Finally, the logical negation operator !
is supported, as in these equivalent expressions, which evaluate to the same AFL due to logic pushing down negation via De Morgan’s laws:
scidb::filter_to_afl(!(A || B || C))
[1] "(not(A) and not(B)) and not(C)"
scidb::filter_to_afl(!A && !B && !C)
[1] "(not(A) and not(B)) and not(C)"
The logical precedence of these operators follows the standard convention that !
takes the highest precedence, followed by &&
, and finally ||
with the lowest precedence.
Comparison operators
Generic comparison operators
The ==
and !=
are supported for both string fields and numeric fields:
scidb::filter_to_afl(x != 'unknown')
[1] "x != 'unknown'"
When the two sides of these operators are different types, normal casting rules are applied if possible; e.g. different arithemtic types may be compared, but a string value may not be compared to an arithmetic value. When the field takes on the null value, the result of these expressions is always null.
Numeric comparison operators
The supported numeric comparison operators are >
, >=
, <
, and <=
, as shown in these examples:
scidb::filter_to_afl(b > 1000)
[1] "b > 1000"
scidb::filter_to_afl(b > 2000 || c >= 60)
[1] "(b > 2000) or (c >= 60)"
When the two sides of these comparison operators are different types, normal casting rules are applied. When the field takes on the null value, the result of these expressions is always null.
String comparison operators
The supported comparison operators for string fields are the binary operators %like%
, %contains%
, %starts_with%
, and %ends_with%
.
The %like%
operator takes a regular expression as the right-hand operand. Inside the regular expression pattern, special characters such as *
and .
take special meanings; please refer to the R regular expression syntax for more details. In this example, the pattern [*][0-9]{2,}
matches a literal *
asterisk followed by at least 2 digits, and the initial and terminal .*
placeholders mean that this pattern can be present anywhere in the annotation string:
scidb::filter_to_afl(x %like% '.*[*][0-9]{2,}.*')
[1] "regex(x, '(?i).*[*][0-9]{2,}.*')"
The operators %contains%
, %starts_with%
, and %ends_with%
are convenience methods for substring matching, with the former allowing a floating position for the substring specified as the right-hand operand and the latter fixing the substring’s allowed position. For example, to require a string to end in a literal *
, one could create the filter
scidb::filter_to_afl(x %ends_with% '[*]')
[1] "regex(x, '(?i).*\\[\\*\\]')"
All four of these methods require escaping special regex characters to match them literally, with the simplest method being to create single element character classes, e.g. [*]
as shown above.
Set membership operators
The operators %in%
and %not_in%
are set operators, testing the membership of left-hand operand in the vector in the right-hand operand. Those vectors can be included as literals in the expression or as interpolated R variables:
scidb::filter_to_afl(a %in% c(3,5,7))
[1] "((a = 3) or (a = 5)) or (a = 7)"
Value substitution
A filter expression can directly use literal R characters and numeric values, as shown here where the literal value 1000 is used directly:
scidb::filter_to_afl(b > 1000)
[1] "b > 1000"
Alternatively, a user might prefer to store the value in an R variable. In this case, the variable must be prefixed with !!
:
aValues <- c(3,5,7)
scidb::filter_to_afl(a %in% !!aValues)
[1] "((a = 3) or (a = 5)) or (a = 7)"
myThreshold <- 1000
scidb::filter_to_afl(b > !!myThreshold)
[1] "b > 1000"
If the !!
is omitted, the variable name will be assumed to be a field in the SciDB array, resulting in either an incorrect result when the name happens to be a field in the array or, more likely, an error when SciDB cannot find a field of the given name:
scidb::filter_to_afl(b > myThreshold) # error -- should add !! before myThreshold
[1] "b > myThreshold"
These errors cannot be detected when the filter expression is created, and will not be apparent until the query is executed on the SciDB server.
If the name of the SciDB field is itself stored in a variable, it can be injected into an expression via the use of the rlang::sym
function:
key <- "x"
scidb::filter_to_afl(!!rlang::sym(key) == "value")
[1] "x = 'value'"
Mutated fields in filter expressions
The various examples above show how a single field can be compared to a reference literal or symbolic value. The values of fields can also be modified using appropriate functions acting on one or more string or numeric type, to build more complicated boolean expressions.
String functions
The binary operator +
can be used to as a string concatenation operator, to join strings from different fields and/or literals. For example, the filter below first constructs a synthetic field a + ':' + b
, then requires that the synthetic field must be one of the strings specified with a %in%
function.
scidb::filter_to_afl(a + ':' + b == "A:B")
[1] "((a + ':') + b) = 'A:B'"
If preferred, it is also possible to use the regular R paste0
method, though it is currently limited to operating as a binary expression as shown in the filter here, completely equivalent to the one above:
scidb::filter_to_afl(paste0(paste0(a,':'),b) == "A:B")
[1] "((a + ':') + b) = 'A:B'"
Additionally, the unary nchar
operator (or, alternatively, strlen
) returns the number of characters in a string field, a value that can then be used in numeric comparisons:
scidb::filter_to_afl(nchar(a) <= 12)
[1] "strlen(a) <= 12"
Arithmetic functions
The binary arithmetic operators +
, -
, *
, and /
can all be used to combine numeric fields, and the results can be filtered with any of the regular numeric comparison operators.
scidb::filter_to_afl( a * b >= 1000)
[1] "(a * b) >= 1000"
Typical rules apply when comparison values of different numeric types or when attempting to divide by zero.
A note about operator precedence
Expressions that are a function of a field’s value must be wrapped in parentheses if they are meant to be the left-hand operand of the %in%
or %not_in%
operators, due to R’s normal precendence rules (save when the expression is just nchar(...)
). A safe practice would be to always wrap these expressions in parenthesis, to prevent any ambiguity in case a ==
or !=
were ever replaced with %in%
or %not_in%
.
Unary predicate expressions
The following unary predicate expressions are all shorthand methods for creating a scidb::filter
object for an arbitrary (anonymous) attribute. In this syntax, the argument to scidb::filter
is a named key-value pair whose key is the name of the attribute or dimension to which the value, a predicate, is to be applied. which takes a list of fields (as names) and their corresponding allowed values or predicates, applied in conjunction. For example, to restrict a field a
to be one of 3, 5, or 7, and simultaneously restrict b
to be equal to 6, one can write
scidb::filter_to_afl(
a = IN(c(3,5,7)),
b = EQUALS(6)
)
[1] "(((a = 3) or (a = 5)) or (a = 7)) and (b = 6)"
For convenience, the list values can be set to a primitive value or vector of values, as a shortcut for the common cases of equality or membership in a set:
scidb::filter_to_afl(
a = c(3,5,7),
b = 6
)
[1] "(((a = 3) or (a = 5)) or (a = 7)) and (b = 6)"
Additionally, predicates can be contained logically using the !
, &
, and |
operations, for example to allow a field to be in one of two ranges with
scidb::filter_to_afl(a = IN_RANGE(5,10) | IN_RANGE(15,20))
[1] "((a >= 5) and (a <= 10)) or ((a >= 15) and (a <= 20))"
When defining unary predicates, one can take advantage of the rlang :=
syntax to set the predicate for a field whose name is stored in an R variable:
key <- "x"
scidb::filter_to_afl(!!key := IN_RANGE(5,10) | IN_RANGE(15,20))
[1] "((x >= 5) and (x <= 10)) or ((x >= 15) and (x <= 20))"
Arithmetic
- Greater than:
GT()
- Greater than or equal to:
GEQ()
- Less than:
LT()
- Less than or equal to:
LEQ()
- In range (inclusive):
IN_RANGE()
andIN_RANGE_INCL()
- In range (exclusive):
IN_RANGE_EXCL()
String matching predicates
- Matching a regexp:
LIKE()
- Matching a substring:
CONTAINS()
- Matching a prefix:
STARTS_WITH()
- Matching a suffix:
ENDS_WITH()
Full list of reserved operators and symbols in filter expressions
- Logical operators:
,
,&
,&&
,|
,||
,!
- Numeric comparison operators:
==
,!=
,>
,>=
,<
,<=
- String comparison operators:
==
,!=
,%contains%
,%starts_with%
,%ends_with%
,%like%
- Set membership operators:
%in%
,%not_in%
- The special symbol
!!
to flag symbols in the local R environment - String and arithmetic functions:
+
,-
,*
,/
,nchar
- Predicates:
EQUALS
,NEQ
,IN
,NOT_IN
,IS_NULL
,NOT_NULL
- Arithmetic predicates:
GEQ
,GT
,LEQ
,LT
,IN_RANGE
,IN_RANGE_INCL
,IN_RANGE_EXCL
- Regex predicates:
LIKE
,CONTAINS
,STARTS_WITH
,ENDS_WITH