A Condition is a class for building custom data quality checks. Create a condition, and it will be evaluated after the run is processed. Integrate with email or Slack to receive alerts on condition results via a Run Report. Use Conditions to answer questions such as “Is the average confidence for my training data below 0.25?” or “Has over 20% of my inference data drifted?”
What do I do with Conditions?
You can build a Run Report that will evaluate all conditions after a run is processed.
import dataquality as dq
dq.init("text_classification")
cond1 = dq.Condition(...)
cond2 = dq.Condition(...)
dq.register_run_report(conditions=[cond1, cond2])
dq.register_run_report(conditions=[cond1], emails=["foo@bar.com"])
You can also build and evaluate conditions by accessing the processed DataFrame.
from dataquality import Condition
df = dq.metrics.get_dataframe("proj_name", "run_name", "training")
cond = Condition(...)
passes, ground_truth = cond.evaluate(df)
How do I build a Condition?
A Condition is defined as:
class Condition:
    agg: AggregateFunction
    threshold: float
    operator: Operator
    metric: Optional[str] = None
    filters: Optional[List[ConditionFilter]] = []
To gain an intuition for what can be accomplished, consider the following examples:
- Is the average confidence less than 0.3?
>>> c = Condition(
... agg=AggregateFunction.avg,
... metric="confidence",
... operator=Operator.lt,
... threshold=0.3,
... )
- Is the max DEP greater than or equal to 0.45?
>>> c = Condition(
... agg=AggregateFunction.max,
... metric="data_error_potential",
... operator=Operator.gte,
... threshold=0.45,
... )
By adding filters, you can further narrow down the scope of the condition. If the aggregate function is “pct”, you don’t need to specify a metric, as the filters will determine the percentage of data.
- Alert if over 80% of the dataset has confidence under 0.1
>>> c = Condition(
... operator=Operator.gt,
... threshold=0.8,
... agg=AggregateFunction.pct,
... filters=[
... ConditionFilter(
... metric="confidence", operator=Operator.lt, value=0.1
... ),
... ],
... )
- Alert if at least 20% of the dataset has drifted (Inference DataFrames only)
>>> c = Condition(
... operator=Operator.gte,
... threshold=0.2,
... agg=AggregateFunction.pct,
... filters=[
... ConditionFilter(
... metric="is_drifted", operator=Operator.eq, value=True
... ),
... ],
... )
- Alert if 5% or more of the dataset contains PII
>>> c = Condition(
... operator=Operator.gte,
... threshold=0.05,
... agg=AggregateFunction.pct,
... filters=[
... ConditionFilter(
... metric="galileo_pii", operator=Operator.neq, value="None"
... ),
... ],
... )
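To make the “pct” behaviour concrete, here is a minimal sketch in plain Python rather than the library's own code. It assumes the percentage is the share of all rows that satisfy every filter, and the helper name `pct_matching` and the list-of-dicts data are hypothetical stand-ins for a real DataFrame.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Predicate = Callable[[Row], bool]


def pct_matching(rows: List[Row], predicates: List[Predicate]) -> float:
    """Share of all rows that satisfy every predicate (0.0 for empty input).

    Assumption: the denominator is the full dataset, not a filtered subset.
    """
    if not rows:
        return 0.0
    matched = sum(1 for row in rows if all(p(row) for p in predicates))
    return matched / len(rows)


# Hypothetical data standing in for a processed DataFrame.
rows = [
    {"confidence": 0.05},
    {"confidence": 0.08},
    {"confidence": 0.50},
    {"confidence": 0.90},
]

# "Alert if over 80% of the dataset has confidence under 0.1"
share = pct_matching(rows, [lambda row: row["confidence"] < 0.1])
alert = share > 0.8
```

Here two of the four rows match the filter, so the share falls short of the 0.8 threshold and no alert would fire.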
Complex conditions can be built when the filter has a different metric than the metric used in the condition.
- Alert if the min confidence of drifted data is less than 0.15
>>> c = Condition(
... agg=AggregateFunction.min,
... metric="confidence",
... operator=Operator.lt,
... threshold=0.15,
... filters=[
... ConditionFilter(
... metric="is_drifted", operator=Operator.eq, value=True
... )
... ],
... )
- Alert if over 50% of high DEP (>=0.7) data contains PII:
>>> c = Condition(
... operator=Operator.gt,
... threshold=0.5,
... agg=AggregateFunction.pct,
... filters=[
... ConditionFilter(
... metric="data_error_potential", operator=Operator.gte, value=0.7
... ),
... ConditionFilter(
... metric="galileo_pii", operator=Operator.neq, value="None"
... ),
... ],
... )
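The filter-then-aggregate idea behind these conditions can be sketched in plain Python. This is an illustration under assumptions, not the library's implementation: `REDUCERS` and `aggregate_filtered` are hypothetical names, and rows are plain dicts instead of a DataFrame.

```python
from statistics import mean
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]

# Hypothetical mapping from aggregate names to plain Python reducers.
REDUCERS: Dict[str, Callable[[List[float]], float]] = {
    "avg": mean,
    "min": min,
    "max": max,
    "sum": sum,
}


def aggregate_filtered(
    rows: List[Row], metric: str, agg: str, keep: Callable[[Row], bool]
) -> float:
    """Apply the filter first, then aggregate `metric` over the surviving rows."""
    values = [row[metric] for row in rows if keep(row)]
    if not values:
        raise ValueError("no rows left after filtering")
    return REDUCERS[agg](values)


# Hypothetical data: the filter metric differs from the aggregated metric.
rows = [
    {"confidence": 0.10, "is_drifted": True},
    {"confidence": 0.40, "is_drifted": True},
    {"confidence": 0.05, "is_drifted": False},
]

# "Alert if the min confidence of drifted data is less than 0.15"
lowest = aggregate_filtered(
    rows, "confidence", "min", lambda row: row["is_drifted"] is True
)
alert = lowest < 0.15
```

The row with confidence 0.05 is not drifted, so it is excluded before the minimum is taken. The filter narrows the data and the condition's own metric is aggregated over what remains.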
You can also call a condition directly, which asserts that it holds for a DataFrame.
- Assert that the average confidence is less than 0.3
>>> c = Condition(
... agg=AggregateFunction.avg,
... metric="confidence",
... operator=Operator.lt,
... threshold=0.3,
... )
>>> c(df)
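As a rough mental model of calling a condition, the sketch below uses a hypothetical stand-in class, `AvgBelow`, which is not part of dataquality. It assumes that calling the object evaluates it and raises an AssertionError on failure; the exact behaviour and error message of the real Condition may differ.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


@dataclass
class AvgBelow:
    """Hypothetical stand-in for a Condition with agg=avg and operator=lt."""

    metric: str
    threshold: float

    def evaluate(self, rows: List[Row]) -> Tuple[bool, float]:
        """Return whether the check passes, plus the computed value."""
        value = mean(row[self.metric] for row in rows)
        return value < self.threshold, value

    def __call__(self, rows: List[Row]) -> None:
        """Raise AssertionError if the check does not hold for `rows`."""
        passes, value = self.evaluate(rows)
        if not passes:
            raise AssertionError(
                f"avg {self.metric} = {value}, expected < {self.threshold}"
            )


check = AvgBelow(metric="confidence", threshold=0.3)

# Low-confidence rows: the call returns without raising.
check([{"confidence": 0.1}, {"confidence": 0.2}])

# High-confidence rows: the call raises.
try:
    check([{"confidence": 0.6}, {"confidence": 0.8}])
    raised = False
except AssertionError:
    raised = True
```

Used this way, a condition can act as a guard in a test suite or pipeline step, stopping execution when the data does not meet the expectation.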
Aggregate Function
from dataquality import AggregateFunction
The available aggregate functions are:
class AggregateFunction(str, Enum):
    avg = "avg"
    min = "min"
    max = "max"
    sum = "sum"
    pct = "pct"
Operator
from dataquality import Operator
The available operators are:
class Operator(str, Enum):
    eq = "eq"
    neq = "neq"
    gt = "gt"
    lt = "lt"
    gte = "gte"
    lte = "lte"
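One way to read these operators is as the comparison applied between a computed value and the threshold (or a filter value). The sketch below maps each name to a function from Python's standard `operator` module. The local `Operator` copy, the `COMPARATORS` table, and the `compare` helper are assumptions made for illustration, so that the snippet runs without dataquality installed; the library's internal dispatch may differ.

```python
import operator
from enum import Enum
from typing import Any, Callable, Dict


class Operator(str, Enum):
    """Local copy of the enum listed above, so the sketch is self-contained."""

    eq = "eq"
    neq = "neq"
    gt = "gt"
    lt = "lt"
    gte = "gte"
    lte = "lte"


# Assumed mapping from operator names to stdlib comparison functions.
COMPARATORS: Dict[Operator, Callable[[Any, Any], bool]] = {
    Operator.eq: operator.eq,
    Operator.neq: operator.ne,
    Operator.gt: operator.gt,
    Operator.lt: operator.lt,
    Operator.gte: operator.ge,
    Operator.lte: operator.le,
}


def compare(value: Any, op: Operator, threshold: Any) -> bool:
    """Evaluate `value <op> threshold`, for example 0.25 lt 0.3."""
    return COMPARATORS[op](value, threshold)


below = compare(0.25, Operator.lt, 0.3)           # average confidence under 0.3
at_least = compare(0.45, Operator.gte, 0.45)      # gte includes the boundary
has_pii = compare("email", Operator.neq, "None")  # non-numeric filter value
```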
Metric & Threshold
The metric must be the name of a column in the DataFrame. Threshold is a numeric value for comparison in the Condition.
Alerting
Alerting via email and Slack is in development. Please reach out to Galileo at team@rungalileo.io for more information.