Distinct vs. Group By (SQL)
DISTINCT and GROUP BY are having different semantics even if they happen to have equivalent results. However, from a performance point of view, it is not the same always!
In case of DISTINCT, the rows are redistributed immediately without any pre-aggregation (count, average, etc.) taking place, while in the case of GROUP BY, pre-aggregation is done first and then the unique values are redistributed.
When to choose DISTINCT or GROUP BY?
Having said that, when we have many different values, the pre-aggregation process of GROUP BY is not very efficient as Teradata has to sort the data to remove duplicates (done as part of pre-aggregation).
So, in case of different/distinct data, it may be better to do the redistribution first i.e., use the DISTINCT statement.
In case, if there are many duplicate values, the GROUP BY statement is probably the better choice as only once the removal of duplication step takes place after redistribution.
In short,
GROUP BY — When many duplicates
DISTINCT — Few duplicates and more distinct data.
Points to be considered for Teradata
At times, when using DISTINCT, we may run out of spool space on Access Module Processor (AMP). The reason is that the redistribution takes place immediately and could cause AMPs to run out of space.
If this happens, probably GROUP BY is a better option as duplicates are already removed in the first step and less data are moved across the AMPs.