Data Classification for Google BigQuery - Predicting Cost
Understand the costs involved in running data classification on BigQuery datasets
Notes
- The costs described in this article are charges to your GCP account, not part of your MineOS subscription.
- The information below applies to BigQuery tables, not views.
How MineOS Smart Sampling Works with Google BigQuery
MineOS uses a smart sampling approach to classify data in your BigQuery environment efficiently. Rather than scanning entire datasets, we analyze a statistically representative sample of each table to identify data types (PII, PHI, PCI, GDPR special categories, etc.) while minimizing query costs.
Key Features:
- Project-level scanning: Classification is configured at the BigQuery Project level for comprehensive coverage
- Statistical sampling: We use proven statistical formulas to determine the minimum sample size needed to accurately represent your entire dataset
- Cost-optimized queries: Our approach is specifically designed to minimize BigQuery processing costs
Understanding BigQuery Costs
BigQuery charges are based on the number of bytes processed (billed bytes), not the number of rows returned. This means:
- Minimum billing: 10MB per query
- Block-based processing: BigQuery processes data in memory blocks (typically ~64MB each)
- Cost is determined by how many blocks are scanned, not how many rows are examined
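The billing rules above can be sketched as a small cost estimator. This is an illustrative sketch, not MineOS code; the $5-per-TB on-demand rate is the one assumed in the example later in this article, and your actual GCP rate may differ.

```python
# Minimal sketch of on-demand BigQuery billing, assuming the $5/TB
# rate used in the example later in this article (rates vary by region
# and over time). Bytes scanned, not rows returned, drive the cost.

MIN_BILLED_BYTES = 10 * 1024**2   # 10 MB minimum billed per query
PRICE_PER_TB_USD = 5.0            # assumed on-demand rate

def estimate_query_cost_usd(bytes_scanned: int) -> float:
    """Estimate on-demand cost for a query scanning `bytes_scanned` bytes."""
    billed = max(bytes_scanned, MIN_BILLED_BYTES)   # 10 MB floor
    return billed / 1024**4 * PRICE_PER_TB_USD

print(f"{estimate_query_cost_usd(1_000):.8f}")    # tiny query hits the 10 MB floor: 0.00004768
print(f"{estimate_query_cost_usd(1024**4):.2f}")  # full 1 TB scan: 5.00
```

This is why small tables fall into the "negligible cost" bucket: anything under 10MB is billed at the same flat minimum.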
MineOS Cost Control Mechanisms
We've implemented multiple safeguards to keep your BigQuery scanning costs predictable and minimal:
- Metadata-First Approach
We retrieve table row counts from BigQuery metadata rather than running expensive COUNT(*) queries, eliminating unnecessary data processing.
- Block-Aware Optimization
Our sampling logic is optimized for BigQuery's memory-block architecture, calculating the most efficient sample percentage to avoid processing mostly empty blocks while maintaining statistical accuracy.
- Percentage-Based Table Sampling
We use BigQuery's TABLESAMPLE clause with calculated percentages. For example, if we need to sample 20% of a table, BigQuery processes approximately 20% of its memory blocks rather than the full dataset.
Note: If your tables are partitioned, costs may be slightly higher, since memory blocks cannot span partitions. Tables with many small partitions may require scanning more blocks than non-partitioned tables of the same size.
- Hard Cap Protection
We enforce a maximum limit of 2TB processed per table during sampling. If a query would exceed this threshold, it is automatically canceled at no cost, preventing unexpected billing spikes.
- Sample Size Limits
We apply maximum row caps per table based on statistical relevance, ensuring we never process more data than necessary for accurate classification.
Cost Transparency & Estimates
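To make the sampling mechanics described above concrete: a standard way to pick a "statistically relevant" row cap is Cochran's sample-size formula with a finite-population correction, then translate that into a TABLESAMPLE percentage. The formula, its parameters (95% confidence, ±5% margin), and the query shape below are a generic statistics sketch, not MineOS's actual implementation.

```python
import math

def sample_size(population: int, z: float = 1.96, margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's sample-size formula with finite-population correction.
    z=1.96 (95% confidence), margin=5%, worst-case p=0.5 are illustrative values."""
    n0 = z**2 * p * (1 - p) / margin**2          # ~385 rows for an infinite population
    return math.ceil(n0 / (1 + (n0 - 1) / population))

def sampling_query(table: str, total_rows: int) -> str:
    """Build a TABLESAMPLE query covering at least the required row count.
    BigQuery samples whole memory blocks, so the percentage also bounds billed bytes."""
    pct = max(1, math.ceil(100 * sample_size(total_rows) / total_rows))
    return f"SELECT * FROM `{table}` TABLESAMPLE SYSTEM ({pct} PERCENT)"

# The required sample grows very slowly with table size, so large tables
# need only a tiny percentage (table name here is hypothetical):
print(sampling_query("my_project.my_dataset.users", 10_000_000))
```

The key property is that the required sample barely grows with table size, which is why the sampled fraction shrinks from a few percent on medium tables to well under 1% on large ones.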
Typical Cost Profile:
For most BigQuery environments, smart sampling costs are minimal compared to full data scans:
- Small tables (<1GB): Often fall under the 10MB minimum, resulting in negligible costs
- Medium tables (1-100GB): Sampling typically processes 1-5% of total data
- Large tables (>100GB): Smart sampling processes a statistically valid subset, often <1% of total data
Example:
- Full table scan of 1TB table: ~$5 USD (processing 1TB)
- MineOS smart sample of same table: ~$0.05-0.25 USD (processing 10-50GB sample)
Best Practices for Cost Management
- Start with a subset: Test classification on a few tables or a single project before scaling to your entire BigQuery environment
- Review table metadata: Ensure unnecessary or archived tables are excluded from scanning scope
- Monitor BigQuery billing: Track actual costs in your GCP console during and after scanning
- Leverage partitioned tables: While partitioning doesn't reduce our sampling costs directly, it helps organize data for more targeted classification scopes
- Contact support: If you have concerns about specific large tables or cost thresholds, reach out to our team—we can adjust sampling parameters for your environment
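For the "Monitor BigQuery billing" practice above, billed bytes for recent query jobs can be totaled from BigQuery's INFORMATION_SCHEMA.JOBS_BY_PROJECT view. In this sketch, the `region-us` qualifier, the 7-day window, and the $5/TB rate are assumptions to adapt for your environment.

```python
# Sketch: audit recent query spend in a project. Run AUDIT_SQL in the
# BigQuery console or bq CLI, then convert the result to dollars.
# `region-us`, the 7-day window, and the $5/TB rate are assumptions.
AUDIT_SQL = """
SELECT SUM(total_bytes_billed) AS bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
"""

def billed_bytes_to_usd(bytes_billed: int, price_per_tb_usd: float = 5.0) -> float:
    """Convert billed bytes to an approximate on-demand dollar cost."""
    return bytes_billed / 10**12 * price_per_tb_usd

print(billed_bytes_to_usd(50 * 10**9))   # 50 GB billed -> 0.25
```

Comparing this total before and after a classification run gives you the actual cost of the scan, which you can check against the estimates in this article.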
Frequently Asked Questions
Q: Will I be charged for the BigQuery queries MineOS runs?
Yes, BigQuery queries initiated by MineOS will appear in your GCP billing under your project. However, our smart sampling approach minimizes these costs significantly compared to full table scans.
Q: Can I set a budget limit for classification costs?
Yes, we can work with you to define acceptable cost thresholds and adjust scanning scope accordingly. Our 2TB per-table hard cap provides automatic protection against runaway costs.
Q: How often does classification need to run?
Smart sampling is typically a one-time or periodic activity (quarterly/annually), not continuous. Once data types are classified, you only need to re-scan when significant schema or data changes occur.
Q: What if I have very large tables (multi-TB)?
Our hard cap protection prevents processing more than 2TB per table. For extremely large tables, we recommend reviewing sampling parameters with our team to balance cost and classification accuracy.