Input Validation Deep Dive

Introduction

Input validation is the first line of defense against injection attacks. Every piece of data entering an application (form fields, HTTP headers, URL parameters, file uploads, API payloads) must be validated before processing. The principle is simple: never trust user input.

Whitelist vs Blacklist

Whitelist (Allowlist) Validation

Whitelist validation defines what is allowed and rejects everything else. It is far more secure than blacklisting, because anything the author did not anticipate is rejected by default.

    import re

    # Whitelist: only allow specific characters
    def validate_username_whitelist(username):
        """Allow only alphanumeric, underscore, and hyphen (3 to 32 characters)."""
        pattern = r'[a-zA-Z0-9_-]{3,32}'
        # fullmatch anchors both ends; re.match with '$' would accept a trailing newline
        if not re.fullmatch(pattern, username):
            raise ValueError(
                f"Username {username!r} is invalid. "
                "Use 3 to 32 letters, numbers, underscores, or hyphens."
            )
        return username

    # Whitelist for country codes
    ALLOWED_COUNTRIES = {'US', 'CA', 'GB', 'DE', 'FR', 'JP'}

    def validate_country_code(code):
        normalized = code.upper()
        if normalized not in ALLOWED_COUNTRIES:
            raise ValueError(f"Country {code!r} is not in the allowed list")
        return normalized

Blacklist (Blocklist) Validation

Blacklist validation attempts to block known malicious patterns. It is inherently fragile because attackers constantly discover new bypass techniques.

    # WEAK: Blacklist approach (easily bypassed)
    def validate_input_blacklist(input_string):
        # Easily bypassed: an attacker uses alternative syntax
        blocklist = ['<script>', 'SELECT', 'DROP', 'UNION']
        lowered = input_string.lower()
        for pattern in blocklist:
            if pattern.lower() in lowered:
                raise ValueError("Blocked pattern detected")
        return input_string

    # Bypasses:
    #   <scr<script>ipt>   nested tags (defeats filters that strip the pattern once)
    #   SeLeCt             case variation (if no .lower())
    #   Unicode-encoded forms of the blocked characters
    #   /**/SELECT/**/     SQL comment injection

Sanitization Techniques

Sanitization removes or neutralizes dangerous content rather than rejecting the entire input.
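As a minimal sketch of neutralizing rather than rejecting, consider a free-text display name. The helper name sanitize_display_name, the max_length parameter, and the choice to drop all Unicode control and format characters are assumptions made for illustration, not part of the original article.

```python
import unicodedata

def sanitize_display_name(raw: str, max_length: int = 64) -> str:
    """Hypothetical helper: clean up a display name instead of rejecting it."""
    kept = []
    for ch in raw:
        if ch in "\t\r\n":
            # Treat tabs and line breaks as ordinary spaces
            kept.append(" ")
        elif unicodedata.category(ch).startswith("C"):
            # Drop remaining control/format characters (e.g. NUL, ESC)
            continue
        else:
            kept.append(ch)
    # Collapse runs of whitespace, trim the ends, then cap the length
    collapsed = " ".join("".join(kept).split())
    return collapsed[:max_length].rstrip()

print(sanitize_display_name("  Alice\x00\t Smith\r\n"))  # Alice Smith
```

Character filtering like this suits short plain-text fields. Rich content such as HTML needs a parser-based sanitizer instead.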
    import ipaddress
    from urllib.parse import urlparse

    import bleach
    from markupsafe import escape  # used for plain-text fields later on

    # HTML sanitization with bleach
    def sanitize_html(user_content):
        allowed_tags = {'p', 'b', 'i', 'u', 'em', 'strong', 'a', 'ul', 'ol', 'li'}
        allowed_attrs = {
            'a': ['href', 'title', 'rel'],
        }
        allowed_protocols = {'http', 'https', 'mailto'}
        cleaned = bleach.clean(
            user_content,
            tags=allowed_tags,
            attributes=allowed_attrs,
            protocols=allowed_protocols,
            strip=True,  # Remove disallowed tags (their inner text is kept) instead of escaping them
        )
        return cleaned

    # URL sanitization
    def sanitize_url(user_url):
        parsed = urlparse(user_url)
        # Only allow http and https
        if parsed.scheme not in ('http', 'https'):
            return None
        # Reject URLs without a host
        host = parsed.hostname
        if not host:
            return None
        # Block internal/host-only URLs (hostname has IPv6 brackets stripped)
        blocked_hosts = {'localhost', '127.0.0.1', '0.0.0.0', '::1'}
        if host in blocked_hosts:
            return None
        # Block private IP ranges
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            # Hostname, not an IP literal: allowed here, although it may
            # still resolve to an internal address at request time
            return parsed.geturl()
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_unspecified:
            return None
        return parsed.geturl()

Common Bypass Techniques

Unicode Normalization Bypass

Attackers use Unicode characters that normalize to ASCII equivalents.
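To make the risk concrete, the sketch below shows a check that passes on the raw string and fails once the same string is normalized. The function names and the fullwidth payload are hypothetical examples chosen for illustration; only unicodedata.normalize comes from the standard library.

```python
import unicodedata

def contains_markup(value: str) -> bool:
    """Hypothetical check: flag ASCII angle brackets."""
    return "<" in value or ">" in value

def contains_markup_normalized(value: str) -> bool:
    """Same check, applied to the NFKC-normalized form."""
    return contains_markup(unicodedata.normalize("NFKC", value))

# Fullwidth less-than (U+FF1C) and greater-than (U+FF1E) signs
payload = "\uff1cscript\uff1e"

print(contains_markup(payload))             # False: the raw string has no ASCII brackets
print(contains_markup_normalized(payload))  # True: NFKC maps them to "<" and ">"
```

If any later component normalizes the text (a database, a search index, a rendering layer), a check made only on the raw form has validated a different string from the one that is eventually used.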
    import unicodedata

    # Normalize input before validation
    def normalize_and_validate(input_string):
        # NFKC composes characters and also folds compatibility forms (e.g., ² -> 2)
        strict_normalized = unicodedata.normalize('NFKC', input_string)
        # Validate the normalized version, and use that version from here on
        return validate_username_whitelist(strict_normalized)

    # Example bypass attempt:
    #   Username: "admin²" (with superscript 2)
    #   After NFKC: "admin2"
    # A check made before normalization sees a different string than the
    # one stored afterwards, so name-based rules must run on the normalized form.

Null Byte Injection

    import os

    def validate_filename(filename):
        # Remove null bytes before validation
        filename = filename.replace('\x00', '')
        # Also reject percent-encoded null bytes
        if '%00' in filename:
            raise ValueError("Null bytes not allowed")
        # Path traversal check
        if '..' in filename:
            raise ValueError("Path traversal detected")
        # Ensure extension is safe
        allowed_extensions = {'.pdf', '.png', '.jpg', '.txt'}
        _, ext = os.path.splitext(filename.lower())
        if ext not in allowed_extensions:
            raise ValueError(f"Extension '{ext}' not allowed")
        return filename

Defense in Depth Strategy

    from markupsafe import escape

    class InputValidationPipeline:
        def validate(self, data, schema, context):
            # Layer 1: Schema validation (type, format, constraints).
            # parse_obj is the Pydantic v1 style call; v2 names it model_validate.
            validated = schema.parse_obj(data)
            # Layer 2: Business rule validation
            validated = self._business_rules(validated, context)
            # Layer 3: Sanitization
            validated = self._sanitize(validated, context)
            # Layer 4: Encoding for output context
            # This happens at the template/render layer
            return validated

        def _business_rules(self, data, context):
            # Example: threshold check for financial transactions
            if hasattr(data, 'amount') and data.amount > 10000:
                data.requires_approval = True
            return data

        def _sanitize(self, data, context):
            # sanitize_html is the bleach-based helper from Sanitization Techniques
            if context.get('allow_html'):
                data.body = sanitize_html(data.body)
            else:
                data.body = escape(data.body)
            return data

Conclusion

Whitelist validation is more robust than blacklist validation and should be the default.
Normalize input before validating (NFKC for Unicode), sanitize rather than reject where appropriate, and apply defense in depth with schema validation, business rules, and context-specific sanitization. Remember that input validation is only half the equation — output encoding is equally important and addresses the vulnerabilities that validation misses.
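Output encoding depends on where the value ends up. The sketch below uses only standard-library calls (html.escape, urllib.parse.quote, json.dumps); the three wrapper function names are hypothetical and the choice of contexts is an assumption made for illustration.

```python
import html
import json
from urllib.parse import quote

def encode_for_html(value: str) -> str:
    """Encode for HTML text or a quoted attribute value."""
    return html.escape(value, quote=True)

def encode_for_url_component(value: str) -> str:
    """Percent-encode a single query or path component."""
    return quote(value, safe="")

def encode_for_json(value: str) -> str:
    """Serialize as a JSON string literal, e.g. for an API response body."""
    return json.dumps(value)

print(encode_for_html('<a href="x">'))        # &lt;a href=&quot;x&quot;&gt;
print(encode_for_url_component("a b&c=d/e"))  # a%20b%26c%3Dd%2Fe
print(encode_for_json('He said "hi"'))        # "He said \"hi\""
```

The same input needs a different encoder in each context, which is why encoding belongs at the point of output rather than at the point of input.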
