📝 Text Extraction Utility

The Text Extraction Utility is a powerful text parsing and manipulation system designed to extract structured content from user input. It provides comprehensive tools for parsing hashtags, mentions, URLs, emojis, and other structured elements from text, as well as advanced text manipulation capabilities.

🎯 What is Text Extraction?

Text extraction is the process of identifying and extracting specific patterns or structured content from raw text. Think of it as a smart text analyzer that can:

Find hashtags - Identify #tags in social media posts
Extract mentions - Find @username references and user IDs
Locate URLs - Identify web links in text
Parse emojis - Extract emoji shortcodes like 😄
Manipulate text - Insert, replace, and modify text content intelligently

Why Text Extraction is Important

In social media and content management applications, users often include structured elements in their text:

Social engagement - Hashtags help categorize and discover content
User connections - Mentions create links between users
Rich content - URLs and emojis enhance the user experience
Data analysis - Extracted elements can be analyzed for insights

🛠️ Core Components

TextExtractor Class

The main class for extracting structured content from text.

`extractHashtags(text, includeHash)`

Purpose: Extract hashtags from text content.

Parameters:

text (string) - The text to analyze
includeHash (boolean, optional) - Whether to include the # symbol in results (default: true)

Returns: Array of hashtag strings

Example:

const text = 'Check out this #amazing #javascript tutorial!';
const hashtags = TextExtractor.extractHashtags(text);
// Result: ["#amazing", "#javascript"]

const hashtagsWithoutSymbol = TextExtractor.extractHashtags(text, false);
// Result: ["amazing", "javascript"]

How it works:

Uses regex pattern /#(\w+)/g to find hashtag patterns
Captures word characters after the # symbol
Returns array of found hashtags

`extractUserMentions(text)`

Purpose: Extract structured user mentions from text.

Parameters:

text (string) - The text to analyze

Returns: Array of mention objects with:

username (string) - The displayed username
userId (string) - The unique user identifier
filterType (string, optional) - Type of mention filter ('author' or 'mentions')
fullMatch (string) - The complete mention text

Example:

const text = 'Hey @[john_doe|user123] and @[jane_smith|user456|mentions]!';
const mentions = TextExtractor.extractUserMentions(text);
// Result: [
//   {
//     username: "john_doe",
//     userId: "user123",
//     filterType: undefined,
//     fullMatch: "@[john_doe|user123]"
//   },
//   {
//     username: "jane_smith",
//     userId: "user456",
//     filterType: "mentions",
//     fullMatch: "@[jane_smith|user456|mentions]"
//   }
// ]

Mention Format:

Standard: @[username|userId]
Enhanced: @[username|userId|filterType]

`extractUserIds(text)`

Purpose: Extract only the user IDs from mentions.

Parameters:

text (string) - The text to analyze

Returns: Array of user ID strings

Example:

const text = 'Meeting with @[alice|user789] and @[bob|user101]';
const userIds = TextExtractor.extractUserIds(text);
// Result: ["user789", "user101"]

`extractUrls(text)`

Purpose: Extract HTTP/HTTPS URLs from text.

Parameters:

text (string) - The text to analyze

Returns: Array of URL strings

Example:

const text = 'Visit https://example.com and http://test.org for more info';
const urls = TextExtractor.extractUrls(text);
// Result: ["https://example.com", "http://test.org"]

URL Pattern: Matches https?://[^\s<>"{}|\\^[]]+`

`extractEmojiShortcodes(text)`

Purpose: Extract emoji shortcodes from text.

Parameters:

text (string) - The text to analyze

Returns: Array of emoji shortcode strings (without colons)

Example:

const text = "I'm so happy :smile: and excited :tada:!";
const emojis = TextExtractor.extractEmojiShortcodes(text);
// Result: ["smile", "tada"]

`extractAll(text)`

Purpose: Extract all structured content in one operation.

Parameters:

text (string) - The text to analyze

Returns: Object containing all extracted content:

hashtags - Array of hashtags
mentions - Array of mention objects
userIds - Array of user IDs
urls - Array of URLs
emojiShortcodes - Array of emoji shortcodes

Example:

const text =
  'Check out #coding with @[dev|user123] at https://example.com :rocket:';
const extracted = TextExtractor.extractAll(text);
// Result: {
//   hashtags: ["#coding"],
//   mentions: [{ username: "dev", userId: "user123", ... }],
//   userIds: ["user123"],
//   urls: ["https://example.com"],
//   emojiShortcodes: ["rocket"]
// }

TextManipulator Class

Advanced text manipulation and cursor management system.

`insertAtPosition(originalText, insertText, position)`

Purpose: Insert text at a specific position while tracking cursor movement.

Parameters:

originalText (string) - The original text content
insertText (string) - The text to insert
position (number) - The character position to insert at

Returns: Object with:

newText (string) - The modified text
newCursorPosition (number) - Where the cursor should be positioned

Example:

const result = TextManipulator.insertAtPosition(
  'Hello world!',
  ' beautiful',
  5
);
// Result: {
//   newText: "Hello beautiful world!",
//   newCursorPosition: 15
// }

Use cases:

Inserting mentions at cursor position
Adding hashtags or emojis
Auto-completion of text elements

`replaceBetween(originalText, replaceText, startPosition, endPosition)`

Purpose: Replace text between two positions.

Parameters:

originalText (string) - The original text content
replaceText (string) - The replacement text
startPosition (number) - Start position of replacement
endPosition (number) - End position of replacement

Returns: Object with:

newText (string) - The modified text
newCursorPosition (number) - Where the cursor should be positioned

Example:

const result = TextManipulator.replaceBetween(
  'The quick brown fox',
  'red',
  10,
  15 // Replace "brown"
);
// Result: {
//   newText: "The quick red fox",
//   newCursorPosition: 13
// }

Use cases:

Replacing selected text
Updating mentions or hashtags
Text correction and editing

`findLastTrigger(text, position, triggers)`

Purpose: Find the last occurrence of trigger characters before a position.

Parameters:

text (string) - The text to search in
position (number) - The position to search before
triggers (string[]) - Array of trigger characters to look for

Returns: Object with trigger info or null:

index (number) - Position of the trigger character
character (string) - The trigger character found
query (string) - Text after the trigger character

Example:

const result = TextManipulator.findLastTrigger(
  'Hello @john and #coding',
  20, // Position after "coding"
  ['@', '#']
);
// Result: {
//   index: 12,
//   character: "#",
//   query: "coding"
// }

Use cases:

Implementing autocomplete for mentions (@)
Hashtag suggestions (#)
Emoji picker triggers (:)
Command detection (/)

🔧 Technical Implementation

Regular Expressions Used

The utility uses carefully crafted regular expressions for reliable pattern matching:

// Hashtag pattern
const hashtagRegex = /#(\w+)/g;

// Mention pattern (structured format)
const mentionRegex = /@\[([^|]+)\|([^|]+)(?:\|([^|]+))?\]/g;

// URL pattern
const urlRegex = /https?:\/\/[^\s<>"{}|\\^`[\]]+/g;

// Emoji shortcode pattern
const emojiRegex = /:([a-zA-Z0-9_+-]+):/g;

Performance Considerations

Efficient regex execution - Uses global regex with proper reset
Memory management - Processes text in single passes
Caching opportunities - Results can be cached for repeated operations
Streaming support - Can process large texts incrementally

Error Handling

The utility includes robust error handling:

// Safe text processing
if (!text) return [];

// Graceful fallbacks
try {
  // Process text
} catch (error) {
  console.warn('Text extraction failed:', error);
  return [];
}

🎨 Integration Examples

React Component Integration

import { TextExtractor, TextManipulator } from '@lib/utils/text-extraction';

function SocialPostEditor() {
  const [text, setText] = useState('');
  const [cursorPosition, setCursorPosition] = useState(0);

  const handleTextChange = (newText: string) => {
    setText(newText);

    // Extract structured content
    const extracted = TextExtractor.extractAll(newText);
    console.log('Hashtags:', extracted.hashtags);
    console.log('Mentions:', extracted.mentions);
  };

  const insertMention = (username: string, userId: string) => {
    const mentionText = `@[${username}|${userId}]`;
    const result = TextManipulator.insertAtPosition(
      text,
      mentionText,
      cursorPosition
    );

    setText(result.newText);
    setCursorPosition(result.newCursorPosition);
  };

  return (
    <textarea
      value={text}
      onChange={(e) => handleTextChange(e.target.value)}
      onSelect={(e) => setCursorPosition(e.target.selectionStart)}
    />
  );
}

API Integration

// Process user input on the server
import { TextExtractor } from '@lib/utils/text-extraction';

export async function processUserPost(content: string) {
  const extracted = TextExtractor.extractAll(content);

  // Store hashtags for discovery
  await saveHashtags(extracted.hashtags);

  // Create user notifications for mentions
  await notifyMentionedUsers(extracted.userIds);

  // Process URLs for link previews
  await generateLinkPreviews(extracted.urls);

  return {
    content,
    metadata: {
      hashtags: extracted.hashtags,
      mentions: extracted.mentions,
      urls: extracted.urls,
      emojis: extracted.emojiShortcodes
    }
  };
}

🐛 Common Issues and Solutions

Issue: "Hashtags not detected in non-English text"

Solution: The current regex uses \w+ which may not capture all Unicode characters. For international support, consider using [^\s#]+ pattern.

Issue: "Mentions with special characters break parsing"

Solution: Ensure usernames are properly encoded when creating mention format. Use URL encoding for special characters.

Issue: "URLs with query parameters get truncated"

Solution: The URL regex is designed to stop at whitespace and common delimiters. This is intentional to avoid capturing surrounding punctuation.

Issue: "Cursor position becomes incorrect after text manipulation"

Solution: Always use the newCursorPosition returned by TextManipulator methods to maintain proper cursor tracking.

🔒 Security Considerations

Input Validation

Sanitize extracted content - Always validate extracted URLs, usernames, and hashtags
Prevent injection attacks - Don't directly execute or render extracted content without sanitization
Rate limiting - Limit the number of mentions/hashtags per post to prevent spam

Content Filtering

// Example security wrapper
function secureExtractHashtags(text: string): string[] {
  const hashtags = TextExtractor.extractHashtags(text);

  return hashtags.filter((tag) => {
    // Remove potentially harmful content
    return (
      tag.length <= 50 &&
      !tag.includes('<script>') &&
      /^[a-zA-Z0-9_-]+$/.test(tag)
    );
  });
}

🚀 Best Practices

For Developers

Always validate input - Check for null/undefined text before processing
Use appropriate extraction methods - Don't use extractAll() if you only need hashtags
Cache results when possible - Extraction can be expensive for large texts
Handle edge cases - Empty strings, very long texts, malformed patterns

For Performance

Batch operations - Process multiple texts together when possible
Lazy evaluation - Only extract what you need when you need it
Memory management - Clear large result arrays when done
Streaming for large content - Process very large texts in chunks

For User Experience

Real-time feedback - Show extracted elements as user types
Visual indicators - Highlight hashtags, mentions, and URLs in the UI
Autocomplete integration - Use findLastTrigger() for smart suggestions
Error recovery - Gracefully handle malformed input

Content Parsers - Advanced content parsing and tokenization
Rich Text Parser - Markdown and rich text processing
Emoji Parser - Emoji processing and rendering
Components - Rich Input System - UI components using text extraction

🔗 External Resources

Regular Expressions Guide - MDN Web Docs
Unicode in JavaScript - Unicode handling best practices
Text Processing Performance - V8 regex optimization

The Text Extraction Utility provides the foundation for intelligent text processing in social media applications. It enables rich user experiences while maintaining performance and security.

🎯 What is Text Extraction?​

Why Text Extraction is Important​

🛠️ Core Components​

TextExtractor Class​

extractHashtags(text, includeHash)​

extractUserMentions(text)​

extractUserIds(text)​

extractUrls(text)​

extractEmojiShortcodes(text)​

extractAll(text)​

TextManipulator Class​

insertAtPosition(originalText, insertText, position)​

replaceBetween(originalText, replaceText, startPosition, endPosition)​

findLastTrigger(text, position, triggers)​

🔧 Technical Implementation​

Regular Expressions Used​

Performance Considerations​

Error Handling​

🎨 Integration Examples​

React Component Integration​

API Integration​

🐛 Common Issues and Solutions​

Issue: "Hashtags not detected in non-English text"​

Issue: "Mentions with special characters break parsing"​

Issue: "URLs with query parameters get truncated"​

Issue: "Cursor position becomes incorrect after text manipulation"​

🔒 Security Considerations​

Input Validation​

Content Filtering​

🚀 Best Practices​

For Developers​

For Performance​

For User Experience​

📚 Related Documentation​

🔗 External Resources​

🎯 What is Text Extraction?

Why Text Extraction is Important

🛠️ Core Components

TextExtractor Class

`extractHashtags(text, includeHash)`

`extractUserMentions(text)`

`extractUserIds(text)`

`extractUrls(text)`

`extractEmojiShortcodes(text)`

`extractAll(text)`

TextManipulator Class

`insertAtPosition(originalText, insertText, position)`

`replaceBetween(originalText, replaceText, startPosition, endPosition)`

`findLastTrigger(text, position, triggers)`

🔧 Technical Implementation

Regular Expressions Used

Performance Considerations

Error Handling

🎨 Integration Examples

React Component Integration

API Integration

🐛 Common Issues and Solutions

Issue: "Hashtags not detected in non-English text"

Issue: "Mentions with special characters break parsing"

Issue: "URLs with query parameters get truncated"

Issue: "Cursor position becomes incorrect after text manipulation"

🔒 Security Considerations

Input Validation

Content Filtering

🚀 Best Practices

For Developers

For Performance

For User Experience

📚 Related Documentation

🔗 External Resources