Regular Expression Extensions - wangshengliang

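ES6 and later versions added many enhancements to regular expressions, including Unicode support, sticky matching, named capture groups, and lookbehind assertions.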

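u flag

Unicode mode

The u flag enables Unicode mode, which treats characters outside the Basic Multilingual Plane (code points above U+FFFF, stored as two UTF-16 code units) as single characters:

// Without u, the match fails
;/^.$/.test('𠮷') // false (treated as two characters)

// With u, the match succeeds
;/^.$/u.test('𠮷') // true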
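
The u flag affects more than the dot: quantifiers and string length behave differently too. The following is a minimal sketch, assuming an ES2018+ runtime (for the s flag); `countCodePoints` is a hypothetical helper name introduced here for illustration only.

```javascript
const s = '𠮷𠮷'

// Without u, the quantifier {2} applies only to the trailing surrogate.
const withoutU = /^𠮷{2}$/.test(s)
// With u, the quantifier applies to the whole code point.
const withU = /^𠮷{2}$/u.test(s)

// Hypothetical helper: count code points rather than UTF-16 code units.
function countCodePoints(str) {
  const m = str.match(/./gsu)
  return m === null ? 0 : m.length
}

console.log(withoutU, withU) // false true
console.log(s.length, countCodePoints(s)) // 4 2
```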

Unicode property escapes

\p{...} matches characters by Unicode property (requires the u flag):

// Match any letter
;/\p{Letter}/u.test('中') // true
;/\p{Letter}/u.test('A') // true
;/\p{Letter}/u.test('1') // false

// Match Han characters
;/\p{Script=Han}/u.test('中') // true
;/\p{Script=Han}/u.test('あ') // false

// Match emoji
;/\p{Emoji}/u.test('😀') // true

// Negated match
;/\P{Letter}/u.test('1') // true (not a letter)

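Commonly used Unicode properties:

Property	Description
`\p{Letter}`	Any letter
`\p{Number}`	Any numeric character
`\p{Punctuation}`	Punctuation
`\p{Script=Han}`	Han characters (Chinese hanzi, also Japanese kanji)
`\p{Script=Hiragana}`	Hiragana
`\p{Emoji}`	Emoji (this property also includes the digits 0-9, # and *)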
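
As an illustration of how these properties combine in character classes, here is a short sketch. The helper names `extractWords` and `onlyHan` are hypothetical and exist only for this example; the short aliases `\p{L}` and `\p{N}` could be used in place of `\p{Letter}` and `\p{Number}`.

```javascript
// Hypothetical helper: split text into runs of letters or digits from any script.
function extractWords(text) {
  return text.match(/[\p{Letter}\p{Number}]+/gu) ?? []
}

// Hypothetical helper: keep only the Han characters of a string.
function onlyHan(text) {
  return text.replace(/\P{Script=Han}/gu, '')
}

console.log(extractWords('Hello, 世界! Привет 42')) // ['Hello', '世界', 'Привет', '42']
console.log(onlyHan('ES6 正则 表达式!')) // '正则表达式'
```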

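Code point escapes

The \u{...} syntax also works inside regular expressions when the u flag is set:

// ES5 style (limited to surrogate pairs)
;/\uD842\uDFB7/.test('𠮷') // true, but cumbersome

// ES6 style
;/\u{20BB7}/u.test('𠮷') // true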
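
Forgetting the u flag here does not raise an error; the pattern silently means something else. The sketch below shows that behavior and a code point range inside a character class. The variable name `emoticons` is illustrative only.

```javascript
// Without u, \u{3} is read as the letter "u" followed by the quantifier {3}.
console.log(/^\u{3}$/.test('uuu')) // true
// With u, \u{3} is the code point U+0003.
console.log(/^\u{3}$/u.test('uuu')) // false
console.log(/^\u{3}$/u.test('\u0003')) // true

// Code point escapes can express ranges above U+FFFF inside a character class.
const emoticons = /^[\u{1F600}-\u{1F64F}]$/u
console.log(emoticons.test('😀')) // true
console.log(emoticons.test('A')) // false
```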

y flag

Sticky matching

The y (sticky) flag requires the match to start exactly at lastIndex:

const str = 'aaa_aa_a'

// g flag: global search, may skip over non-matching characters
const regG = /a+/g
regG.exec(str) // ['aaa'], lastIndex = 3
regG.exec(str) // ['aa'], lastIndex = 6
regG.exec(str) // ['a'], lastIndex = 8

// y flag: must match at the current position
const regY = /a+/y
regY.exec(str) // ['aaa'], lastIndex = 3
regY.exec(str) // null (position 3 is '_', no match; lastIndex resets to 0)
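
Because a sticky regex is driven entirely by lastIndex, it can be used to test for a match at one specific offset. The following is a sketch under that assumption; `matchAt` is a hypothetical helper, not a built-in.

```javascript
// Hypothetical helper: try to match `pattern` exactly at `pos`, without scanning ahead.
function matchAt(pattern, str, pos) {
  // Build a sticky copy so the caller's regex object is not mutated.
  const flags = pattern.flags.includes('y') ? pattern.flags : pattern.flags + 'y'
  const sticky = new RegExp(pattern.source, flags)
  sticky.lastIndex = pos
  const m = sticky.exec(str)
  return m === null ? null : m[0]
}

const str = 'aaa_aa_a'
console.log(matchAt(/a+/, str, 0)) // 'aaa'
console.log(matchAt(/a+/, str, 3)) // null
console.log(matchAt(/a+/, str, 4)) // 'aa'

// A failed sticky match resets lastIndex to 0.
const regY = /a+/y
regY.lastIndex = 3
regY.exec(str)
console.log(regY.lastIndex, regY.sticky) // 0 true
```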

Practical use

Lexical analysis (tokenizer):

const TOKEN_TYPES = [
  { type: 'NUMBER', pattern: /\d+/y },
  { type: 'PLUS', pattern: /\+/y },
  { type: 'MINUS', pattern: /-/y },
  { type: 'SPACE', pattern: /\s+/y },
]

function tokenize(input) {
  const tokens = []
  let pos = 0

  while (pos < input.length) {
    let matched = false

    for (const { type, pattern } of TOKEN_TYPES) {
      pattern.lastIndex = pos
      const match = pattern.exec(input)

      if (match) {
        if (type !== 'SPACE') {
          tokens.push({ type, value: match[0] })
        }
        pos = pattern.lastIndex
        matched = true
        break
      }
    }

    if (!matched) {
      throw new Error(`Unexpected character at position ${pos}`)
    }
  }

  return tokens
}

tokenize('12 + 34 - 5')
// [
//   { type: 'NUMBER', value: '12' },
//   { type: 'PLUS', value: '+' },
//   { type: 'NUMBER', value: '34' },
//   { type: 'MINUS', value: '-' },
//   { type: 'NUMBER', value: '5' }
// ]
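
One possible variation, shown here as a sketch rather than a recommendation, combines the sticky flag with named capture groups so that a single regex replaces the list of patterns. The names `TOKEN` and `tokenizeWithGroups` are hypothetical.

```javascript
// One sticky regex with a named group per token type (names are illustrative).
const TOKEN = /(?<NUMBER>\d+)|(?<PLUS>\+)|(?<MINUS>-)|(?<SPACE>\s+)/y

function tokenizeWithGroups(input) {
  const tokens = []
  TOKEN.lastIndex = 0

  while (TOKEN.lastIndex < input.length) {
    const pos = TOKEN.lastIndex
    const match = TOKEN.exec(input)
    if (match === null) {
      throw new Error(`Unexpected character at position ${pos}`)
    }
    // Exactly one named group is defined for each successful match.
    const [type, value] = Object.entries(match.groups).find(([, v]) => v !== undefined)
    if (type !== 'SPACE') tokens.push({ type, value })
  }

  return tokens
}

console.log(tokenizeWithGroups('12 + 34 - 5'))
```

The trade-off is that the order of alternatives in the regex now decides priority, which the array-based version expresses through element order.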

s flag

dotAll mode

By default, . does not match line terminators. The s flag makes . match any character, line terminators included:

const str = 'Hello\nWorld'

// Without s
;/Hello.World/.test(str) // false

// With s
;/Hello.World/s.test(str) // true

// Check whether dotAll is enabled
;/./s.dotAll // true

Handling multi-line text:

const html = `<div>
  <p>内容</p>
</div>`

// Match the div tag and its content
;/<div>.*<\/div>/s.test(html) // true

// Without s there is no match
;/<div>.*<\/div>/.test(html) // false
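
Once . can cross line boundaries, the choice between greedy and lazy quantifiers matters more. The sketch below uses an invented `source` string with two block comments to compare the two, alongside the `[\s\S]` idiom commonly used before ES2018.

```javascript
const source = `/* first
comment */
let x = 1 /* second */`

// Lazy quantifier plus s: each comment is matched separately.
const lazy = source.match(/\/\*.*?\*\//gs)
// Greedy quantifier plus s: one match from the first /* to the last */.
const greedy = source.match(/\/\*.*\*\//gs)
// Pre-ES2018 idiom: [\s\S] matches any character without the s flag.
const legacy = source.match(/\/\*[\s\S]*?\*\//g)

console.log(lazy.length, greedy.length, legacy.length) // 2 1 2
```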

Named capture groups

Basic syntax

Use (?<name>...) to give a capture group a name:

const dateReg = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
const match = dateReg.exec('2024-01-15')

console.log(match.groups.year) // '2024'
console.log(match.groups.month) // '01'
console.log(match.groups.day) // '15'

// With destructuring
const {
  groups: { year, month, day },
} = dateReg.exec('2024-01-15')
console.log(year, month, day) // 2024 01 15
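
Two details are easy to miss: exec returns null when nothing matches, so destructuring its result directly throws a TypeError, and a named group that did not take part in the match is undefined. The following sketch handles both; `dateTimeReg` and `parseDateTime` are hypothetical names, and the optional time suffix is an assumption added for illustration.

```javascript
// Hypothetical pattern: 'YYYY-MM-DD' with an optional 'THH:mm' suffix.
const dateTimeReg = /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})(?:T(?<hour>\d{2}):(?<minute>\d{2}))?$/

function parseDateTime(text) {
  const match = dateTimeReg.exec(text)
  if (match === null) return null // no match, so there is no groups object to read

  const { year, month, day, hour, minute } = match.groups
  return {
    year: Number(year),
    month: Number(month),
    day: Number(day),
    // Groups that did not participate in the match are undefined.
    hour: hour === undefined ? null : Number(hour),
    minute: minute === undefined ? null : Number(minute),
  }
}

console.log(parseDateTime('2024-01-15')) // { year: 2024, month: 1, day: 15, hour: null, minute: null }
console.log(parseDateTime('2024-01-15T09:30')) // { year: 2024, month: 1, day: 15, hour: 9, minute: 30 }
console.log(parseDateTime('not a date')) // null
```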

Backreferences

Use \k<name> to refer back to a named capture group:

// Match a repeated word
const reg = /\b(?<word>\w+)\s+\k<word>\b/
reg.test('hello hello') // true
reg.test('hello world') // false

// Match quoted content (opening and closing quotes must be the same type)
const quoteReg = /(?<quote>['"]).*?\k<quote>/
quoteReg.test('"hello"') // true
quoteReg.test("'world'") // true
quoteReg.test('"mixed\'') // false
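
The repeated-word pattern can be extended to report every occurrence and its position. This is a small sketch assuming ES2020 matchAll is available; `findRepeatedWords` is a hypothetical helper. With the i flag, the backreference also compares case-insensitively.

```javascript
// Hypothetical helper: list words that appear twice in a row, ignoring case.
function findRepeatedWords(text) {
  const reg = /\b(?<word>\w+)\s+\k<word>\b/gi
  return [...text.matchAll(reg)].map((m) => ({ word: m.groups.word, index: m.index }))
}

console.log(findRepeatedWords('This is is a test of the The parser'))
// [ { word: 'is', index: 5 }, { word: 'the', index: 21 } ]
```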

Using named groups in replace

const dateStr = '2024-01-15'

// Refer to named groups in the replacement string
const result = dateStr.replace(
  /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/,
  '$<year>年$<month>月$<day>日'
)
console.log(result) // '2024年01月15日'

// Function form
dateStr.replace(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/, (...args) => {
  const groups = args.at(-1) // the last argument is the groups object
  return `${groups.year}/${groups.month}/${groups.day}`
})
// '2024/01/15'

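Lookbehind assertions

ES2018 added lookbehind assertions, the mirror image of lookahead assertions:

Syntax comparison

Type	Syntax	Meaning
Positive lookahead	`(?=...)`	Followed by ...
Negative lookahead	`(?!...)`	Not followed by ...
Positive lookbehind	`(?<=...)`	Preceded by ...
Negative lookbehind	`(?<!...)`	Not preceded by ...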

Usage examples

// Positive lookahead: digits followed by 元
;/\d+(?=元)/.exec('100元') // ['100']

// Negative lookahead: digits not followed by 元
;/\d+(?!元)/.exec('100美元') // ['100']

// Positive lookbehind: digits preceded by $
;/(?<=\$)\d+/.exec('$100') // ['100']

// Negative lookbehind: digits not preceded by $
;/(?<!\$)\d+/.exec('€100') // ['100']
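
Negative assertions interact with backtracking in a way that can surprise: a greedy \d+ may give up a digit so that the assertion succeeds. The sketch below shows the effect and one possible way to guard against it; the name `notYuan` is illustrative only.

```javascript
// Backtracking lets \d+ drop the last digit so that the lookahead succeeds.
console.log(/\d+(?!元)/.exec('100元')[0]) // '10'

// Also excluding digits in the lookahead forces the whole number to be considered.
const notYuan = /\d+(?![\d元])/
console.log(notYuan.exec('100元')) // null
console.log(notYuan.exec('100美元')[0]) // '100'
```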

Practical use:

// Extract prices (preceded by ￥)
const priceReg = /(?<=￥)\d+(\.\d{2})?/g
'商品A ￥99.00 商品B ￥199.50'.match(priceReg)
// ['99.00', '199.50']

// Password masking (keep the first 3 and last 4 characters)
function maskPassword(pwd) {
  return pwd.replace(/(?<=.{3}).(?=.{4})/g, '*')
}
maskPassword('12345678') // '123*5678'

// Extract the content of a tag
const tagReg = /(?<=<title>).*?(?=<\/title>)/
tagReg.exec('<title>Hello World</title>')[0] // 'Hello World'
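
Lookbehind and lookahead can also be combined in a pattern that matches positions rather than characters. As a further sketch, the hypothetical helper `addThousandsSeparators` below inserts commas into the integer part of a numeric string; it relies on JavaScript allowing variable-length lookbehind such as `(?<!\.\d*)`.

```javascript
// Hypothetical helper: insert thousands separators into the integer part only.
function addThousandsSeparators(numText) {
  // (?<!\.\d*)            the position is not somewhere after a decimal point
  // (?<=\d)               a digit sits immediately before the position
  // (?=(?:\d{3})+(?!\d))  whole groups of three digits follow, then a non-digit or the end
  return numText.replace(/(?<!\.\d*)(?<=\d)(?=(?:\d{3})+(?!\d))/g, ',')
}

console.log(addThousandsSeparators('1234567')) // '1,234,567'
console.log(addThousandsSeparators('1234.56789')) // '1,234.56789'
console.log(addThousandsSeparators('999')) // '999'
```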

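d flag

Match indices

ES2022 added the d flag, which exposes the start and end index of each match and capture group:

const reg = /(?<name>\w+)/d
const match = reg.exec('hello world')

console.log(match.indices[0]) // [0, 5] (position of the whole match)
console.log(match.indices[1]) // [0, 5] (position of the first capture group)
console.log(match.indices.groups.name) // [0, 5] (position of the named group)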
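
The indices are most useful when a capture group sits somewhere inside the match, since match.index only gives the start of the whole match. The following sketch marks a named group with carets; `underlineGroup` is a hypothetical helper and the sample input is invented.

```javascript
// Hypothetical helper: underline a named group inside the original text.
function underlineGroup(text, reg, groupName) {
  const match = reg.exec(text)
  if (match === null || match.indices === undefined) return null // no match, or no d flag
  const range = match.indices.groups?.[groupName]
  if (range === undefined) return null // group missing or did not participate
  const [start, end] = range
  return `${text}\n${' '.repeat(start)}${'^'.repeat(end - start)}`
}

console.log(underlineGroup('port=8080;', /port=(?<port>\d+)/d, 'port'))
// port=8080;
//      ^^^^
```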

Practical use:

// Highlight search results
function highlightMatches(text, pattern) {
  const reg = new RegExp(pattern, 'gd')
  const result = []
  let lastIndex = 0
  let match

  while ((match = reg.exec(text)) !== null) {
    const [start, end] = match.indices[0]
    if (start === end) {
      // Skip empty matches and advance manually, otherwise exec would loop forever
      reg.lastIndex++
      continue
    }
    result.push(text.slice(lastIndex, start))
    result.push(`<mark>${text.slice(start, end)}</mark>`)
    lastIndex = end
  }

  result.push(text.slice(lastIndex))
  return result.join('')
}

highlightMatches('hello world hello', 'hello')
// '<mark>hello</mark> world <mark>hello</mark>'

String method enhancements

matchAll()

Added in ES2020; returns an iterator over all matches (the regex must have the g flag):

const str = 'test1test2test3'
const reg = /t(e)(st(\d))/g

// The traditional approach calls exec in a loop
// ES2020 offers matchAll instead
for (const match of str.matchAll(reg)) {
  console.log(match[0]) // test1, test2, test3
  console.log(match[1]) // e, e, e
  console.log(match.index) // 0, 5, 10
}

// Convert to an array
const matches = [...str.matchAll(reg)]
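
Two properties of matchAll differ from an exec loop: it works on an internal copy of the regex, so the original lastIndex is left alone, and it throws a TypeError when the regex lacks the g flag. A brief sketch (the variable names `digits` and `errorName` are illustrative):

```javascript
const str = 'test1test2test3'
const reg = /t(e)(st(\d))/g

// matchAll works on an internal copy, so the original lastIndex is untouched.
const digits = Array.from(str.matchAll(reg), (m) => m[3])
console.log(digits, reg.lastIndex) // ['1', '2', '3'] 0

// A regex without the g flag is rejected.
let errorName = null
try {
  str.matchAll(/t(e)(st(\d))/)
} catch (err) {
  errorName = err.name
}
console.log(errorName) // 'TypeError'
```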

Practical techniques

Parsing URL query parameters

function parseQuery(url) {
  const reg = /[?&](?<key>[^=&]+)=(?<value>[^&]*)/g
  const result = {}

  for (const match of url.matchAll(reg)) {
    const { key, value } = match.groups
    result[key] = decodeURIComponent(value)
  }

  return result
}

parseQuery('https://example.com?name=张三&age=25')
// { name: '张三', age: '25' }

Validating Chinese names

const chineseNameReg = /^[\p{Script=Han}]{2,4}$/u

chineseNameReg.test('张三') // true
chineseNameReg.test('欧阳娜娜') // true
chineseNameReg.test('张') // false (too short)
chineseNameReg.test('ABC') // false
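
For comparison, a frequently seen alternative is the hard-coded range `[\u4e00-\u9fa5]`, which only covers part of the Basic Multilingual Plane. The sketch below contrasts it with the property-based pattern, using a name that contains '𠮷' (U+20BB7); `legacyReg` is a hypothetical name for the range-based version.

```javascript
const legacyReg = /^[\u4e00-\u9fa5]{2,4}$/
const chineseNameReg = /^\p{Script=Han}{2,4}$/u

console.log(legacyReg.test('张三'), chineseNameReg.test('张三')) // true true
// '𠮷' (U+20BB7) lies outside the hard-coded range, but its script is still Han.
console.log(legacyReg.test('𠮷田'), chineseNameReg.test('𠮷田')) // false true
```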

Formatting phone numbers

function formatPhone(phone) {
  return phone.replace(/(\d{3})(\d{4})(\d{4})/, '$1-$2-$3')
}

formatPhone('13812345678') // '138-1234-5678'

// Named groups make the intent clearer
function formatPhoneNamed(phone) {
  return phone.replace(
    /(?<area>\d{3})(?<middle>\d{4})(?<last>\d{4})/,
    '$<area>-$<middle>-$<last>'
  )
}

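Flag summary

Flag	Name	Description	Version
`g`	global	Global matching	ES3
`i`	ignoreCase	Case-insensitive matching	ES3
`m`	multiline	Multi-line mode	ES3
`u`	unicode	Unicode mode	ES6
`y`	sticky	Sticky matching	ES6
`s`	dotAll	Lets . match line terminators	ES2018
`d`	hasIndices	Returns match indices	ES2022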
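
Each name in the table is also a read-only boolean property on the regex object, and the flags property reports all active flags as a string. A short sketch, assuming an ES2022 runtime (the `enabled` object is illustrative only):

```javascript
const reg = /a.b/yusgdim

// flags reports the active flags in a fixed order, not the order they were written in.
console.log(reg.flags) // 'dgimsuy'

// Each flag is also exposed as a read-only boolean property.
const enabled = {
  hasIndices: reg.hasIndices,
  global: reg.global,
  ignoreCase: reg.ignoreCase,
  multiline: reg.multiline,
  dotAll: reg.dotAll,
  unicode: reg.unicode,
  sticky: reg.sticky,
}
console.log(Object.values(enabled).every(Boolean)) // true

// A regex literal with no flags reports an empty string.
console.log(/a.b/.flags === '') // true
```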

These extensions make text processing more capable and convenient, especially for internationalized content and complex pattern matching.