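Preparation

Kernel version: 4.20.1

In the previous post, "How to Stably Saturate Disk I/O Bandwidth When Writing Files on Linux", we used mmap to write files while keeping the disk I/O bandwidth saturated. This post looks at mmap and munmap in detail and walks through the kernel source. There we only used mmap to map a file into memory, but its design and implementation reach into kernel details such as virtual memory.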

Function prototype

void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);

This is the user-space prototype of mmap. The system call interface inside the kernel lives in mm/mmap.c:

unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
                              unsigned long prot, unsigned long flags,
                              unsigned long fd, unsigned long pgoff);
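
The two signatures differ in their last argument: user space passes a byte offset, while ksys_mmap_pgoff() takes pgoff, an offset counted in pages. The sketch below illustrates that relationship from user space, assuming a Linux system with a writable /tmp. The helper names offset_to_pgoff and mmap_unaligned_errno are made up for this illustration; they are not kernel or libc functions.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: the page offset that corresponds to a byte offset,
 * or -1 if the byte offset is not a multiple of the page size. */
static long offset_to_pgoff(off_t offset)
{
    long page_size = sysconf(_SC_PAGESIZE);

    if (offset % page_size != 0)
        return -1;
    return (long)(offset / page_size);
}

/* Hypothetical helper: try to map a temporary file at byte offset 1 and
 * return the errno reported by mmap() (0 if the call succeeds). */
static int mmap_unaligned_errno(void)
{
    char path[] = "/tmp/mmap_off_XXXXXX";
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    int err = 0;
    void *p;
    int fd = mkstemp(path);

    if (fd < 0)
        return errno;
    unlink(path);                       /* the file goes away when fd is closed */
    if (ftruncate(fd, (off_t)(2 * page_size)) != 0) {
        err = errno;
        close(fd);
        return err;
    }

    /* Offset 1 is not page aligned, so there is no valid pgoff for it. */
    p = mmap(NULL, page_size, PROT_READ, MAP_PRIVATE, fd, 1);
    if (p == MAP_FAILED)
        err = errno;
    else
        munmap(p, page_size);
    close(fd);
    return err;
}
```

Calling mmap_unaligned_errno() is expected to report EINVAL, which is the error mmap(2) documents for an offset that is not a multiple of the page size.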

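Virtual memory management

We start with two data structures related to virtual memory. Plenty of material on the concept of virtual memory is available online, so here we look at it from the kernel's point of view. Virtual address space is managed per process: each process has its own virtual address space, while the "kernel space" part of it is shared by all processes. A process's virtual address space is described mainly by two data structures: mm_struct and vm_area_struct.

The Memory Descriptor

mm_struct holds all the information about a process's virtual address space. It is defined in include/linux/mm_types.h: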

struct mm_struct {
    struct {
        struct vm_area_struct *mmap;    /* list of vm_area_structs (VMAs) */
        pgd_t * pgd;                    /* the process's page global directory */

        /* ... */
        int map_count;                  /* number of VMAs */
        /* ... */
        unsigned long total_vm;         /* total number of pages mapped */
        /* ... */
        unsigned long start_code, end_code, start_data, end_data; /* start and end of the code segment and of the data segment */
        unsigned long start_brk, brk, start_stack; /* start and current end of the heap; the stack, by its nature, only has a start */
        unsigned long arg_start, arg_end, env_start, env_end; /* start and end of the argument area and of the environment area */
        /* ... */
    } __randomize_layout;
    /* ... */
};

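Reading mm_struct alongside the figure below, a typical virtual address space layout on a 32-bit system (from Computer Systems: A Programmer's Perspective), makes it easier to understand: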

virtual_address_space

Virtual Memory Area

virtual_memory

vm_area_struct describes one region of the virtual address space. A process's address space may contain many such regions. vm_area_struct is also defined in include/linux/mm_types.h:

/*
 * This struct defines a memory VMM memory area. There is one of these
 * per VM-area/task. A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
    /* The first cache line has the info for VMA tree walking. */

    unsigned long vm_start;     /* start address within the virtual address space */
    unsigned long vm_end;       /* end address within the virtual address space */

    /* linked list of VM areas per task, sorted by address */
    struct vm_area_struct *vm_next, *vm_prev;   /* next and previous VMA in the list */
    struct rb_node vm_rb;

    /*
     * Largest free memory gap in bytes to the left of this VMA.
     * Either between this VMA and vma->vm_prev, or between one of the
     * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
     * get_unmapped_area find a free area of the right size.
     */
    unsigned long rb_subtree_gap;

    /* Second cache line starts here. */

    /* Function pointers to deal with this struct. */
    const struct vm_operations_struct *vm_ops;  /* operations on this region */

    struct mm_struct *vm_mm;    /* the address space this VMA belongs to */
    pgprot_t vm_page_prot;      /* Access permissions of this VMA. */
    unsigned long vm_flags;     /* Flags, see mm.h. */
    struct file * vm_file;      /* the mapped file; NULL for an anonymous mapping */
    /* ... */
};

The figure below shows a simplified virtual memory layout of a process and how these data structures relate to each other:

process_address_space
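
The VMAs of a running process can be inspected from user space through /proc/self/maps, where each line begins with the vm_start-vm_end range of one region. The following is a minimal sketch, assuming a Linux system with /proc mounted. The helper names range_in_one_vma and anon_mapping_has_vma are invented for this example.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: return 1 if one line of /proc/self/maps (one line per
 * VMA) covers [addr, addr + len), 0 if no line does, -1 on error. */
static int range_in_one_vma(const void *addr, size_t len)
{
    unsigned long want_start = (unsigned long)addr;
    unsigned long want_end = want_start + len;
    unsigned long start, end;
    char line[512];
    int found = 0;
    FILE *fp = fopen("/proc/self/maps", "r");

    if (!fp)
        return -1;
    while (fgets(line, sizeof(line), fp)) {
        /* Each line starts with "start-end" in hexadecimal. */
        if (sscanf(line, "%lx-%lx", &start, &end) != 2)
            continue;
        if (start <= want_start && want_end <= end) {
            found = 1;
            break;
        }
    }
    fclose(fp);
    return found;
}

/* Map four anonymous pages and check that a VMA covers them. */
static int anon_mapping_has_vma(void)
{
    size_t len = 4 * (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int found;

    if (p == MAP_FAILED)
        return -1;
    found = range_in_one_vma(p, len);
    munmap(p, len);
    return found;
}
```

The check tests containment rather than an exact match on the start address, because the kernel may merge a new anonymous mapping with an adjacent VMA that has the same flags.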

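How mmap sets up a mapping

  • Check the arguments and set the vma's flags according to the requested mapping type.
  • Search the process's virtual address space for a free range that satisfies the request.
  • Initialize a vma for the range that was found.
  • Set vma->vm_file.
  • Call the mmap method from the file's file_operations, which sets vma->vm_ops to the filesystem's vm_operations_struct.
  • Insert the vma into the mm's list and red-black tree.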
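
Seen from user space, the steps above are triggered by a single mmap() call on an open file. The sketch below is an illustration only, assuming a Linux system with a writable /tmp; the function name write_via_mmap_read_via_pread is made up. It writes through a shared file mapping and reads the same bytes back with pread().

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: write "hello" through a MAP_SHARED mapping of a
 * temporary file, then read it back with pread().
 * Returns 0 if the bytes read match, -1 otherwise. */
static int write_via_mmap_read_via_pread(void)
{
    static const char msg[] = "hello";
    char path[] = "/tmp/mmap_demo_XXXXXX";
    char buf[sizeof(msg)] = {0};
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int ret = -1;
    char *p;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                       /* the file goes away when fd is closed */

    /* The file has to be long enough to back the page we are going to touch. */
    if (ftruncate(fd, (off_t)len) != 0)
        goto out;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        goto out;

    memcpy(p, msg, sizeof(msg));        /* the first access faults the page in */
    munmap(p, len);

    /* pread() sees the data because both paths go through the page cache. */
    if (pread(fd, buf, sizeof(msg), 0) == (ssize_t)sizeof(msg) &&
        memcmp(buf, msg, sizeof(msg)) == 0)
        ret = 0;
out:
    close(fd);
    return ret;
}
```

No msync() is needed for pread() to see the data here; msync() would matter if the data had to reach the disk at a specific point.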

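Source code analysis

Let's now walk through the mmap code:

do_mmap()

do_mmap() is the function that does the actual work for mmap(). We skip the system call entry and go straight to the implementation: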

unsigned long do_mmap(struct file *file, unsigned long addr,
            unsigned long len, unsigned long prot,
            unsigned long flags, vm_flags_t vm_flags,
            unsigned long pgoff, unsigned long *populate,
            struct list_head *uf)
{
    struct mm_struct *mm = current->mm; /* the current process's memory descriptor */
    int pkey = 0;

    *populate = 0;
    /*
     * The function first runs a series of checks on its arguments; if any
     * of them fails, it returns a negative errno.
     */
    if (!len)
        return -EINVAL;

    /*
     * Does the application expect PROT_READ to imply PROT_EXEC?
     *
     * (the exception is when the underlying filesystem is noexec
     *  mounted, in which case we dont add PROT_EXEC.)
     */
    if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
        if (!(file && path_noexec(&file->f_path)))
            prot |= PROT_EXEC;
    /* force arch specific MAP_FIXED handling in get_unmapped_area */
    if (flags & MAP_FIXED_NOREPLACE)
        flags |= MAP_FIXED;

    /*
     * Without MAP_FIXED, addr is only a hint and may be changed: if it is
     * below mmap_min_addr, it is set to the page-aligned mmap_min_addr.
     */
    if (!(flags & MAP_FIXED))
        addr = round_hint_to_min(addr);

    /* Careful about overflows.. */
    /* round len up to a multiple of the page size */
    len = PAGE_ALIGN(len);
    if (!len)
        return -ENOMEM;

    /* offset overflow? */
    if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
        return -EOVERFLOW;

    /* Too many mappings? */
    /* check whether the number of VMAs in this address space exceeds the limit */
    if (mm->map_count > sysctl_max_map_count)
        return -ENOMEM;

    /* Obtain the address to map to. we verify (or select) it and ensure
     * that it represents a valid section of the address space.
     */
    /* get_unmapped_area() returns the start of an unmapped range in the process's user address space */
    addr = get_unmapped_area(file, addr, len, pgoff, flags);
    /* a valid address is page aligned; otherwise addr holds an error code */
    if (offset_in_page(addr))
        return addr;

    /*
     * With MAP_FIXED_NOREPLACE, check addr against the process's address
     * space: if an existing vma overlaps the requested range, return
     * -EEXIST, as MAP_FIXED_NOREPLACE requires.
     */
    if (flags & MAP_FIXED_NOREPLACE) {
        struct vm_area_struct *vma = find_vma(mm, addr);

        if (vma && vma->vm_start < addr + len)
            return -EEXIST;
    }

    if (prot == PROT_EXEC) {
        pkey = execute_only_pkey(mm);
        if (pkey < 0)
            pkey = 0;
    }

    /* Do simple checking here so the lower-level routines won't have
     * to. we assume access permissions have been handled by the open
     * of the memory object, so we don't do any here.
     */
    vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
            mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
    /* With MAP_LOCKED the range is locked in memory, as with mlock(); check whether locking is permitted */
    if (flags & MAP_LOCKED)
        if (!can_do_mlock())
            return -EPERM;

    if (mlock_future_check(mm, vm_flags, len))
        return -EAGAIN;

    if (file) { /* file is not NULL: map a file into the virtual address space */
        struct inode *inode = file_inode(file); /* the file's inode */
        unsigned long flags_mask;

        if (!file_mmap_ok(file, inode, pgoff, len))
            return -EOVERFLOW;

        flags_mask = LEGACY_MAP_MASK | file->f_op->mmap_supported_flags;

        /*
         * ...
         * Take the file's access mode into account for the requested
         * mapping type.
         * For a shared writable mapping, check that the file was opened
         * for writing and not in append mode, and that no mandatory lock
         * is held on the file.
         * For any type of mapping, check that the file was opened for
         * reading.
         * ...
         */
    } else {
        switch (flags & MAP_TYPE) {
        case MAP_SHARED:
            if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
                return -EINVAL;
            /*
             * Ignore pgoff.
             */
            pgoff = 0;
            vm_flags |= VM_SHARED | VM_MAYSHARE;
            break;
        case MAP_PRIVATE:
            /*
             * Set pgoff according to addr for anon_vma.
             */
            pgoff = addr >> PAGE_SHIFT;
            break;
        default:
            return -EINVAL;
        }
    }
    /*
     * Set 'VM_NORESERVE' if we should not account for the
     * memory use of this mapping.
     */
    if (flags & MAP_NORESERVE) {
        /* We honor MAP_NORESERVE if allowed to overcommit */
        if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
            vm_flags |= VM_NORESERVE;

        /* hugetlb applies strict overcommit unless MAP_NORESERVE */
        if (file && is_file_hugepages(file))
            vm_flags |= VM_NORESERVE;
    }

    addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
    if (!IS_ERR_VALUE(addr) &&
        ((vm_flags & VM_LOCKED) ||
         (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
        *populate = len;
    return addr;
}
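
Two of the argument checks in do_mmap() are easy to observe from user space: a zero length, and an anonymous mapping whose flags contain neither MAP_SHARED nor MAP_PRIVATE (the default branch of the switch). This is a small sketch, assuming a Linux system; anon_mmap_errno is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: request an anonymous mapping with the given length
 * and flags, and return the errno reported by mmap(), or 0 on success
 * (in which case the mapping is released again). */
static int anon_mmap_errno(size_t len, int flags)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   flags | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return errno;
    munmap(p, len);
    return 0;
}
```

With this helper, anon_mmap_errno(0, MAP_PRIVATE) and anon_mmap_errno(4096, 0) are both expected to report EINVAL, while anon_mmap_errno(4096, MAP_PRIVATE) succeeds.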

mmap_region()

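do_mmap() runs a series of checks on the arguments passed in by the user and then derives vm_flags, the flags of the vm_area_struct, from them. mmap_region() then establishes the mapping between the virtual address range and the file: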

unsigned long mmap_region(struct file *file, unsigned long addr,
        unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
        struct list_head *uf)
{
    struct mm_struct *mm = current->mm; /* the current process's memory descriptor */
    struct vm_area_struct *vma, *prev;
    int error;
    struct rb_node **rb_link, *rb_parent;
    unsigned long charged = 0;

    /* Check against address space limit. */
    if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
        unsigned long nr_pages;

        /*
         * MAP_FIXED may remove pages of mappings that intersects with
         * requested mapping. Account for the pages it would unmap.
         */
        nr_pages = count_vma_pages_range(mm, addr, addr + len);

        if (!may_expand_vm(mm, vm_flags,
                    (len >> PAGE_SHIFT) - nr_pages))
            return -ENOMEM;
    }

    /* Clear old maps */
    /* if [addr, addr+len) overlaps existing mappings, they have to be unmapped first */
    while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,
                  &rb_parent)) {
        if (do_munmap(mm, addr, len, uf))
            return -ENOMEM;
    }

    /*
     * Private writable mapping: check memory availability
     */
    if (accountable_mapping(file, vm_flags)) {
        charged = len >> PAGE_SHIFT;
        if (security_vm_enough_memory_mm(mm, charged))
            return -ENOMEM;
        vm_flags |= VM_ACCOUNT;
    }

    /*
     * Can we just expand an old mapping?
     */
    /* check whether [addr, addr+len) can be merged into a neighbouring vma */
    vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
            NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
    if (vma) /* merged: use the merged vma and jump to out */
        goto out;

    /*
     * Determine the object being mapped and call the appropriate
     * specific mapper. the address has already been validated, but
     * not unmapped, but the maps are removed from the list.
     */
    /* allocate a new vma for this memory descriptor */
    vma = vm_area_alloc(mm);
    if (!vma) {
        error = -ENOMEM;
        goto unacct_error;
    }

    /* initialize the vma */
    vma->vm_start = addr;
    vma->vm_end = addr + len;
    vma->vm_flags = vm_flags;
    vma->vm_page_prot = vm_get_page_prot(vm_flags);
    vma->vm_pgoff = pgoff;

    if (file) { /* a file mapping was requested */
        if (vm_flags & VM_DENYWRITE) { /* the mapped file must not be written: deny_write_access(file) blocks regular writes */
            error = deny_write_access(file);
            if (error)
                goto free_vma;
        }
        if (vm_flags & VM_SHARED) { /* shared mapping, visible to other processes: account a writable mapping on the file */
            error = mapping_map_writable(file->f_mapping);
            if (error)
                goto allow_write_and_free_vma;
        }

        /* ->mmap() can change vma->vm_file, but must guarantee that
         * vma_link() below can deny write-access if VM_DENYWRITE is set
         * and map writably if VM_SHARED is set. This usually means the
         * new file must not have been exposed to user-space, yet.
         */
        vma->vm_file = get_file(file); /* increment the file's reference count and return file */
        error = call_mmap(file, vma); /* call the filesystem's mmap method, covered below */
        if (error)
            goto unmap_and_free_vma;

        /* Can addr have changed??
         *
         * Answer: Yes, several device drivers can do it in their
         *         f_op->mmap method. -DaveM
         * Bug: If addr is changed, prev, rb_link, rb_parent should
         *      be updated for vma_link()
         */
        WARN_ON_ONCE(addr != vma->vm_start);

        addr = vma->vm_start;
        vm_flags = vma->vm_flags;
    } else if (vm_flags & VM_SHARED) {
        /*
         * VM_SHARED without a file: call shmem_zero_setup(), which
         * backs the mapping with the file /dev/zero.
         */
        error = shmem_zero_setup(vma);
        if (error)
            goto free_vma;
    } else {
        /* neither a file nor VM_SHARED: a private anonymous mapping */
        vma_set_anonymous(vma);
    }

    /* insert the new vma into the mm's vma list and red-black tree */
    vma_link(mm, vma, prev, rb_link, rb_parent);
    /* Once vma denies write, undo our temporary denial count */
    if (file) {
        if (vm_flags & VM_SHARED)
            mapping_unmap_writable(file->f_mapping);
        if (vm_flags & VM_DENYWRITE)
            allow_write_access(file);
    }
    file = vma->vm_file;
out:
    perf_event_mmap(vma);
    /* update the accounting of the process's address space mm */
    vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
    if (vm_flags & VM_LOCKED) {
        if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
                    is_vm_hugetlb_page(vma) ||
                    vma == get_gate_vma(current->mm))
            vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
        else
            mm->locked_vm += (len >> PAGE_SHIFT);
    }

    if (file)
        uprobe_mmap(vma);

    /*
     * New (or expanded) vma always get soft dirty status.
     * Otherwise user-space soft-dirty page tracker won't
     * be able to distinguish situation when vma area unmapped,
     * then new mapped in-place (which must be aimed as
     * a completely new data area).
     */
    vma->vm_flags |= VM_SOFTDIRTY;

    vma_set_page_prot(vma);

    return addr;

unmap_and_free_vma:
    vma->vm_file = NULL;
    fput(file);

    /* Undo any partial mapping done by a device driver. */
    unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
    charged = 0;
    if (vm_flags & VM_SHARED)
        mapping_unmap_writable(file->f_mapping);
allow_write_and_free_vma:
    if (vm_flags & VM_DENYWRITE)
        allow_write_access(file);
free_vma:
    vm_area_free(vma);
unacct_error:
    if (charged)
        vm_unacct_memory(charged);
    return error;
}
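
The "Clear old maps" step in mmap_region() has a visible effect in user space: mapping over an existing range with MAP_FIXED discards the old mapping before the new one is installed. The sketch below assumes a Linux system; the function name remap_fixed_discards_old_page is made up for illustration.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo: write to an anonymous page, map a fresh anonymous page
 * over the same address with MAP_FIXED, and return the first byte of the
 * new mapping (-1 on error). */
static int remap_fixed_discards_old_page(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *old_map, *new_map;
    int value;

    old_map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (old_map == MAP_FAILED)
        return -1;
    old_map[0] = 'A';

    /* Same address plus MAP_FIXED: the overlapping old range is unmapped. */
    new_map = mmap(old_map, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (new_map == MAP_FAILED) {
        munmap(old_map, len);
        return -1;
    }

    /* A fresh anonymous page reads as zero, so the 'A' is gone. */
    value = (new_map == old_map) ? new_map[0] : -1;
    munmap(new_map, len);
    return value;
}
```

This silent replacement is what MAP_FIXED_NOREPLACE, handled earlier in do_mmap(), is designed to prevent.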

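mmap_region() calls call_mmap(file, vma), which invokes the mmap method that the file's filesystem registered in file->f_op. We take ext4, a commonly used filesystem, as the example:

ext4_file_mmap() is ext4's mmap method. It does very little: it updates the file's access time (file_accessed(file)) and assigns the matching operations table to vma->vm_ops:

What the three operations do:

  • .fault: handles a page fault on the mapping.
  • .map_pages: maps pages that are already in the page cache around the faulting address into the page table.
  • .page_mkwrite: called when a read-only page is about to become writable, so that the filesystem can prepare for the write.
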
static const struct vm_operations_struct ext4_file_vm_ops = {
    .fault        = ext4_filemap_fault,
    .map_pages    = filemap_map_pages,
    .page_mkwrite = ext4_page_mkwrite,
};

static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
{
    struct inode *inode = file->f_mapping->host;

    if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
        return -EIO;

    /*
     * We don't support synchronous mappings for non-DAX files. At least
     * until someone comes with a sensible use case.
     */
    if (!IS_DAX(file_inode(file)) && (vma->vm_flags & VM_SYNC))
        return -EOPNOTSUPP;

    file_accessed(file);
    if (IS_DAX(file_inode(file))) {
        vma->vm_ops = &ext4_dax_vm_ops;
        vma->vm_flags |= VM_HUGEPAGE;
    } else {
        vma->vm_ops = &ext4_file_vm_ops;
    }
    return 0;
}

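From the mmap source we can see that calling mmap() only allocates a vm_area_struct to establish the mapping between the file and virtual memory; it does not establish any mapping between virtual memory and physical memory. Linux does not allocate physical memory for the process at mmap() time. Only when the process actually accesses the address range and the data is not in physical memory does a page fault occur, and only then does Linux bring the missing page into memory. A later post will cover Linux page fault handling and demand paging.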
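
That the real work happens at fault time rather than at mmap() time can be observed from user space: a mapping longer than its file is created without complaint, and the problem only shows up when a page beyond the end of the file is touched. The following is a minimal sketch, assuming a Linux system with a writable /tmp; touch_raises_sigbus and on_sigbus are hypothetical names.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static sigjmp_buf sigbus_env;

static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(sigbus_env, 1);
}

/* Hypothetical demo: map map_pages pages of a temporary file that is only
 * file_size bytes long, then read one byte at touch_offset.
 * Returns 1 if the read raised SIGBUS, 0 if it succeeded, -1 on setup error. */
static int touch_raises_sigbus(off_t file_size, size_t map_pages, size_t touch_offset)
{
    char path[] = "/tmp/mmap_sigbus_XXXXXX";
    size_t len = map_pages * (size_t)sysconf(_SC_PAGESIZE);
    struct sigaction sa, old;
    volatile char *p;
    volatile int result = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, file_size) != 0) {
        close(fd);
        return -1;
    }

    /* mmap() succeeds even though the mapping is longer than the file. */
    p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGBUS, &sa, &old);

    if (sigsetjmp(sigbus_env, 1) == 0) {
        (void)p[touch_offset];          /* the page fault happens here */
        result = 0;
    } else {
        result = 1;                     /* we arrived here from the handler */
    }

    sigaction(SIGBUS, &old, NULL);
    munmap((void *)p, len);
    close(fd);
    return result;
}
```

For a 7000-byte file mapped over three pages, reading at offset 7500 succeeds because that byte still lies in the last page that the file partly covers, while reading at the start of the third page is expected to raise SIGBUS.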

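Summary

The usual read() first brings the file's data from disk into the kernel page cache and then copies it from kernel memory into a user-space buffer. mmap maps the file directly into the virtual address space, so the process reaches the page cache pages through the MMU's address translation and saves the kernel-to-user copy. Because writes through a shared file mapping land directly in the page cache, killing the process with kill -9 does not lose the data; the kernel still writes the dirty pages back.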

Q&A

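  • How do you handle a file whose length varies?

    RocksDB writes files through mmap. It first uses fallocate to preallocate a file of fixed length len, then maps it with mmap and advances a base pointer as the write position. Once len bytes have been written, it calls munmap. If the file is closed before len bytes have been written, it calls munmap and then uses ftruncate() to cut off the unused tail of the file.

  • memcpy() after mmap() raises SIGBUS:

    SIGBUS is raised while handling the page fault, that is, in the .fault handler of the ext4_file_vm_ops we saw earlier. do_mmap() contains the line len = PAGE_ALIGN(len), so the file is mapped with the page-aligned length derived from the len argument, and the file size is not considered at that point.
    The actual read during page fault handling does take the file length into account: if the page being accessed lies beyond the file size rounded up to a page boundary, the fault returns VM_FAULT_SIGBUS and the process receives SIGBUS.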

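    /*
     * DIV_ROUND_UP() divides and rounds up; i_size_read(inode) returns the
     * file length (inode->i_size). With 4 KiB pages and a 7000-byte file,
     * max_off is 2, i.e. 2 pages or 8192 bytes.
     */
    max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);

    /*
     * offset is the page index within the file of the address that memcpy()
     * touched; if it is at or beyond max_off, the fault returns SIGBUS.
     */
    if (unlikely(offset >= max_off))
        return VM_FAULT_SIGBUS;
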
  • memcpy() after mmap() raises SIGSEGV: (mm/memory.c:handle_mm_fault())

    if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
                        flags & FAULT_FLAG_INSTRUCTION,
                        flags & FAULT_FLAG_REMOTE))
        /*
         * The process attempted an access that is not permitted on this
         * part of the virtual address space: return SIGSEGV.
         */
        return VM_FAULT_SIGSEGV;

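  • Is mmap a silver bullet?

    No. With random writes, frequent page faults and dirty page writeback erode the advantage mmap gains by avoiding copies between kernel and user space. The figure below is the flame graph of the mmap sequential-write run from approach 3 in "How to Stably Saturate Disk I/O Bandwidth When Writing Files on Linux"; it shows more directly where mmap's bottleneck lies: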
    mmap_perf
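
Returning to the variable-length file question above, the preallocate, map, slide and truncate sequence can be sketched as follows. This only illustrates the steps described in the text and is not RocksDB's actual code: struct mmap_writer and the writer_* functions are hypothetical names, posix_fallocate() stands in for fallocate, and a single mapping window is used instead of mapping a new window when the current one fills up.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical writer state; none of these names come from RocksDB. */
struct mmap_writer {
    int fd;
    char *start;    /* start of the mapped window */
    char *base;     /* sliding write position inside the window */
    size_t len;     /* preallocated and mapped length */
};

/* Preallocate len bytes and map them. Returns 0 on success, -1 on error. */
static int writer_open(struct mmap_writer *w, int fd, size_t len)
{
    void *p;

    if (posix_fallocate(fd, 0, (off_t)len) != 0)
        return -1;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    w->fd = fd;
    w->start = p;
    w->base = p;
    w->len = len;
    return 0;
}

/* Copy n bytes at the write position and advance it. */
static int writer_append(struct mmap_writer *w, const void *data, size_t n)
{
    size_t used = (size_t)(w->base - w->start);

    if (n > w->len - used)
        return -1;      /* a real writer would map the next window here */
    memcpy(w->base, data, n);
    w->base += n;
    return 0;
}

/* Unmap, then cut the file down to the bytes actually written.
 * Returns the final file size, or -1 on error. */
static long writer_close(struct mmap_writer *w)
{
    size_t used = (size_t)(w->base - w->start);
    struct stat st;

    if (munmap(w->start, w->len) != 0)
        return -1;
    if (ftruncate(w->fd, (off_t)used) != 0)
        return -1;
    if (fstat(w->fd, &st) != 0)
        return -1;
    return (long)st.st_size;
}

/* Demo: preallocate 1 MiB in a temporary file, append 11 bytes, close. */
static long writer_demo(void)
{
    char path[] = "/tmp/mmap_writer_XXXXXX";
    struct mmap_writer w;
    long size = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (writer_open(&w, fd, (size_t)1 << 20) == 0) {
        int ok = writer_append(&w, "hello ", 6) == 0 &&
                 writer_append(&w, "world", 5) == 0;
        long final_size = writer_close(&w);

        if (ok)
            size = final_size;
    }
    close(fd);
    return size;
}
```

After writer_demo() the file is 11 bytes long rather than 1 MiB, because the final ftruncate() removes the preallocated tail that was never written.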