From: Venkatesh Srinivas <me@acm.jhu.edu>
Subject: GSoC Segments: What have I been doing, anyway?
Date: Mon, 16 Aug 2010 10:29:54 -0400

Hi 9fans,

So GSoC is more or less over!

First, I really need to thank David Eckhardt and Erik Quanstrom for putting
up with me this summer; dealing with me can be as frustrating as pulling
teeth with a screwdriver when a patient only speaks another language. Next
time I see either/both of them, I owe them beer/non-alcohol-beverage or
pizza [1]. Also, thanks to Devon for everything, including waiting over an
hour for me downtown.

I have been working on the segment code and making forays into libthread.
Let me first talk about what exactly I did, without talking about the
motivations.

In Plan 9 fresh-out-of-the-box, a process's address space is constructed
from a series of segments, contiguous ranges of address space backed by the
same object. By default, a process has a small number of segments: a Text
segment, backed by the image, a Stack segment, backed by anonymous memory, a
Data segment to back the heap structure, and a BSS segment for the usual
purpose. Each process also has a small series of slots, currently 4, for
other segments, obtained via the segattach() system call and released via
the segdetach() syscall. When a process calls rfork(RFPROC), segments from
the "shared" class are shared across the fork and "memory" class segments
are copy-on-write across the fork; each process gets its own stack. When a
process calls rfork(RFMEM | RFPROC), all segments except the Stack segment
are maintained across the fork. When a process calls exec(), segments marked
with SG_CEXEC are detached; the rest are
inherited across the exec(). The Stack segment can never be inherited.
Across an rfork(RFMEM | RFPROC), new segattach()es and segdetach()es are not
visible - in Ron Minnich's terminology, we have shared memory, but not
shared address spaces.

First, I modified the segment slot structures, to lift the limit on four
user segments. I made the segment array dynamic, resized in segattach(). The
first few elements of the array are as in the current system, the special
Text, Data, BSS, and Stack segments. The rest of the segment array is
address-ordered and searched by binary search. The user/system interface
doesn't change, except that the limit on segment attaches is now set by the
kernel memory allocator, rather than by a fixed per-process limit.
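
As a rough illustration of the address-ordered array described above, here
is a user-space sketch in portable C. It is not the prototype's kernel code:
the Seg and SegTab types and the segfind()/segadd() names are made up for
this example, and the special Text, Data, BSS, and Stack slots are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a kernel segment covering [base, top). */
typedef struct Seg {
	uintptr_t base;
	uintptr_t top;
} Seg;

/* Hypothetical dynamic table of user segments, kept sorted by address. */
typedef struct SegTab {
	Seg *seg;
	size_t nseg;
	size_t cap;
} SegTab;

/* Binary search: index of the first segment whose top lies above va. */
static size_t
seglower(const SegTab *t, uintptr_t va)
{
	size_t lo = 0, hi = t->nseg;

	while(lo < hi){
		size_t mid = lo + (hi - lo)/2;
		if(t->seg[mid].top <= va)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Return the segment containing va, or NULL if no segment maps it. */
Seg*
segfind(SegTab *t, uintptr_t va)
{
	size_t i = seglower(t, va);

	if(i < t->nseg && t->seg[i].base <= va)
		return &t->seg[i];
	return NULL;
}

/* Insert [base, top) in address order; -1 on overlap or allocation failure. */
int
segadd(SegTab *t, uintptr_t base, uintptr_t top)
{
	size_t i;

	assert(base < top);
	i = seglower(t, base);
	if(i < t->nseg && t->seg[i].base < top)
		return -1;	/* would overlap an existing segment */
	if(t->nseg == t->cap){
		/* grow on demand: the only limit is the allocator */
		size_t ncap = t->cap ? 2*t->cap : 8;
		Seg *s = realloc(t->seg, ncap * sizeof *s);
		if(s == NULL)
			return -1;
		t->seg = s;
		t->cap = ncap;
	}
	memmove(&t->seg[i+1], &t->seg[i], (t->nseg - i) * sizeof t->seg[0]);
	t->seg[i].base = base;
	t->seg[i].top = top;
	t->nseg++;
	return 0;
}
```

Starting from a zeroed SegTab, segadd() keeps the array sorted no matter
what order the attaches arrive in, so a fault address can be resolved with
segfind() in O(log n) comparisons.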

I further changed segattach() to add more flags:
SG_NONE:
A segment with the SG_NONE flag set does not have a backing store. Any
accesses, read or write, cause a fault. This segment flag is useful for
placing red zones at user-desired addresses. It is an error to combine the
SG_NONE and SG_COMMIT flags.

SG_COMMIT:
A segment with the SG_COMMIT flag set is fully pre-faulted and its pages are
not considered by the swapper. An SG_COMMIT segment is maintained at commit
status across an exec() and rfork(RFMEM | RFPROC). If we are unable to
satisfy pre-faults for all of the pages of the segment in segattach(), we
cancel the attach. It is an error to combine the SG_COMMIT flag with
SG_NONE.

SG_SAS:
A segment attached with the SG_SAS flag appears in the address space of all
processes related to the current one by rfork(RFPROC | RFMEM). An SG_SAS
segment will not overlap a segment in any process related via rfork(RFMEM |
RFPROC).
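
The combination rules for these flags can be sketched as a small check. The
bit values below are placeholders, not the ones in the prototype's headers,
and segflagsok() is a hypothetical helper rather than a function in the
kernel. The second test reflects the prototype-only restriction that SG_SAS
segments must also be SG_COMMIT.

```c
/* Placeholder bit values; the real ones are whatever the kernel defines. */
enum {
	SG_NONE   = 1<<0,
	SG_COMMIT = 1<<1,
	SG_SAS    = 1<<2,
};

/*
 * Return 0 if attr is an allowed combination, -1 otherwise.
 * Pass prototype != 0 to also apply the prototype kernel's
 * restriction that SG_SAS requires SG_COMMIT.
 */
int
segflagsok(int attr, int prototype)
{
	if((attr & SG_NONE) && (attr & SG_COMMIT))
		return -1;	/* no backing store cannot be pre-faulted */
	if(prototype && (attr & SG_SAS) && !(attr & SG_COMMIT))
		return -1;	/* prototype: SG_SAS faults are not handled */
	return 0;
}
```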

I finally changed libthread. Currently, libthread allocates thread stacks
via malloc()/free(). I converted libthread to allocate thread stacks via
segattach() - each thread stack consists of three segments, an anonymous
segment flanked by two SG_NONE redzones.
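
To make the three-segment layout concrete, here is a sketch that only
computes the address ranges; it does not call segattach(). The 4096-byte
page size, the one-page red zones, and the Range, StackLayout, and
stacklayout() names are assumptions for illustration, not what the prototype
libthread necessarily does.

```c
#include <assert.h>
#include <stdint.h>

enum { PGSZ = 4096 };	/* assumed page size */

typedef struct Range {
	uintptr_t base;
	uintptr_t len;
} Range;

/* Red zone, usable stack, red zone, laid out contiguously. */
typedef struct StackLayout {
	Range lo;	/* would be attached with SG_NONE */
	Range stack;	/* anonymous memory for the thread stack */
	Range hi;	/* would be attached with SG_NONE */
} StackLayout;

/* Round n up to a whole number of pages. */
static uintptr_t
pground(uintptr_t n)
{
	return (n + PGSZ - 1) & ~(uintptr_t)(PGSZ - 1);
}

/* Lay out one thread stack starting at page-aligned address va. */
StackLayout
stacklayout(uintptr_t va, uintptr_t stksize)
{
	StackLayout l;

	assert(va % PGSZ == 0);
	l.lo.base = va;
	l.lo.len = PGSZ;
	l.stack.base = l.lo.base + l.lo.len;
	l.stack.len = pground(stksize);
	l.hi.base = l.stack.base + l.stack.len;
	l.hi.len = PGSZ;
	return l;
}
```

Each of the three ranges would then be handed to its own segattach() call,
so a stack overrun or underrun lands in a red zone and faults instead of
silently scribbling on a neighbouring allocation.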

Currently I have posted a prototype (very generously called 'prototype')
implementation of the above interface to sources; the prototype kernel omits
a number of the checks claimed above. SG_SAS faults are not handled; SG_SAS
segments must be SG_COMMIT. SG_COMMIT has no limit, which makes it very easy
to crash a system by draining the page queue. The prototype libthread is of
considerably higher quality, I think, and would be usable with a
production-grade implementation of these interfaces. The prototype kernel is
usable though - I have run it alone on my terminal for approximately a
month.

However, the prototype kernel shows us that the interface can be implemented
efficiently - even when using three segattach()es per thread stack, creating
1024 threads took 2.25s real time on a 400MHz AMD K6, versus 0.87s real time
with the original libthread and 9 kernel. Creating processes with thousands
of segments is not incredibly speedy, but it is workable and there is a lot
of low-hanging fruit that can improve performance.

The SG_SAS work is fairly unusual for Plan 9 - each process originally had a
single, fixed-size segment slot array. Now, a process has a per-process
segment array and a second shared segment array. The shared array is
referenced by all processes created by rfork(RFMEM | RFPROC); the shares are
unlinked on exec() or rfork(RFPROC). The SG_SAS logic was added to match the
current semantics of thread stacks - as they are allocated by malloc() and
free() from the Data segment, they are visible across rfork(RFMEM | RFPROC);
this is as expected - a thread can pass a pointer to a stacked buffer to an
ioproc(), for example. To allow for standalone segments to be used the same
way, they needed to appear across rfork().

This interface would also support a libc memory allocator that uses
standalone segments, rather than constraining it to use sbrk() or
pre-allocated segments. This was my original motivation for this project,
though it was a problem I did not get a chance to address.

Any thoughts or discussion on the interface would rock.

Thanks,
-- vs

[1] http://undeadly.org/cgi?action=article&sid=20100808121724