[Import-SIG] Rough PEP: A ModuleSpec Type for the Import System

Discussion:

Eric Snow

2013-08-09 06:34:34 UTC

This is an outgrowth of discussions on the .ref PEP, but it's also
something I've been thinking about for over a year and started toying with
at the last PyCon. I have a patch that passes all but a couple of unit tests
and should pass those too when I get a minute to take another pass at it.
I'll probably end up adding a bunch more unit tests before I'm done as
well. However, the functionality is mostly there.

BTW, I gotta say, Brett, I have a renewed appreciation for the long and
hard effort you put into importlib. There are just so many odd corner
cases that I never would have looked for if not for that library. And
those unit tests do a great job of covering all of that. Thanks!

-eric

-------------------------------------------------------------------------------

PEP: 4XX
Title: A ModuleSpec Type for the Import System
Version: $Revision$
Last-Modified: $Date$
Author: Eric Snow <ericsnowcurrently at gmail.com>
BDFL-Delegate: ???
Discussions-To: import-sig at python.org
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 8-Aug-2013
Python-Version: 3.4
Post-History: 8-Aug-2013
Resolution:

Abstract
========

This PEP proposes to add a new class to ``importlib.machinery`` called
``ModuleSpec``. It will contain all the import-related information
about a module without needing to load the module first. Finders will
now return a module's spec rather than a loader. The import system will
use the spec to load the module.

Motivation
==========

The import system has evolved over the lifetime of Python. In late 2002
PEP 302 introduced standardized import hooks via ``finders`` and
``loaders`` and ``sys.meta_path``. The ``importlib`` module, introduced
with Python 3.1, now exposes a pure Python implementation of the APIs
described by PEP 302, as well as of the full import system. It is now
much easier to understand and extend the import system. While a benefit
to the Python community, this greater accessibility also presents a
challenge.

As more developers come to understand and customize the import system,
any weaknesses in the finder and loader APIs will be more impactful. So
the sooner we can address any such weaknesses in the import system, the
better...and there are a couple we can take care of with this proposal.

Firstly, any time the import system needs to save information about a
module we end up with more attributes on module objects that are
generally only meaningful to the import system and occasionally to some
people. It would be nice to have a per-module namespace to put future
import-related information. Secondly, there's an API void between
finders and loaders that causes undue complexity when encountered.

Finders are strictly responsible for providing the loader which the
import system will use to load the module. The loader is then
responsible for doing some checks, creating the module object, setting
import-related attributes, "installing" the module to ``sys.modules``,
and loading the module, along with some cleanup. This all takes place
during the import system's call to ``Loader.load_module()``. Loaders
also provide some APIs for accessing data associated with a module.

Loaders are not required to provide any of the functionality of
``load_module()`` through other methods. Thus, though the import-
related information about a module is likely available without loading
the module, it is not otherwise exposed.

Furthermore, the requirements associated with ``load_module()`` are
common to all loaders and mostly are implemented in exactly the same
way. This means every loader has to duplicate the same boilerplate
code. ``importlib.util`` provides some tools that help with this, but
it would be more helpful if the import system simply took charge of
these responsibilities. The trouble is that this would limit the degree
of customization that ``load_module()`` facilitates. This is a gap
between finders and loaders which this proposal aims to fill.
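
To make the duplicated boilerplate concrete, here is a rough sketch of
what a PEP 302-style ``load_module()`` typically has to do.
``StringLoader`` and its ``sources`` mapping are hypothetical names used
only for illustration; real loaders differ in detail.

```python
import sys
import types


class StringLoader:
    """Hypothetical PEP 302-style loader that loads modules from source strings."""

    def __init__(self, sources):
        self.sources = sources  # maps module name -> source text

    def load_module(self, fullname):
        # Reuse an existing module so a reload keeps the same object.
        is_reload = fullname in sys.modules
        module = sys.modules.get(fullname)
        if module is None:
            module = types.ModuleType(fullname)
            sys.modules[fullname] = module  # install before executing
        # Set the import-related attributes.
        module.__loader__ = self
        module.__package__ = fullname.rpartition('.')[0]
        module.__file__ = '<string:%s>' % fullname
        try:
            # The only loader-specific step: populate the namespace.
            exec(self.sources[fullname], module.__dict__)
        except BaseException:
            # Clean up a failed first-time load, but leave reloads in place.
            if not is_reload:
                del sys.modules[fullname]
            raise
        return module


loader = StringLoader({'string_demo': 'x = 1'})
demo = loader.load_module('string_demo')
print(demo.x, sys.modules['string_demo'] is demo)
```

Everything in ``load_module()`` other than the ``exec()`` call is the
kind of code that each loader currently repeats.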

Finally, when the import system calls a finder's ``find_module()``, the
finder makes use of a variety of information about the module that is
useful outside the context of the method. Currently the options are
limited for persisting that per-module information past the method call,
since it only returns the loader. The choices are to store it in a
module-to-info mapping somewhere (such as on the finder itself) or to
store it on the loader.
Unfortunately, loaders are not required to be module-specific. On top
of that, some of the useful information finders could provide is
common to all finders, so ideally the import system could take care of
that. This is the same gap as before between finders and loaders.

As an example of complexity attributable to this flaw, the
implementation of namespace packages in Python 3.3 (see PEP 420) added
``FileFinder.find_loader()`` because there was no good way for
``find_module()`` to provide the namespace path.

The answer to this gap is a ``ModuleSpec`` object that contains the
per-module information and takes care of the boilerplate functionality
of loading the module.

(The idea grew feet during discussions related to another PEP.[1])

Specification
=============

ModuleSpec
----------

A new class which defines the import-related values to use when loading
the module. It closely corresponds to the import-related attributes of
module objects. ``ModuleSpec`` objects may also be used by finders and
loaders and other import-related APIs to hold extra import-related
information about the module. This greatly reduces the need to add any
new import-related attributes to module objects.

Attributes:

* ``name`` - the module's name (compare to ``__name__``).
* ``loader`` - the loader to use during loading and for module data
(compare to ``__loader__``).
* ``package`` - the name of the module's parent (compare to
``__package__``).
* ``is_package`` - whether or not the module is a package.
* ``origin`` - the location from which the module originates.
* ``filename`` - like origin, but limited to a path-based location
(compare to ``__file__``).
* ``cached`` - the location where the compiled module should be stored
(compare to ``__cached__``).
* ``path`` - the list of path entries in which to search for submodules
or ``None``. (compare to ``__path__``). It should be in sync with
``is_package``.

Those are also the parameters to ``ModuleSpec.__init__()``, in that
order. The last three are optional. When passed, the values are taken
as-is. The ``from_loader()`` method offers calculated values.

Methods:

* ``from_loader(cls, ...)`` - returns a new ``ModuleSpec`` derived from the
arguments. The parameters are the same as with ``__init__``, except
``package`` is excluded and only ``name`` and ``loader`` are required.
* ``module_repr()`` - returns a repr for the module.
* ``init_module_attrs(module)`` - sets the module's import-related
attributes.
* ``load(module=None, *, is_reload=False)`` - calls the loader's
``exec_module()``, falling back to ``load_module()`` if necessary.
This method performs the former responsibilities of loaders for
managing modules before actually loading and for cleaning up. The
reload case is facilitated by the ``module`` and ``is_reload``
parameters.
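
As a rough illustration of how these pieces could fit together, here is
a toy stand-in for the class. It is a sketch under assumptions, not the
reference patch: ``SketchSpec`` and ``ConstantLoader`` are hypothetical
names, and the handling of ``is_reload`` is only one plausible reading
of the description above.

```python
import sys
import types


class SketchSpec:
    """Toy stand-in for the proposed ModuleSpec; not the reference patch."""

    def __init__(self, name, loader, package, is_package, origin,
                 filename=None, cached=None, path=None):
        self.name = name
        self.loader = loader
        self.package = package
        self.is_package = is_package
        self.origin = origin
        self.filename = filename
        self.cached = cached
        self.path = path

    def init_module_attrs(self, module):
        # Copy the spec's values onto the matching module attributes.
        module.__name__ = self.name
        module.__loader__ = self.loader
        module.__package__ = self.package
        module.__spec__ = self
        if self.filename is not None:
            module.__file__ = self.filename
        if self.cached is not None:
            module.__cached__ = self.cached
        if self.path is not None:
            module.__path__ = self.path

    def load(self, module=None, *, is_reload=False):
        if not hasattr(self.loader, 'exec_module'):
            # Legacy loader: it still does all of the work itself.
            return self.loader.load_module(self.name)
        if module is None:
            module = types.ModuleType(self.name)
        self.init_module_attrs(module)
        sys.modules[self.name] = module
        try:
            self.loader.exec_module(module)
        except BaseException:
            # A failed first load must not leave a half-built module behind.
            if not is_reload:
                sys.modules.pop(self.name, None)
            raise
        return module


class ConstantLoader:
    """Hypothetical loader whose only job is to populate the namespace."""

    def exec_module(self, module):
        module.answer = 42


spec = SketchSpec('sketch_demo', ConstantLoader(), '', False, 'nowhere')
module = spec.load()
print(module.answer, module.__spec__ is spec)
```

Note that the loader in this sketch contains no ``sys.modules``
handling at all; that bookkeeping lives in ``load()``.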

Values Derived by from_loader()
-------------------------------

As implied above, ``from_loader()`` makes a best effort at calculating
any of the values that are not passed in. It duplicates the behavior
that was formerly provided by several ``importlib.util`` functions as
well as the ``init_module_attrs()`` method of several of ``importlib``'s
loaders. Just to be clear, here is a more detailed description of those
calculations:

``is_package`` is derived from ``path``, if passed. Otherwise the
loader's ``is_package()`` is tried. Finally, it defaults to False.

``filename`` is pulled from the loader's ``get_filename()``, if
possible.

``path`` is set to an empty list if ``is_package`` is true, and the
directory from ``filename`` is appended to it, if available.

``cached`` is derived from ``filename`` if it's available.

``origin`` is set to ``filename``.

``package`` is set to ``name`` if the module is a package and
to ``name.rpartition('.')[0]`` otherwise. Consequently, a
top-level module will have ``package`` set to the empty string.
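
The derivations above can be sketched as a plain function.
``derive_spec_values()`` and ``FakeLoader`` are hypothetical names for
illustration only, and the error handling is an assumption rather than
something this PEP specifies.

```python
import importlib.util
import os.path


def derive_spec_values(name, loader, *, is_package=None, filename=None,
                       cached=None, path=None, origin=None):
    """Hypothetical helper mirroring the derivations described above."""
    if is_package is None:
        if path is not None:
            is_package = True  # an explicit path implies a package
        elif hasattr(loader, 'is_package'):
            try:
                is_package = loader.is_package(name)
            except ImportError:
                is_package = False
        else:
            is_package = False
    if filename is None and hasattr(loader, 'get_filename'):
        try:
            filename = loader.get_filename(name)
        except ImportError:
            filename = None
    if path is None and is_package:
        path = []
        if filename is not None:
            path.append(os.path.dirname(filename))
    if cached is None and filename is not None:
        try:
            cached = importlib.util.cache_from_source(filename)
        except NotImplementedError:
            cached = None  # no cache tag on this implementation
    if origin is None:
        origin = filename
    package = name if is_package else name.rpartition('.')[0]
    return {'name': name, 'loader': loader, 'package': package,
            'is_package': is_package, 'origin': origin,
            'filename': filename, 'cached': cached, 'path': path}


class FakeLoader:
    """Hypothetical loader exposing only the two optional query methods."""

    def is_package(self, name):
        return name == 'pkg'

    def get_filename(self, name):
        if name == 'pkg':
            return '/src/pkg/__init__.py'
        return '/src/pkg/mod.py'


pkg_values = derive_spec_values('pkg', FakeLoader())
mod_values = derive_spec_values('pkg.mod', FakeLoader())
top_values = derive_spec_values('top', object())
print(pkg_values['package'], mod_values['package'], repr(top_values['package']))
```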

Backward Compatibility
----------------------

Since finder ``find_module()`` methods would now return a module spec
instead of a loader, specs must act like the loader that would have been
returned instead. This is relatively simple to solve since the loader
is available as an attribute of the spec.

However, ``ModuleSpec.is_package`` (an attribute) conflicts with
``InspectLoader.is_package()`` (a method). Working around this requires
a more complicated solution but is not a large obstacle.

Unfortunately, the ability to proxy does not extend to ``id()``
comparisons and ``isinstance()`` tests. In the case of the return value
of ``find_module()``, we accept that break in backward compatibility.
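
A minimal sketch of the proxying idea, assuming a simple
``__getattr__``-based approach (``ProxySpec`` and ``OldLoader`` are
hypothetical names, and the ``is_package`` conflict is deliberately not
handled here):

```python
class ProxySpec:
    """Toy spec that forwards unknown attribute lookups to its loader."""

    def __init__(self, name, loader):
        self.name = name
        self.loader = loader

    def __getattr__(self, attr):
        # Only called when normal lookup fails, so the spec's own
        # attributes always win over the loader's.
        if attr == 'loader':
            # Guard against infinite recursion if 'loader' is not set yet.
            raise AttributeError(attr)
        return getattr(self.loader, attr)


class OldLoader:
    """Hypothetical PEP 302-style loader."""

    def load_module(self, fullname):
        return 'loaded ' + fullname


spec = ProxySpec('spam', OldLoader())
# Code written against the old find_module() return value keeps working...
print(spec.load_module('spam'))
# ...but identity and type checks against the loader do not.
print(spec is spec.loader, isinstance(spec, OldLoader))
```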

Subclassing
-----------

.. XXX Allowed but discouraged?

Module Objects
--------------

Module objects will now have a ``__spec__`` attribute to which the
module's spec will be bound. None of the other import-related module
attributes will be changed or deprecated, though some of them could be.
Any such deprecation can wait until Python 4.

``ModuleSpec`` objects will not be kept in sync with the corresponding
module object's import-related attributes. They may differ, though in
practice they will be the same.

Finders
-------

Finders will now return ModuleSpec objects when ``find_module()`` is
called rather than loaders. For backward compatibility, ``ModuleSpec``
objects proxy the attributes of their ``loader`` attribute.

Adding another similar method to avoid backward-compatibility issues
is undesirable if avoidable. The import APIs have suffered enough.
The approach taken by this PEP should be sufficient.

The change to ``find_module()`` applies to both ``MetaPathFinder`` and
``PathEntryFinder``. ``PathEntryFinder.find_loader()`` will be
deprecated and, for backward compatibility, implicitly special-cased if
the method exists on a finder.

Loaders
-------

Loaders will have a new method, ``exec_module(module)``. Its only job
is to "exec" the module and consequently populate the module's
namespace. It is not responsible for creating or preparing the module
object, nor for any cleanup afterward. It has no return value.

The ``load_module()`` of loaders will still work and be an active part
of the loader API. It is still useful for cases where the default
module creation/preparation/cleanup is not appropriate for the loader.

A loader must have ``exec_module()`` or ``load_module()`` defined. If
both exist on the loader, ``exec_module()`` is used and
``load_module()`` is ignored.
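
The precedence rule can be sketched as follows. ``run_loader()``,
``BothLoader``, and ``LegacyLoader`` are hypothetical names, and the
module preparation is reduced to the bare minimum for the sake of the
example.

```python
import types


def run_loader(loader, name):
    """Hypothetical dispatch: prefer exec_module(), else use load_module()."""
    if hasattr(loader, 'exec_module'):
        module = types.ModuleType(name)  # created by the import system
        loader.exec_module(module)       # no return value is expected
        return module
    if hasattr(loader, 'load_module'):
        return loader.load_module(name)  # loader does everything itself
    raise ImportError('loader defines neither exec_module() nor load_module()')


class BothLoader:
    """Hypothetical loader defining both methods."""

    def exec_module(self, module):
        exec('value = 3', module.__dict__)

    def load_module(self, fullname):
        raise AssertionError('ignored because exec_module() exists')


class LegacyLoader:
    """Hypothetical loader with only the PEP 302 method."""

    def load_module(self, fullname):
        module = types.ModuleType(fullname)
        module.value = 'legacy'
        return module


print(run_loader(BothLoader(), 'both_demo').value)
print(run_loader(LegacyLoader(), 'legacy_demo').value)
```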

PEP 420 introduced the optional ``module_repr()`` loader method to limit
the amount of special-casing in the module type's ``__repr__()``. Since
this method is part of ``ModuleSpec``, it will be deprecated on loaders.
However, if it exists on a loader it will be used exclusively.

The loader ``init_module_attrs()`` method, added for Python 3.4, will be
eliminated in favor of the same method on ``ModuleSpec``.

However, ``InspectLoader.is_package()`` will not be deprecated even
though the same information is found on ``ModuleSpec``. ``ModuleSpec``
can use it to populate its own ``is_package`` if that information is
not otherwise available. Still, it will be made optional.

In addition to executing a module during loading, loaders will still be
directly responsible for providing APIs concerning module-related data.

Other Changes
-------------

* The various finders and loaders provided by ``importlib`` will be
updated to comply with this proposal.

* The spec for the ``__main__`` module will reflect how the interpreter
was started. For instance, with ``-m`` the spec's name will be that of
the run module, while ``__main__.__name__`` will still be "__main__".

* We add ``importlib.find_module()`` to mirror
``importlib.find_loader()`` (which becomes deprecated).

* Deprecations in ``importlib.util``: ``set_package()``,
``set_loader()``, and ``module_for_loader()``. ``module_to_load()``
(introduced in 3.4) can be removed.

* ``importlib.reload()`` is changed to use ``ModuleSpec.load()``.

* ``ModuleSpec.load()`` and ``importlib.reload()`` will now make use of
the per-module import lock, whereas ``Loader.load_module()`` did not.

Reference Implementation
------------------------

A reference implementation is available at <TBD>.

References
==========

[1] http://mail.python.org/pipermail/import-sig/2013-August/000658.html

Copyright
=========

This document has been placed in the public domain.

..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End:

Antoine Pitrou

2013-08-09 08:28:03 UTC

Hi,

Le Fri, 9 Aug 2013 00:34:34 -0600,

Post by Eric Snow
Abstract
========
This PEP proposes to add a new class to ``importlib.machinery`` called
``ModuleSpec``. It will contain all the import-related information
about a module without needing to load the module first. Finders will
now return a module's spec rather than a loader. The import system
will use the spec to load the module.

Looks good on the principle.

Post by Eric Snow
* ``name`` - the module's name (compare to ``__name__``).
* ``loader`` - the loader to use during loading and for module data
(compare to ``__loader__``).

Should it be the loader or just a factory to build it?
I'm wondering if in some cases creating a loader is costly.

Post by Eric Snow
* ``package`` - the name of the module's parent (compare to
``__package__``).

Is it None if there is no parent?

Post by Eric Snow
* ``is_package`` - whether or not the module is a package.
* ``origin`` - the location from which the module originates.
* ``filename`` - like origin, but limited to a path-based location
(compare to ``__file__``).

Can you explain the difference between origin and filename (or, better,
give an example)?

Post by Eric Snow
* ``load(module=None, *, is_reload=False)`` - calls the loader's
``exec_module()``, falling back to ``load_module()`` if necessary.
This method performs the former responsibilities of loaders for
managing modules before actually loading and for cleaning up. The
reload case is facilitated by the ``module`` and ``is_reload``
parameters.

So how about separate load() and reload() methods?

Or how about keeping the method API?

Post by Eric Snow
Module Objects
--------------
Module objects will now have a ``__spec__`` attribute to which the
module's spec will be bound.

Nice!

Post by Eric Snow
Loaders will have a new method, ``exec_module(module)``. Its only job
is to "exec" the module and consequently populate the module's
namespace. It is not responsible for creating or preparing the module
object, nor for any cleanup afterward. It has no return value.

Does it work with extension modules as well? Generally, extension
modules are populated when created (i.e. the two steps aren't separate
at the C API level, IIRC).

Regards

Antoine.

Brett Cannon

2013-08-09 14:43:10 UTC

Post by Antoine Pitrou
Hi,
Le Fri, 9 Aug 2013 00:34:34 -0600,

Looks good on the principle.

Post by Eric Snow
* ``name`` - the module's name (compare to ``__name__``).
* ``loader`` - the loader to use during loading and for module data
(compare to ``__loader__``).

Should it be the loader or just a factory to build it?
I'm wondering if in some cases creating a loader is costly.

Theoretically it could be costly, but up to this point I have not seen a
single loader that cost a lot to create. Every loader I have ever written
just stores details that the finder had to calculate for its work and
potentially stores something, e.g. an open zipfile that the finder used to
see if a module was there.

Post by Antoine Pitrou

Post by Eric Snow
* ``package`` - the name of the module's parent (compare to
``__package__``).

Is it None if there is no parent?

Top-level modules have the value of '' for __package__. None is used to
represent an unknown value.

-Brett

_______________________________________________
Import-SIG mailing list
Import-SIG at python.org
http://mail.python.org/mailman/listinfo/import-sig


Eric Snow

2013-08-09 16:45:22 UTC

Post by Antoine Pitrou
Le Fri, 9 Aug 2013 00:34:34 -0600,

Post by Eric Snow
* ``name`` - the module's name (compare to ``__name__``).
* ``loader`` - the loader to use during loading and for module data
(compare to ``__loader__``).

Should it be the loader or just a factory to build it?
I'm wondering if in some cases creating a loader is costly.

The finder is currently responsible for creating the loader and this PEP
does not propose changing that. So any such loader already has to deal
with this. I suppose some loader could be expensive to create, but none of
the existing loaders in the stdlib are that costly. If some future loader
runs into this problem they can pretty easily write the loader in such a
way that it defers the costly operations. I'll make a note in the PEP
about this.

Post by Antoine Pitrou

Post by Eric Snow
* ``package`` - the name of the module's parent (compare to
``__package__``).

Is it None if there is no parent?

As Brett noted, it is ''. This is the same as the __package__ attribute of
modules. The goal is to keep the same behavior, as much as possible, for
all the features that are moved into ModuleSpec. I'll make this objective
more clear in the PEP.

Post by Antoine Pitrou

Can you explain the difference between origin and filename (or, better,
give an example)?

Yeah, that wasn't too clear, was it? filename maps directly to the
module's __file__ attribute, which is not set for all modules. For
instance, built-in modules do not set it nor do namespace packages. In
those cases it is still nice to be able to indicate where the module came
from. For built-in modules origin will be set to 'built-in' and for
namespace packages 'namespace'. For any module with a filename, origin is
set to the filename.

Having both origin and filename is meant to provide for different usage.
filename is used to populate a module's __file__ attribute. If set, it
indicates a path-based module (along with cached and path). In contrast,
origin has a broader meaning and is used by the module_repr() method.
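
Roughly speaking, and only as an illustration (describe() is a made-up
helper, not something from the patch), the two attributes would feed a
repr along these lines:

```python
def describe(name, origin, filename=None):
    """Hypothetical repr helper: use filename when path-based, else origin."""
    if filename is not None:
        return '<module %r from %r>' % (name, filename)
    return '<module %r (%s)>' % (name, origin)


# Built-in and namespace modules have an origin but no filename.
print(describe('sys', 'built-in'))
print(describe('nspkg', 'namespace'))
# A path-based module has both, and they are the same value.
print(describe('spam', '/x/spam.py', filename='/x/spam.py'))
```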

I suppose there could be a flag to indicate the module is path-based, but I
went with a separate spec attribute. Likewise, I toyed with the idea of a
path-based subclass, perhaps PathModuleSpec, but wanted to stick with a
one-size-fits-all spec class since it is meant to be used almost
exclusively for state rather than functionality. In some ways it's like
types.SimpleNamespace, but with a couple of import-related methods and some
dedicated state.

I'll make sure the PEP reflects this.

Post by Antoine Pitrou

So how about separate load() and reload() methods?

I thought about that too, but found it simpler to keep them together.
Also, reload is a pretty specialized activity and I plan on leaving some
of the boilerplate of it to importlib.reload(). However, I'm not convinced
either way actually. I'll think about that some more and update the PEP
regardless. Do you have a case to make for making them separate?

Post by Antoine Pitrou

Or how about keeping the method API?

Because it is a static piece of data. At the point that we can remove the
backward compatibility support, we would be stuck with a method when it
should be just a normal attribute.

Post by Antoine Pitrou

Post by Eric Snow
Module Objects
--------------
Module objects will now have a ``__spec__`` attribute to which the
module's spec will be bound.

Nice!

Ironic that this PEP adds yet another import-related attribute to modules.
:) Hopefully it's the last one.

Post by Antoine Pitrou

Does it work with extension modules as well? Generally, extension
modules are populated when created (i.e. the two steps aren't separate
at the C API level, IIRC).

Yeah, it works great. We simply don't implement exec_module() on
ExtensionFileLoader and things just stay the same. There is room to add an
exec_module() and update the C-API for extension modules to support it, but
I'm leaving that out of the PEP. However, I will mention that in the PEP
because your question is quite relevant and not well answered there.

-eric


Antoine Pitrou

2013-08-09 19:22:49 UTC

On Fri, 9 Aug 2013 10:45:22 -0600