fedora 14 kernel performance with ip forwarding workload
by Jesse Brandeburg
The other day I ran the stock Fedora kernel on my IP forwarding setup
to see what the performance was like, and it wasn't very good.
The system is an S5520HC, dual-socket 2.93GHz Xeon 5570 (Nehalem), with
3 quad-port 82580 adapters (12 ports). Traffic is bidirectional 64-byte
packets being forwarded and received on each port, basically port-to-port
routing. I am only using 12 flows currently.
The driver is igb, and I am using an affinity script that lines up each
pair of ports forwarding traffic into an optimal configuration for cache
locality. I am also setting remote_node_defrag_ratio to 0 to stop
cross-node traffic.
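For reference, the core of such an affinity setup is just writing one CPU
mask per queue interrupt; a minimal sketch in C (the IRQ number and CPU are
placeholders for illustration, not my actual layout, and it assumes fewer
than 32 CPUs):

#include <stdio.h>

static int set_irq_affinity(int irq, int cpu)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	/* single-CPU hex mask, e.g. cpu 2 -> "4" */
	fprintf(f, "%x\n", 1 << cpu);
	return fclose(f);
}

int main(void)
{
	/* e.g. pin IRQ 73 (one TxRx queue, hypothetical) to CPU 2 */
	return set_irq_affinity(73, 2) ? 1 : 0;
}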
With the default Fedora kernel from F14, it appears that
CONFIG_NETFILTER=y means I cannot unload all of netfilter even if I stop
the iptables service.
perf showed netfilter prominently, and removing it gives me much higher
throughput. Is there a reason for CONFIG_NETFILTER=y? Isn't it a good
thing to be able to disable netfilter if you want to?
Jesse
Re: [kernel] Disable debugging options.
by Thorsten Leemhuis
On 18.06.2012 15:09, Josh Boyer wrote:
> commit 3b37beedf48825354c42e25f7001677320958d38
> Author: Josh Boyer <jwboyer(a)redhat.com>
> Date: Mon Jun 18 09:09:58 2012 -0400
>
> Disable debugging options.
>
> config-generic | 8 ++--
> config-nodebug | 112 ++++++++++++++++++++++++++--------------------------
> config-x86-generic | 2 +-
> kernel.spec | 9 +++-
> 4 files changed, 67 insertions(+), 64 deletions(-)
> ---
> diff --git a/config-generic b/config-generic
> index 6b1c651..08c28ba 100644
> --- a/config-generic
> +++ b/config-generic
> @@ -1446,13 +1446,13 @@ CONFIG_B43_SDIO=y
> CONFIG_B43_BCMA=y
> # CONFIG_B43_BCMA_EXTRA is not set
> CONFIG_B43_BCMA_PIO=y
> -CONFIG_B43_DEBUG=y
> +# CONFIG_B43_DEBUG is not set
> [...]
Just out of curiosity: Why is enabling and disabling the debug options
done via "make {no,}debug" in the git checkout and not by an conditional
within the spec file? That afaics makes it harder to switch the debug
options off if you only have the SRPM at hand.
Or am I missing something obvious here?
CU
knurd
enable kernel options for criu
by Adrian Reber
Resending my mail from about a month ago. Any thoughts about enabling
these two options?
----- Forwarded message from Adrian Reber <adrian(a)lisas.de> -----
Date: Tue, 21 Aug 2012 15:31:02 +0200
From: Adrian Reber <adrian(a)lisas.de>
To: kernel(a)lists.fedoraproject.org
Subject: enable kernel options for criu
Now that Checkpoint/Restore in Userspace[1] has released the first version
of its user-space tools, I want to ask whether it is possible to enable a
few kernel configuration options needed to use criu. I have prepared a
package for crtools but have not yet submitted it; as long as the kernel
support is not enabled, it does not make much sense.
The following additional options (according to http://criu.org/CR_tools)
are necessary:
CONFIG_CHECKPOINT_RESTORE
which depends on
CONFIG_EXPERT
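For a concrete example of what is gated here: the kcmp(2) syscall only
exists when CONFIG_CHECKPOINT_RESTORE is set, so a quick probe like this
(x86_64 syscall number assumed; a sketch, not part of crtools) shows
whether a given kernel has the support:

#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_kcmp
#define __NR_kcmp 312	/* x86_64; from 3.5+ kernel headers */
#endif
#define KCMP_FILE 0

int main(void)
{
	pid_t pid = getpid();
	/* do fd 0 and fd 1 of this task refer to the same struct file? */
	long ret = syscall(__NR_kcmp, pid, pid, KCMP_FILE, 0, 1);

	if (ret < 0 && errno == ENOSYS)
		printf("kcmp: ENOSYS (CONFIG_CHECKPOINT_RESTORE off?)\n");
	else
		printf("kcmp returned %ld\n", ret);
	return 0;
}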
Any plans on enabling these options? Any chance they could be turned on?
Adrian
[1] http://criu.org/
----- End forwarded message -----
[patch F18/rawhide] team: update to latest net-next
by Jiri Pirko
Update team driver to latest net-next.
Split patches available here:
http://people.redhat.com/jpirko/f18_team_update_3/
Jiri Pirko (5):
team: add support for non-ethernet devices
team: don't print warn message on -ESRCH during event send
vlan: add helper which can be called to see if device is used by vlan
team: do not allow to add VLAN challenged port when vlan is used
team: send port changed when added
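The vlan_uses_dev() helper added in this series is usable from any driver
holding rtnl; a rough sketch of a hypothetical caller (the actual team
usage is in the team_port_add() hunk below):

#include <linux/netdevice.h>
#include <linux/if_vlan.h>

/* rtnl must be held: vlan_uses_dev() uses rtnl_dereference() */
static int ex_can_enslave(struct net_device *master,
			  struct net_device *slave)
{
	if ((slave->features & NETIF_F_VLAN_CHALLENGED) &&
	    vlan_uses_dev(master))
		return -EPERM;
	return 0;
}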
drivers/net/team/Kconfig | 4 +-
drivers/net/team/team.c | 137 ++++++++++++++++++++++++--------
drivers/net/team/team_mode_broadcast.c | 8 +-
drivers/net/team/team_mode_roundrobin.c | 8 +-
include/linux/if_team.h | 4 +-
include/linux/if_vlan.h | 9 ++-
net/8021q/vlan_core.c | 6 ++
7 files changed, 128 insertions(+), 48 deletions(-)
Signed-off-by: Jiri Pirko <jpirko(a)redhat.com>
diff --git a/drivers/net/team/Kconfig b/drivers/net/team/Kconfig
index 6a7260b..6b08bd4 100644
--- a/drivers/net/team/Kconfig
+++ b/drivers/net/team/Kconfig
@@ -21,7 +21,7 @@ config NET_TEAM_MODE_BROADCAST
---help---
Basic mode where packets are transmitted always by all suitable ports.
- All added ports are setup to have team's mac address.
+ All added ports are setup to have team's device address.
To compile this team mode as a module, choose M here: the module
will be called team_mode_broadcast.
@@ -33,7 +33,7 @@ config NET_TEAM_MODE_ROUNDROBIN
Basic mode where port used for transmitting packets is selected in
round-robin fashion using packet counter.
- All added ports are setup to have team's mac address.
+ All added ports are setup to have team's device address.
To compile this team mode as a module, choose M here: the module
will be called team_mode_roundrobin.
diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index 341b65d..17d4be3 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -54,29 +54,29 @@ static struct team_port *team_port_get_rtnl(const struct net_device *dev)
}
/*
- * Since the ability to change mac address for open port device is tested in
+ * Since the ability to change device address for open port device is tested in
* team_port_add, this function can be called without control of return value
*/
-static int __set_port_mac(struct net_device *port_dev,
- const unsigned char *dev_addr)
+static int __set_port_dev_addr(struct net_device *port_dev,
+ const unsigned char *dev_addr)
{
struct sockaddr addr;
- memcpy(addr.sa_data, dev_addr, ETH_ALEN);
- addr.sa_family = ARPHRD_ETHER;
+ memcpy(addr.sa_data, dev_addr, port_dev->addr_len);
+ addr.sa_family = port_dev->type;
return dev_set_mac_address(port_dev, &addr);
}
-static int team_port_set_orig_mac(struct team_port *port)
+static int team_port_set_orig_dev_addr(struct team_port *port)
{
- return __set_port_mac(port->dev, port->orig.dev_addr);
+ return __set_port_dev_addr(port->dev, port->orig.dev_addr);
}
-int team_port_set_team_mac(struct team_port *port)
+int team_port_set_team_dev_addr(struct team_port *port)
{
- return __set_port_mac(port->dev, port->team->dev->dev_addr);
+ return __set_port_dev_addr(port->dev, port->team->dev->dev_addr);
}
-EXPORT_SYMBOL(team_port_set_team_mac);
+EXPORT_SYMBOL(team_port_set_team_dev_addr);
static void team_refresh_port_linkup(struct team_port *port)
{
@@ -848,7 +848,10 @@ static struct netpoll_info *team_netpoll_info(struct team *team)
}
#endif
-static void __team_port_change_check(struct team_port *port, bool linkup);
+static void __team_port_change_port_added(struct team_port *port, bool linkup);
+
+static int team_dev_type_check_change(struct net_device *dev,
+ struct net_device *port_dev);
static int team_port_add(struct team *team, struct net_device *port_dev)
{
@@ -857,9 +860,8 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
char *portname = port_dev->name;
int err;
- if (port_dev->flags & IFF_LOOPBACK ||
- port_dev->type != ARPHRD_ETHER) {
- netdev_err(dev, "Device %s is of an unsupported type\n",
+ if (port_dev->flags & IFF_LOOPBACK) {
+ netdev_err(dev, "Device %s is loopback device. Loopback devices can't be added as a team port\n",
portname);
return -EINVAL;
}
@@ -870,6 +872,17 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
return -EBUSY;
}
+ if (port_dev->features & NETIF_F_VLAN_CHALLENGED &&
+ vlan_uses_dev(dev)) {
+ netdev_err(dev, "Device %s is VLAN challenged and team device has VLAN set up\n",
+ portname);
+ return -EPERM;
+ }
+
+ err = team_dev_type_check_change(dev, port_dev);
+ if (err)
+ return err;
+
if (port_dev->flags & IFF_UP) {
netdev_err(dev, "Device %s is up. Set it down before adding it as a team port\n",
portname);
@@ -891,7 +904,7 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
goto err_set_mtu;
}
- memcpy(port->orig.dev_addr, port_dev->dev_addr, ETH_ALEN);
+ memcpy(port->orig.dev_addr, port_dev->dev_addr, port_dev->addr_len);
err = team_port_enter(team, port);
if (err) {
@@ -948,7 +961,7 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
team_port_enable(team, port);
list_add_tail_rcu(&port->list, &team->port_list);
__team_compute_features(team);
- __team_port_change_check(port, !!netif_carrier_ok(port_dev));
+ __team_port_change_port_added(port, !!netif_carrier_ok(port_dev));
__team_options_change_check(team);
netdev_info(dev, "Port device %s added\n", portname);
@@ -972,7 +985,7 @@ err_vids_add:
err_dev_open:
team_port_leave(team, port);
- team_port_set_orig_mac(port);
+ team_port_set_orig_dev_addr(port);
err_port_enter:
dev_set_mtu(port_dev, port->orig.mtu);
@@ -983,6 +996,8 @@ err_set_mtu:
return err;
}
+static void __team_port_change_port_removed(struct team_port *port);
+
static int team_port_del(struct team *team, struct net_device *port_dev)
{
struct net_device *dev = team->dev;
@@ -999,8 +1014,7 @@ static int team_port_del(struct team *team, struct net_device *port_dev)
__team_option_inst_mark_removed_port(team, port);
__team_options_change_check(team);
__team_option_inst_del_port(team, port);
- port->removed = true;
- __team_port_change_check(port, false);
+ __team_port_change_port_removed(port);
team_port_disable(team, port);
list_del_rcu(&port->list);
netdev_rx_handler_unregister(port_dev);
@@ -1009,7 +1023,7 @@ static int team_port_del(struct team *team, struct net_device *port_dev)
vlan_vids_del_by_dev(port_dev, dev);
dev_close(port_dev);
team_port_leave(team, port);
- team_port_set_orig_mac(port);
+ team_port_set_orig_dev_addr(port);
dev_set_mtu(port_dev, port->orig.mtu);
synchronize_rcu();
kfree(port);
@@ -1295,17 +1309,18 @@ static void team_set_rx_mode(struct net_device *dev)
static int team_set_mac_address(struct net_device *dev, void *p)
{
+ struct sockaddr *addr = p;
struct team *team = netdev_priv(dev);
struct team_port *port;
- int err;
- err = eth_mac_addr(dev, p);
- if (err)
- return err;
+ if (dev->type == ARPHRD_ETHER && !is_valid_ether_addr(addr->sa_data))
+ return -EADDRNOTAVAIL;
+ memcpy(dev->dev_addr, addr->sa_data, dev->addr_len);
+ dev->addr_assign_type &= ~NET_ADDR_RANDOM;
rcu_read_lock();
list_for_each_entry_rcu(port, &team->port_list, list)
- if (team->ops.port_change_mac)
- team->ops.port_change_mac(team, port);
+ if (team->ops.port_change_dev_addr)
+ team->ops.port_change_dev_addr(team, port);
rcu_read_unlock();
return 0;
}
@@ -1536,6 +1551,45 @@ static const struct net_device_ops team_netdev_ops = {
* rt netlink interface
***********************/
+static void team_setup_by_port(struct net_device *dev,
+ struct net_device *port_dev)
+{
+ dev->header_ops = port_dev->header_ops;
+ dev->type = port_dev->type;
+ dev->hard_header_len = port_dev->hard_header_len;
+ dev->addr_len = port_dev->addr_len;
+ dev->mtu = port_dev->mtu;
+ memcpy(dev->broadcast, port_dev->broadcast, port_dev->addr_len);
+ memcpy(dev->dev_addr, port_dev->dev_addr, port_dev->addr_len);
+ dev->addr_assign_type &= ~NET_ADDR_RANDOM;
+}
+
+static int team_dev_type_check_change(struct net_device *dev,
+ struct net_device *port_dev)
+{
+ struct team *team = netdev_priv(dev);
+ char *portname = port_dev->name;
+ int err;
+
+ if (dev->type == port_dev->type)
+ return 0;
+ if (!list_empty(&team->port_list)) {
+ netdev_err(dev, "Device %s is of different type\n", portname);
+ return -EBUSY;
+ }
+ err = call_netdevice_notifiers(NETDEV_PRE_TYPE_CHANGE, dev);
+ err = notifier_to_errno(err);
+ if (err) {
+ netdev_err(dev, "Refused to change device type\n");
+ return err;
+ }
+ dev_uc_flush(dev);
+ dev_mc_flush(dev);
+ team_setup_by_port(dev, port_dev);
+ call_netdevice_notifiers(NETDEV_POST_TYPE_CHANGE, dev);
+ return 0;
+}
+
static void team_setup(struct net_device *dev)
{
ether_setup(dev);
@@ -2245,19 +2299,17 @@ static void __team_options_change_check(struct team *team)
list_add_tail(&opt_inst->tmp_list, &sel_opt_inst_list);
}
err = team_nl_send_event_options_get(team, &sel_opt_inst_list);
- if (err)
+ if (err && err != -ESRCH)
netdev_warn(team->dev, "Failed to send options change via netlink (err %d)\n",
err);
}
/* rtnl lock is held */
-static void __team_port_change_check(struct team_port *port, bool linkup)
+
+static void __team_port_change_send(struct team_port *port, bool linkup)
{
int err;
- if (!port->removed && port->state.linkup == linkup)
- return;
-
port->changed = true;
port->state.linkup = linkup;
team_refresh_port_linkup(port);
@@ -2276,12 +2328,29 @@ static void __team_port_change_check(struct team_port *port, bool linkup)
send_event:
err = team_nl_send_event_port_list_get(port->team);
- if (err)
- netdev_warn(port->team->dev, "Failed to send port change of device %s via netlink\n",
- port->dev->name);
+ if (err && err != -ESRCH)
+ netdev_warn(port->team->dev, "Failed to send port change of device %s via netlink (err %d)\n",
+ port->dev->name, err);
}
+static void __team_port_change_check(struct team_port *port, bool linkup)
+{
+ if (port->state.linkup != linkup)
+ __team_port_change_send(port, linkup);
+}
+
+static void __team_port_change_port_added(struct team_port *port, bool linkup)
+{
+ __team_port_change_send(port, linkup);
+}
+
+static void __team_port_change_port_removed(struct team_port *port)
+{
+ port->removed = true;
+ __team_port_change_send(port, false);
+}
+
static void team_port_change_check(struct team_port *port, bool linkup)
{
struct team *team = port->team;
diff --git a/drivers/net/team/team_mode_broadcast.c b/drivers/net/team/team_mode_broadcast.c
index c96e4d2..9db0171 100644
--- a/drivers/net/team/team_mode_broadcast.c
+++ b/drivers/net/team/team_mode_broadcast.c
@@ -48,18 +48,18 @@ static bool bc_transmit(struct team *team, struct sk_buff *skb)
static int bc_port_enter(struct team *team, struct team_port *port)
{
- return team_port_set_team_mac(port);
+ return team_port_set_team_dev_addr(port);
}
-static void bc_port_change_mac(struct team *team, struct team_port *port)
+static void bc_port_change_dev_addr(struct team *team, struct team_port *port)
{
- team_port_set_team_mac(port);
+ team_port_set_team_dev_addr(port);
}
static const struct team_mode_ops bc_mode_ops = {
.transmit = bc_transmit,
.port_enter = bc_port_enter,
- .port_change_mac = bc_port_change_mac,
+ .port_change_dev_addr = bc_port_change_dev_addr,
};
static const struct team_mode bc_mode = {
diff --git a/drivers/net/team/team_mode_roundrobin.c b/drivers/net/team/team_mode_roundrobin.c
index ad7ed0e..105135a 100644
--- a/drivers/net/team/team_mode_roundrobin.c
+++ b/drivers/net/team/team_mode_roundrobin.c
@@ -66,18 +66,18 @@ drop:
static int rr_port_enter(struct team *team, struct team_port *port)
{
- return team_port_set_team_mac(port);
+ return team_port_set_team_dev_addr(port);
}
-static void rr_port_change_mac(struct team *team, struct team_port *port)
+static void rr_port_change_dev_addr(struct team *team, struct team_port *port)
{
- team_port_set_team_mac(port);
+ team_port_set_team_dev_addr(port);
}
static const struct team_mode_ops rr_mode_ops = {
.transmit = rr_transmit,
.port_enter = rr_port_enter,
- .port_change_mac = rr_port_change_mac,
+ .port_change_dev_addr = rr_port_change_dev_addr,
};
static const struct team_mode rr_mode = {
diff --git a/include/linux/if_team.h b/include/linux/if_team.h
index aa2e167..667f2a5 100644
--- a/include/linux/if_team.h
+++ b/include/linux/if_team.h
@@ -105,7 +105,7 @@ struct team_mode_ops {
bool (*transmit)(struct team *team, struct sk_buff *skb);
int (*port_enter)(struct team *team, struct team_port *port);
void (*port_leave)(struct team *team, struct team_port *port);
- void (*port_change_mac)(struct team *team, struct team_port *port);
+ void (*port_change_dev_addr)(struct team *team, struct team_port *port);
void (*port_enabled)(struct team *team, struct team_port *port);
void (*port_disabled)(struct team *team, struct team_port *port);
};
@@ -231,7 +231,7 @@ static inline struct team_port *team_get_port_by_index_rcu(struct team *team,
return NULL;
}
-extern int team_port_set_team_mac(struct team_port *port);
+extern int team_port_set_team_dev_addr(struct team_port *port);
extern int team_options_register(struct team *team,
const struct team_option *option,
size_t option_count);
diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index a810987..e6ff12d 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -74,8 +74,6 @@ static inline struct vlan_ethhdr *vlan_eth_hdr(const struct sk_buff *skb)
/* found in socket.c */
extern void vlan_ioctl_set(int (*hook)(struct net *, void __user *));
-struct vlan_info;
-
static inline int is_vlan_dev(struct net_device *dev)
{
return dev->priv_flags & IFF_802_1Q_VLAN;
@@ -101,6 +99,8 @@ extern int vlan_vids_add_by_dev(struct net_device *dev,
const struct net_device *by_dev);
extern void vlan_vids_del_by_dev(struct net_device *dev,
const struct net_device *by_dev);
+
+extern bool vlan_uses_dev(const struct net_device *dev);
#else
static inline struct net_device *
__vlan_find_dev_deep(struct net_device *real_dev, u16 vlan_id)
@@ -151,6 +151,11 @@ static inline void vlan_vids_del_by_dev(struct net_device *dev,
const struct net_device *by_dev)
{
}
+
+static inline bool vlan_uses_dev(const struct net_device *dev)
+{
+ return false;
+}
#endif
/**
diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index 8ca533c..b258da8 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -368,3 +368,9 @@ void vlan_vids_del_by_dev(struct net_device *dev,
vlan_vid_del(dev, vid_info->vid);
}
EXPORT_SYMBOL(vlan_vids_del_by_dev);
+
+bool vlan_uses_dev(const struct net_device *dev)
+{
+ return rtnl_dereference(dev->vlan_info) ? true : false;
+}
+EXPORT_SYMBOL(vlan_uses_dev);
--
1.7.11.4
Meeting Agenda Item: Introduction Vipin Kumar
by vipin kumar
Hello folks!
I'm Vipin, from a Computer Science & Engineering background at IIT-BHU. I
worked for three years as a Research Engineer at C-DOT, a telecommunication
company in India. More about me:
IRC handle: _love_hurts_
Skills: autotools, C/C++, gdb, RPM packaging, documentation, software
development on SunOS
Interested in: kernel development
I can work around 10 hrs a week, and if I get work related to data
structures, algorithms, C/C++, or networking, then I can work up to 20 hrs
a week :).
--
Vipin K.
Research Engineer,
C-DOTB, India
[PATCH 0/3] Use rusty-style signed modules
by Josh Boyer
Hi All,
Following is a brief series to change the F18 kernel over to using the
"Rusty" style signed modules. This takes David's 'modsign-rusty' branch
and applies it in place of the currently used 'modsign' patch set.
There is one notable change I've made, which is to replace David's:
MODSIGN: Sign modules during the build process
patch with a different one. The new patch adds a new 'modules_sign'
make target and allows us to still utilize RPM's debuginfo generation
with signed modules. I've attached just that patch below for closer
review.
To spare people's inboxes, patch 3/3 won't contain the full
modsign-rusty-jwb and secure-boot patchsets. Those can be found here:
http://jwboyer.fedorapeople.org/pub/modsign-rusty-jwb.patch
http://jwboyer.fedorapeople.org/pub/secure-boot-20120830.patch
Most of the overall change in these patches is dealing with moving some
of the modules-extra handling around to make it easier. The rest should
be fairly self-explanatory.
I've tested this on both x86_64 and i686/PAE KVM guests, using the kernel
command line options to verify things. The modules are indeed still
signed after install, and the debuginfo still seems to work properly:
gdb can find the correct .debug files for modules, etc.
Comments/questions welcome.
josh
---
From d992574c734c346760a370b32a28580d47729f7c Mon Sep 17 00:00:00 2001
From: Josh Boyer <jwboyer(a)redhat.com>
Date: Fri, 14 Sep 2012 11:58:01 -0400
Subject: [PATCH 21/27] MODSIGN: Add modules_sign make target
If CONFIG_MODULE_SIG is set, and 'make modules_sign' is called then this
patch will cause the modules to get a signature installed. The make target
is intended to be run after 'make modules_install', and will modify the
modules in-place in the installed location.
The signature will be appended to the module, along with the payload size,
the signature size and a magic string. This requires private and public
keys to be available. By default these are expected to be found in PGP
keyring files called modsign.sec (the secret key) and modsign.pub (the
public key) in the build root.
If signing occurs, lines like the following will appear in the build log:
SIGN [M] <install path>/fs/foo/foo.ko
If the signature step is skipped, the following will be seen instead:
NO SIGN [M] <install path>/fs/foo/foo.ko
NOTE! After the signature step, the signed module must not be passed through
strip. If you wish to strip or otherwise modify the kernel modules, use the
built-in stripping capabilities with 'make modules_install' or perform said
modifications before calling this make target. This restriction may affect
packaging tools (such as rpmbuild) and initramfs composition tools.
Note that I do not agree with this method of attaching signatures to modules.
Based on work by David Howells <dhowells(a)redhat.com>
Signed-off-by: Josh Boyer <jwboyer(a)redhat.com>
---
Makefile | 6 +++
scripts/Makefile.modsign | 98 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 104 insertions(+)
create mode 100644 scripts/Makefile.modsign
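Before the patch itself: to make the appended trailer concrete, here is a
small userspace sketch (mine, not part of the patch) that looks for the
layout the rule below produces, i.e. module contents, detached signature,
the 5-character size field from 'stat --printf %-5s', and the 30-byte
magic string:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define MAGIC		"This Is A Crypto Signed Module"
#define MAGIC_LEN	30
#define SIZE_LEN	5	/* stat --printf %-5s field */

int main(int argc, char **argv)
{
	char tail[SIZE_LEN + MAGIC_LEN + 1] = { 0 };
	FILE *f;

	if (argc < 2)
		return 2;
	f = fopen(argv[1], "rb");
	if (!f || fseek(f, -(long)(SIZE_LEN + MAGIC_LEN), SEEK_END) ||
	    fread(tail, 1, SIZE_LEN + MAGIC_LEN, f) != SIZE_LEN + MAGIC_LEN) {
		perror(argv[1]);
		return 1;
	}
	fclose(f);
	if (!memcmp(tail + SIZE_LEN, MAGIC, MAGIC_LEN))
		printf("%s: signed, signature is %ld bytes\n",
		       argv[1], strtol(tail, NULL, 10));
	else
		printf("%s: not signed\n", argv[1]);
	return 0;
}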
diff --git a/Makefile b/Makefile
index a347b81..a708eae 100644
--- a/Makefile
+++ b/Makefile
@@ -965,6 +965,12 @@ _modinst_post: _modinst_
$(Q)$(MAKE) -f $(srctree)/scripts/Makefile.fwinst obj=firmware __fw_modinst
$(call cmd,depmod)
+ifeq ($(CONFIG_MODULE_SIG), y)
+PHONY += modules_sign
+modules_sign:
+ $(Q)$(MAKE) -f $(srctree)/scripts/Makefile.modsign
+endif
+
else # CONFIG_MODULES
# Modules not configured
diff --git a/scripts/Makefile.modsign b/scripts/Makefile.modsign
new file mode 100644
index 0000000..3ee7d3a
--- /dev/null
+++ b/scripts/Makefile.modsign
@@ -0,0 +1,98 @@
+# ==========================================================================
+# Signing modules
+# ==========================================================================
+
+PHONY := __modsign
+__modsign:
+
+include scripts/Kbuild.include
+
+__modules := $(sort $(shell grep -h '\.ko' /dev/null $(wildcard $(MODVERDIR)/*.mod)))
+modules := $(patsubst %.o,%.ko,$(wildcard $(__modules:.ko=.o)))
+
+PHONY += $(modules)
+__modsign: $(modules)
+ @:
+
+MODSECKEY = ./modsign.sec
+MODPUBKEY = ./modsign.pub
+KEYFLAGS = --no-default-keyring --secret-keyring $(MODSECKEY) --keyring $(MODPUBKEY) --no-default-keyring --homedir . --no-options --no-auto-check-trustdb --no-permission-warning
+
+ifdef CONFIG_MODULE_SIG_SHA1
+KEYFLAGS += --digest-algo=SHA1
+else
+ifdef CONFIG_MODULE_SIG_SHA224
+KEYFLAGS += --digest-algo=SHA224
+else
+ifdef CONFIG_MODULE_SIG_SHA256
+KEYFLAGS += --digest-algo=SHA256
+else
+ifdef CONFIG_MODULE_SIG_SHA384
+KEYFLAGS += --digest-algo=SHA384
+else
+ifdef CONFIG_MODULE_SIG_SHA512
+KEYFLAGS += --digest-algo=SHA512
+else
+endif
+endif
+endif
+endif
+endif
+
+ifdef MODKEYNAME
+KEYFLAGS += --default-key $(MODKEYNAME)
+endif
+
+ifeq ($(wildcard $(MODSECKEY))+$(wildcard $(MODPUBKEY)),$(MODSECKEY)+$(MODPUBKEY))
+ifeq ($(KBUILD_SRC),)
+ # no O= is being used
+ SCRIPTS_DIR := scripts
+else
+ SCRIPTS_DIR := $(KBUILD_SRC)/scripts
+endif
+SIGN_MODULES := 1
+else
+SIGN_MODULES := 0
+endif
+
+# only sign if it's an in-tree module
+ifneq ($(KBUILD_EXTMOD),)
+SIGN_MODULES := 0
+endif
+
+ifeq ($(SIGN_MODULES),1)
+KEYRING_DEP := modsign.sec modsign.pub
+quiet_cmd_sign_ko = SIGN [M] $(2)/$(notdir $@)
+ cmd_sign_ko = \
+ rm -f $(2)/$(notdir $(a)).sig && \
+ gpg --batch --no-greeting $(KEYFLAGS) -b $(2)/$(notdir $@) && \
+ ( \
+ cat $(2)/$(notdir $@) $(2)/$(notdir $(a)).sig && \
+ stat --printf %-5s $(2)/$(notdir $(a)).sig && \
+ echo -n "This Is A Crypto Signed Module" \
+ ) >$(2)/$(notdir $(a)).signed && \
+ mv $(2)/$(notdir $(a)).signed $(2)/$(notdir $@) && \
+ rm -f $(2)/$(notdir $(a)).sig
+else
+KEYRING_DEP :=
+quiet_cmd_sign_ko = NO SIGN [M] $@
+ cmd_sign_ko = \
+ true
+endif
+
+#quiet_cmd_modules_sign = SIGN $@
+# cmd_modules_sign = mkdir -p $(2); cp $@ $(2) ; $(mod_strip_cmd) $(2)/$(notdir $@)
+
+# Modules built outside the kernel source tree go into extra by default
+INSTALL_MOD_DIR ?= extra
+ext-mod-dir = $(INSTALL_MOD_DIR)$(subst $(patsubst %/,%,$(KBUILD_EXTMOD)),,$(@D))
+
+modinst_dir = $(if $(KBUILD_EXTMOD),$(ext-mod-dir),kernel/$(@D))
+
+$(modules):
+ $(call cmd,sign_ko,$(MODLIB)/$(modinst_dir))
+
+# Declare the contents of the .PHONY variable as phony. We keep that
+# # information in a variable se we can use it in if_changed and friends.
+
+.PHONY: $(PHONY)
--
1.7.11.4
[PATCH PULL F17] Uprobes backport from upstream
by Anton Arapov
Hello,
Test build: http://koji.fedoraproject.org/koji/taskinfo?taskID=4490824
This also fixes: https://bugzilla.redhat.com/show_bug.cgi?id=849364
The split-out series is available in the git repository at:
http://fedorapeople.org/cgit/aarapov/public_git/kernel-uprobes.git/log/?h...
(Just @jistone's patch is not upstream that exports functions.)
Ananth N Mavinakayanahalli (1):
uprobes: Pass probed vaddr to arch_uprobe_analyze_insn()
Josh Stone (1):
uprobes: add exports necessary for uprobes use by modules
Oleg Nesterov (52):
uprobes: Optimize is_swbp_at_addr() for current->mm
uprobes: Change read_opcode() to use FOLL_FORCE
uprobes: Introduce find_active_uprobe() helper
uprobes: Teach find_active_uprobe() to provide the "is_swbp" info
uprobes: Change register_for_each_vma() to take mm->mmap_sem for writing
uprobes: Teach handle_swbp() to rely on "is_swbp" rather than uprobes_srcu
uprobes: Kill uprobes_srcu/uprobe_srcu_id
uprobes: Valid_vma() should reject VM_HUGETLB
uprobes: __copy_insn() should ensure a_ops->readpage != NULL
uprobes: Write_opcode()->__replace_page() can race with try_to_unmap()
uprobes: Install_breakpoint() should fail if is_swbp_insn() == T
uprobes: Rework register_for_each_vma() to make it O(n)
uprobes: Change build_map_info() to try kmalloc(GFP_NOWAIT) first
uprobes: Copy_insn() shouldn't depend on mm/vma/vaddr
uprobes: Copy_insn() should not return -ENOMEM if __copy_insn() fails
uprobes: No need to re-check vma_address() in write_opcode()
uprobes: Move BUG_ON(UPROBE_SWBP_INSN_SIZE) from write_opcode() to install_breakpoint()
uprobes: Simplify the usage of uprobe->pending_list
uprobes: Don't use loff_t for the valid virtual address
uprobes: __copy_insn() needs "loff_t offset"
uprobes: Remove the unnecessary initialization in add_utask()
uprobes: Don't recheck vma/f_mapping in write_opcode()
uprobes: __replace_page() should not use page_address_in_vma()
uprobes: Kill write_opcode()->lock_page(new_page)
uprobes: Clean up and document write_opcode()->lock_page(old_page)
uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()
uprobes: Suppress uprobe_munmap() from mmput()
uprobes: Fix overflow in vma_address()/find_active_uprobe()
uprobes: Remove copy_vma()->uprobe_mmap()
uprobes: Remove insert_vm_struct()->uprobe_mmap()
uprobes: Teach build_probe_list() to consider the range
uprobes: Introduce vaddr_to_offset(vma, vaddr)
uprobes: Fix register_for_each_vma()->vma_address() check
uprobes: Rename vma_address() and make it return "unsigned long"
uprobes: __replace_page() needs munlock_vma_page()
uprobes: Fix mmap_region()'s mm->mm_rb corruption if uprobe_mmap() fails
uprobes: Kill uprobes_state->count
uprobes: Kill dup_mmap()->uprobe_mmap(), simplify uprobe_mmap/munmap
uprobes: Change uprobe_mmap() to ignore the errors but check fatal_signal_pending()
uprobes: Do not use -EEXIST in install_breakpoint() paths
uprobes: Introduce MMF_HAS_UPROBES
uprobes: Fold uprobe_reset_state() into uprobe_dup_mmap()
uprobes: Remove "verify" argument from set_orig_insn()
uprobes: uprobes_treelock should not disable irqs
uprobes: Introduce MMF_RECALC_UPROBES
uprobes: Teach find_active_uprobe() to clear MMF_HAS_UPROBES
ptrace/x86: Introduce set_task_blockstep() helper
ptrace/x86: Partly fix set_task_blockstep()->update_debugctlmsr() logic
uprobes/x86: Do not (ab)use TIF_SINGLESTEP/user_*_single_step() for single-stepping
uprobes/x86: Xol should send SIGTRAP if X86_EFLAGS_TF was set
uprobes/x86: Fix arch_uprobe_disable_step() && UTASK_SSTEP_TRAPPED interaction
uprobes: Make arch_uprobe_task->saved_trap_nr "unsigned int"
Peter Zijlstra (1):
uprobes: Document uprobe_register() vs uprobe_mmap() race
Sebastian Andrzej Siewior (4):
uprobes: Remove check for uprobe variable in handle_swbp()
uprobes: Don't put NULL pointer in uprobe_register()
uprobes: Introduce arch_uprobe_enable/disable_step()
uprobes/x86: Implement x86 specific arch_uprobe_*_step
Srikar Dronamraju (1):
uprobes: Remove redundant lock_page/unlock_page
Signed-off-by: Anton Arapov <anton(a)redhat.com>
---
arch/x86/include/asm/processor.h | 2 +
arch/x86/include/asm/uprobes.h | 5 +-
arch/x86/kernel/step.c | 53 ++-
arch/x86/kernel/uprobes.c | 55 ++-
include/linux/sched.h | 4 +-
include/linux/uprobes.h | 15 +-
kernel/events/uprobes.c | 819 +++++++++++++++++++--------------------
kernel/fork.c | 6 +-
kernel/ptrace.c | 6 +
mm/mmap.c | 11 +-
10 files changed, 505 insertions(+), 471 deletions(-)
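As background for Josh Stone's export patch: with uprobe_register() and
uprobe_unregister() exported, a module can place probes roughly like this
(a sketch only; looking up the target inode and file offset is elided, and
the ex_* names are mine):

#include <linux/module.h>
#include <linux/sched.h>
#include <linux/uprobes.h>

static int ex_handler(struct uprobe_consumer *self, struct pt_regs *regs)
{
	pr_info("uprobe hit in %s (pid %d)\n", current->comm, current->pid);
	return 0;
}

static struct uprobe_consumer ex_consumer = {
	.handler = ex_handler,
};

/* @inode: target file's inode, @offset: file offset of the instruction */
static int ex_arm(struct inode *inode, loff_t offset)
{
	return uprobe_register(inode, offset, &ex_consumer);
}

static void ex_disarm(struct inode *inode, loff_t offset)
{
	uprobe_unregister(inode, offset, &ex_consumer);
}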
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 39bc577..2f9f8ca 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -746,6 +746,8 @@ static inline void update_debugctlmsr(unsigned long debugctlmsr)
wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctlmsr);
}
+extern void set_task_blockstep(struct task_struct *task, bool on);
+
/*
* from system description table in BIOS. Mostly for MCA use, but
* others may find it useful:
diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 1e9bed1..8ff8be7 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -42,13 +42,14 @@ struct arch_uprobe {
};
struct arch_uprobe_task {
- unsigned long saved_trap_nr;
#ifdef CONFIG_X86_64
unsigned long saved_scratch_register;
#endif
+ unsigned int saved_trap_nr;
+ unsigned int saved_tf;
};
-extern int arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm);
+extern int arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long addr);
extern int arch_uprobe_pre_xol(struct arch_uprobe *aup, struct pt_regs *regs);
extern int arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs);
extern bool arch_uprobe_xol_was_trapped(struct task_struct *tsk);
diff --git a/arch/x86/kernel/step.c b/arch/x86/kernel/step.c
index c346d11..cd3b243 100644
--- a/arch/x86/kernel/step.c
+++ b/arch/x86/kernel/step.c
@@ -157,6 +157,33 @@ static int enable_single_step(struct task_struct *child)
return 1;
}
+void set_task_blockstep(struct task_struct *task, bool on)
+{
+ unsigned long debugctl;
+
+ /*
+ * Ensure irq/preemption can't change debugctl in between.
+ * Note also that both TIF_BLOCKSTEP and debugctl should
+ * be changed atomically wrt preemption.
+ * FIXME: this means that set/clear TIF_BLOCKSTEP is simply
+ * wrong if task != current, SIGKILL can wakeup the stopped
+ * tracee and set/clear can play with the running task, this
+ * can confuse the next __switch_to_xtra().
+ */
+ local_irq_disable();
+ debugctl = get_debugctlmsr();
+ if (on) {
+ debugctl |= DEBUGCTLMSR_BTF;
+ set_tsk_thread_flag(task, TIF_BLOCKSTEP);
+ } else {
+ debugctl &= ~DEBUGCTLMSR_BTF;
+ clear_tsk_thread_flag(task, TIF_BLOCKSTEP);
+ }
+ if (task == current)
+ update_debugctlmsr(debugctl);
+ local_irq_enable();
+}
+
/*
* Enable single or block step.
*/
@@ -169,19 +196,10 @@ static void enable_step(struct task_struct *child, bool block)
* So no one should try to use debugger block stepping in a program
* that uses user-mode single stepping itself.
*/
- if (enable_single_step(child) && block) {
- unsigned long debugctl = get_debugctlmsr();
-
- debugctl |= DEBUGCTLMSR_BTF;
- update_debugctlmsr(debugctl);
- set_tsk_thread_flag(child, TIF_BLOCKSTEP);
- } else if (test_tsk_thread_flag(child, TIF_BLOCKSTEP)) {
- unsigned long debugctl = get_debugctlmsr();
-
- debugctl &= ~DEBUGCTLMSR_BTF;
- update_debugctlmsr(debugctl);
- clear_tsk_thread_flag(child, TIF_BLOCKSTEP);
- }
+ if (enable_single_step(child) && block)
+ set_task_blockstep(child, true);
+ else if (test_tsk_thread_flag(child, TIF_BLOCKSTEP))
+ set_task_blockstep(child, false);
}
void user_enable_single_step(struct task_struct *child)
@@ -199,13 +217,8 @@ void user_disable_single_step(struct task_struct *child)
/*
* Make sure block stepping (BTF) is disabled.
*/
- if (test_tsk_thread_flag(child, TIF_BLOCKSTEP)) {
- unsigned long debugctl = get_debugctlmsr();
-
- debugctl &= ~DEBUGCTLMSR_BTF;
- update_debugctlmsr(debugctl);
- clear_tsk_thread_flag(child, TIF_BLOCKSTEP);
- }
+ if (test_tsk_thread_flag(child, TIF_BLOCKSTEP))
+ set_task_blockstep(child, false);
/* Always clear TIF_SINGLESTEP... */
clear_tsk_thread_flag(child, TIF_SINGLESTEP);
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index dc4e910..9538f00 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -41,6 +41,9 @@
/* Adjust the return address of a call insn */
#define UPROBE_FIX_CALL 0x2
+/* Instruction will modify TF, don't change it */
+#define UPROBE_FIX_SETF 0x4
+
#define UPROBE_FIX_RIP_AX 0x8000
#define UPROBE_FIX_RIP_CX 0x4000
@@ -239,6 +242,10 @@ static void prepare_fixups(struct arch_uprobe *auprobe, struct insn *insn)
insn_get_opcode(insn); /* should be a nop */
switch (OPCODE1(insn)) {
+ case 0x9d:
+ /* popf */
+ auprobe->fixups |= UPROBE_FIX_SETF;
+ break;
case 0xc3: /* ret/lret */
case 0xcb:
case 0xc2:
@@ -409,9 +416,10 @@ static int validate_insn_bits(struct arch_uprobe *auprobe, struct mm_struct *mm,
* arch_uprobe_analyze_insn - instruction analysis including validity and fixups.
* @mm: the probed address space.
* @arch_uprobe: the probepoint information.
+ * @addr: virtual address at which to install the probepoint
* Return 0 on success or a -ve number on error.
*/
-int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm)
+int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long addr)
{
int ret;
struct insn insn;
@@ -645,7 +653,7 @@ void arch_uprobe_abort_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
* Skip these instructions as per the currently known x86 ISA.
* 0x66* { 0x90 | 0x0f 0x1f | 0x0f 0x19 | 0x87 0xc0 }
*/
-bool arch_uprobe_skip_sstep(struct arch_uprobe *auprobe, struct pt_regs *regs)
+static bool __skip_sstep(struct arch_uprobe *auprobe, struct pt_regs *regs)
{
int i;
@@ -672,3 +680,46 @@ bool arch_uprobe_skip_sstep(struct arch_uprobe *auprobe, struct pt_regs *regs)
}
return false;
}
+
+bool arch_uprobe_skip_sstep(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+ bool ret = __skip_sstep(auprobe, regs);
+ if (ret && (regs->flags & X86_EFLAGS_TF))
+ send_sig(SIGTRAP, current, 0);
+ return ret;
+}
+
+void arch_uprobe_enable_step(struct arch_uprobe *auprobe)
+{
+ struct task_struct *task = current;
+ struct arch_uprobe_task *autask = &task->utask->autask;
+ struct pt_regs *regs = task_pt_regs(task);
+
+ autask->saved_tf = !!(regs->flags & X86_EFLAGS_TF);
+
+ regs->flags |= X86_EFLAGS_TF;
+ if (test_tsk_thread_flag(task, TIF_BLOCKSTEP))
+ set_task_blockstep(task, false);
+}
+
+void arch_uprobe_disable_step(struct arch_uprobe *auprobe)
+{
+ struct task_struct *task = current;
+ struct arch_uprobe_task *autask = &task->utask->autask;
+ bool trapped = (task->utask->state == UTASK_SSTEP_TRAPPED);
+ struct pt_regs *regs = task_pt_regs(task);
+ /*
+ * The state of TIF_BLOCKSTEP was not saved so we can get an extra
+ * SIGTRAP if we do not clear TF. We need to examine the opcode to
+ * make it right.
+ */
+ if (unlikely(trapped)) {
+ if (!autask->saved_tf)
+ regs->flags &= ~X86_EFLAGS_TF;
+ } else {
+ if (autask->saved_tf)
+ send_sig(SIGTRAP, task, 0);
+ else if (!(auprobe->fixups & UPROBE_FIX_SETF))
+ regs->flags &= ~X86_EFLAGS_TF;
+ }
+}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4a1f493..864054f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -441,6 +441,9 @@ extern int get_dumpable(struct mm_struct *mm);
#define MMF_VM_HUGEPAGE 17 /* set when VM_HUGEPAGE is set on vma */
#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */
+#define MMF_HAS_UPROBES 19 /* has uprobes */
+#define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */
+
#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
struct sighand_struct {
@@ -1581,7 +1584,6 @@ struct task_struct {
#endif
#ifdef CONFIG_UPROBES
struct uprobe_task *utask;
- int uprobe_srcu_id;
#endif
};
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index efe4b33..e6f0331 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -99,25 +99,27 @@ struct xol_area {
struct uprobes_state {
struct xol_area *xol_area;
- atomic_t count;
};
+
extern int __weak set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
-extern int __weak set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr, bool verify);
+extern int __weak set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
extern bool __weak is_swbp_insn(uprobe_opcode_t *insn);
extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
extern int uprobe_mmap(struct vm_area_struct *vma);
extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
+extern void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm);
extern void uprobe_free_utask(struct task_struct *t);
extern void uprobe_copy_process(struct task_struct *t);
extern unsigned long __weak uprobe_get_swbp_addr(struct pt_regs *regs);
+extern void __weak arch_uprobe_enable_step(struct arch_uprobe *arch);
+extern void __weak arch_uprobe_disable_step(struct arch_uprobe *arch);
extern int uprobe_post_sstep_notifier(struct pt_regs *regs);
extern int uprobe_pre_sstep_notifier(struct pt_regs *regs);
extern void uprobe_notify_resume(struct pt_regs *regs);
extern bool uprobe_deny_signal(void);
extern bool __weak arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs);
extern void uprobe_clear_state(struct mm_struct *mm);
-extern void uprobe_reset_state(struct mm_struct *mm);
#else /* !CONFIG_UPROBES */
struct uprobes_state {
};
@@ -138,6 +140,10 @@ static inline void
uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end)
{
}
+static inline void
+uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm)
+{
+}
static inline void uprobe_notify_resume(struct pt_regs *regs)
{
}
@@ -158,8 +164,5 @@ static inline void uprobe_copy_process(struct task_struct *t)
static inline void uprobe_clear_state(struct mm_struct *mm)
{
}
-static inline void uprobe_reset_state(struct mm_struct *mm)
-{
-}
#endif /* !CONFIG_UPROBES */
#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 985be4d..f8a97e7 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -27,24 +27,42 @@
#include <linux/pagemap.h> /* read_mapping_page */
#include <linux/slab.h>
#include <linux/sched.h>
+#include <linux/export.h>
#include <linux/rmap.h> /* anon_vma_prepare */
#include <linux/mmu_notifier.h> /* set_pte_at_notify */
#include <linux/swap.h> /* try_to_free_swap */
#include <linux/ptrace.h> /* user_enable_single_step */
#include <linux/kdebug.h> /* notifier mechanism */
+#include "../../mm/internal.h" /* munlock_vma_page */
#include <linux/uprobes.h>
#define UINSNS_PER_PAGE (PAGE_SIZE/UPROBE_XOL_SLOT_BYTES)
#define MAX_UPROBE_XOL_SLOTS UINSNS_PER_PAGE
-static struct srcu_struct uprobes_srcu;
static struct rb_root uprobes_tree = RB_ROOT;
static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
#define UPROBES_HASH_SZ 13
+/*
+ * We need separate register/unregister and mmap/munmap lock hashes because
+ * of mmap_sem nesting.
+ *
+ * uprobe_register() needs to install probes on (potentially) all processes
+ * and thus needs to acquire multiple mmap_sems (consequtively, not
+ * concurrently), whereas uprobe_mmap() is called while holding mmap_sem
+ * for the particular process doing the mmap.
+ *
+ * uprobe_register()->register_for_each_vma() needs to drop/acquire mmap_sem
+ * because of lock order against i_mmap_mutex. This means there's a hole in
+ * the register vma iteration where a mmap() can happen.
+ *
+ * Thus uprobe_register() can race with uprobe_mmap() and we can try and
+ * install a probe where one is already installed.
+ */
+
/* serialize (un)register */
static struct mutex uprobes_mutex[UPROBES_HASH_SZ];
@@ -61,17 +79,6 @@ static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ];
*/
static atomic_t uprobe_events = ATOMIC_INIT(0);
-/*
- * Maintain a temporary per vma info that can be used to search if a vma
- * has already been handled. This structure is introduced since extending
- * vm_area_struct wasnt recommended.
- */
-struct vma_info {
- struct list_head probe_list;
- struct mm_struct *mm;
- loff_t vaddr;
-};
-
struct uprobe {
struct rb_node rb_node; /* node in the rb tree */
atomic_t ref;
@@ -100,20 +107,21 @@ static bool valid_vma(struct vm_area_struct *vma, bool is_register)
if (!is_register)
return true;
- if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) == (VM_READ|VM_EXEC))
+ if ((vma->vm_flags & (VM_HUGETLB|VM_READ|VM_WRITE|VM_EXEC|VM_SHARED))
+ == (VM_READ|VM_EXEC))
return true;
return false;
}
-static loff_t vma_address(struct vm_area_struct *vma, loff_t offset)
+static unsigned long offset_to_vaddr(struct vm_area_struct *vma, loff_t offset)
{
- loff_t vaddr;
-
- vaddr = vma->vm_start + offset;
- vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+ return vma->vm_start + offset - ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
+}
- return vaddr;
+static loff_t vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
+{
+ return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start);
}
/**
@@ -121,41 +129,27 @@ static loff_t vma_address(struct vm_area_struct *vma, loff_t offset)
* based on replace_page in mm/ksm.c
*
* @vma: vma that holds the pte pointing to page
+ * @addr: address the old @page is mapped at
* @page: the cowed page we are replacing by kpage
* @kpage: the modified page we replace page by
*
* Returns 0 on success, -EFAULT on failure.
*/
-static int __replace_page(struct vm_area_struct *vma, struct page *page, struct page *kpage)
+static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
+ struct page *page, struct page *kpage)
{
struct mm_struct *mm = vma->vm_mm;
- pgd_t *pgd;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *ptep;
spinlock_t *ptl;
- unsigned long addr;
- int err = -EFAULT;
-
- addr = page_address_in_vma(page, vma);
- if (addr == -EFAULT)
- goto out;
-
- pgd = pgd_offset(mm, addr);
- if (!pgd_present(*pgd))
- goto out;
-
- pud = pud_offset(pgd, addr);
- if (!pud_present(*pud))
- goto out;
+ pte_t *ptep;
+ int err;
- pmd = pmd_offset(pud, addr);
- if (!pmd_present(*pmd))
- goto out;
+ /* For try_to_free_swap() and munlock_vma_page() below */
+ lock_page(page);
- ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ err = -EAGAIN;
+ ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
- goto out;
+ goto unlock;
get_page(kpage);
page_add_new_anon_rmap(kpage, vma, addr);
@@ -172,11 +166,15 @@ static int __replace_page(struct vm_area_struct *vma, struct page *page, struct
page_remove_rmap(page);
if (!page_mapped(page))
try_to_free_swap(page);
- put_page(page);
pte_unmap_unlock(ptep, ptl);
- err = 0;
-out:
+ if (vma->vm_flags & VM_LOCKED)
+ munlock_vma_page(page);
+ put_page(page);
+
+ err = 0;
+ unlock:
+ unlock_page(page);
return err;
}
@@ -218,79 +216,46 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
unsigned long vaddr, uprobe_opcode_t opcode)
{
struct page *old_page, *new_page;
- struct address_space *mapping;
void *vaddr_old, *vaddr_new;
struct vm_area_struct *vma;
- struct uprobe *uprobe;
- loff_t addr;
int ret;
+retry:
/* Read the page with vaddr into memory */
ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &old_page, &vma);
if (ret <= 0)
return ret;
- ret = -EINVAL;
-
- /*
- * We are interested in text pages only. Our pages of interest
- * should be mapped for read and execute only. We desist from
- * adding probes in write mapped pages since the breakpoints
- * might end up in the file copy.
- */
- if (!valid_vma(vma, is_swbp_insn(&opcode)))
- goto put_out;
-
- uprobe = container_of(auprobe, struct uprobe, arch);
- mapping = uprobe->inode->i_mapping;
- if (mapping != vma->vm_file->f_mapping)
- goto put_out;
-
- addr = vma_address(vma, uprobe->offset);
- if (vaddr != (unsigned long)addr)
- goto put_out;
-
ret = -ENOMEM;
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
if (!new_page)
- goto put_out;
+ goto put_old;
__SetPageUptodate(new_page);
- /*
- * lock page will serialize against do_wp_page()'s
- * PageAnon() handling
- */
- lock_page(old_page);
/* copy the page now that we've got it stable */
vaddr_old = kmap_atomic(old_page);
vaddr_new = kmap_atomic(new_page);
memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
-
- /* poke the new insn in, ASSUMES we don't cross page boundary */
- vaddr &= ~PAGE_MASK;
- BUG_ON(vaddr + UPROBE_SWBP_INSN_SIZE > PAGE_SIZE);
- memcpy(vaddr_new + vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
+ memcpy(vaddr_new + (vaddr & ~PAGE_MASK), &opcode, UPROBE_SWBP_INSN_SIZE);
kunmap_atomic(vaddr_new);
kunmap_atomic(vaddr_old);
ret = anon_vma_prepare(vma);
if (ret)
- goto unlock_out;
+ goto put_new;
- lock_page(new_page);
- ret = __replace_page(vma, old_page, new_page);
- unlock_page(new_page);
+ ret = __replace_page(vma, vaddr, old_page, new_page);
-unlock_out:
- unlock_page(old_page);
+put_new:
page_cache_release(new_page);
-
-put_out:
+put_old:
put_page(old_page);
+ if (unlikely(ret == -EAGAIN))
+ goto retry;
return ret;
}
@@ -312,16 +277,14 @@ static int read_opcode(struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_
void *vaddr_new;
int ret;
- ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &page, NULL);
+ ret = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
if (ret <= 0)
return ret;
- lock_page(page);
vaddr_new = kmap_atomic(page);
vaddr &= ~PAGE_MASK;
memcpy(opcode, vaddr_new + vaddr, UPROBE_SWBP_INSN_SIZE);
kunmap_atomic(vaddr_new);
- unlock_page(page);
put_page(page);
@@ -333,10 +296,20 @@ static int is_swbp_at_addr(struct mm_struct *mm, unsigned long vaddr)
uprobe_opcode_t opcode;
int result;
+ if (current->mm == mm) {
+ pagefault_disable();
+ result = __copy_from_user_inatomic(&opcode, (void __user*)vaddr,
+ sizeof(opcode));
+ pagefault_enable();
+
+ if (likely(result == 0))
+ goto out;
+ }
+
result = read_opcode(mm, vaddr, &opcode);
if (result)
return result;
-
+out:
if (is_swbp_insn(&opcode))
return 1;
@@ -355,10 +328,12 @@ static int is_swbp_at_addr(struct mm_struct *mm, unsigned long vaddr)
int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
{
int result;
-
+ /*
+ * See the comment near uprobes_hash().
+ */
result = is_swbp_at_addr(mm, vaddr);
if (result == 1)
- return -EEXIST;
+ return 0;
if (result)
return result;
@@ -371,24 +346,22 @@ int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned
* @mm: the probed process address space.
* @auprobe: arch specific probepoint information.
* @vaddr: the virtual address to insert the opcode.
- * @verify: if true, verify existance of breakpoint instruction.
*
* For mm @mm, restore the original opcode (opcode) at @vaddr.
* Return 0 (success) or a negative errno.
*/
int __weak
-set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, bool verify)
+set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
{
- if (verify) {
- int result;
+ int result;
- result = is_swbp_at_addr(mm, vaddr);
- if (!result)
- return -EINVAL;
+ result = is_swbp_at_addr(mm, vaddr);
+ if (!result)
+ return -EINVAL;
+
+ if (result != 1)
+ return result;
- if (result != 1)
- return result;
- }
return write_opcode(auprobe, mm, vaddr, *(uprobe_opcode_t *)auprobe->insn);
}
@@ -439,11 +412,10 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
{
struct uprobe *uprobe;
- unsigned long flags;
- spin_lock_irqsave(&uprobes_treelock, flags);
+ spin_lock(&uprobes_treelock);
uprobe = __find_uprobe(inode, offset);
- spin_unlock_irqrestore(&uprobes_treelock, flags);
+ spin_unlock(&uprobes_treelock);
return uprobe;
}
@@ -490,12 +462,11 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
*/
static struct uprobe *insert_uprobe(struct uprobe *uprobe)
{
- unsigned long flags;
struct uprobe *u;
- spin_lock_irqsave(&uprobes_treelock, flags);
+ spin_lock(&uprobes_treelock);
u = __insert_uprobe(uprobe);
- spin_unlock_irqrestore(&uprobes_treelock, flags);
+ spin_unlock(&uprobes_treelock);
/* For now assume that the instruction need not be single-stepped */
uprobe->flags |= UPROBE_SKIP_SSTEP;
@@ -520,7 +491,6 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
uprobe->inode = igrab(inode);
uprobe->offset = offset;
init_rwsem(&uprobe->consumer_rwsem);
- INIT_LIST_HEAD(&uprobe->pending_list);
/* add to uprobes_tree, sorted on inode:offset */
cur_uprobe = insert_uprobe(uprobe);
@@ -588,20 +558,22 @@ static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
}
static int
-__copy_insn(struct address_space *mapping, struct vm_area_struct *vma, char *insn,
- unsigned long nbytes, unsigned long offset)
+__copy_insn(struct address_space *mapping, struct file *filp, char *insn,
+ unsigned long nbytes, loff_t offset)
{
- struct file *filp = vma->vm_file;
struct page *page;
void *vaddr;
- unsigned long off1;
- unsigned long idx;
+ unsigned long off;
+ pgoff_t idx;
if (!filp)
return -EINVAL;
- idx = (unsigned long)(offset >> PAGE_CACHE_SHIFT);
- off1 = offset &= ~PAGE_MASK;
+ if (!mapping->a_ops->readpage)
+ return -EIO;
+
+ idx = offset >> PAGE_CACHE_SHIFT;
+ off = offset & ~PAGE_MASK;
/*
* Ensure that the page that has the original instruction is
@@ -612,22 +584,20 @@ __copy_insn(struct address_space *mapping, struct vm_area_struct *vma, char *ins
return PTR_ERR(page);
vaddr = kmap_atomic(page);
- memcpy(insn, vaddr + off1, nbytes);
+ memcpy(insn, vaddr + off, nbytes);
kunmap_atomic(vaddr);
page_cache_release(page);
return 0;
}
-static int
-copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma, unsigned long addr)
+static int copy_insn(struct uprobe *uprobe, struct file *filp)
{
struct address_space *mapping;
unsigned long nbytes;
int bytes;
- addr &= ~PAGE_MASK;
- nbytes = PAGE_SIZE - addr;
+ nbytes = PAGE_SIZE - (uprobe->offset & ~PAGE_MASK);
mapping = uprobe->inode->i_mapping;
/* Instruction at end of binary; copy only available bytes */
@@ -638,13 +608,13 @@ copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma, unsigned long addr)
/* Instruction at the page-boundary; copy bytes in second page */
if (nbytes < bytes) {
- if (__copy_insn(mapping, vma, uprobe->arch.insn + nbytes,
- bytes - nbytes, uprobe->offset + nbytes))
- return -ENOMEM;
-
+ int err = __copy_insn(mapping, filp, uprobe->arch.insn + nbytes,
+ bytes - nbytes, uprobe->offset + nbytes);
+ if (err)
+ return err;
bytes = nbytes;
}
- return __copy_insn(mapping, vma, uprobe->arch.insn, bytes, uprobe->offset);
+ return __copy_insn(mapping, filp, uprobe->arch.insn, bytes, uprobe->offset);
}
/*
@@ -672,9 +642,9 @@ copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma, unsigned long addr)
*/
static int
install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm,
- struct vm_area_struct *vma, loff_t vaddr)
+ struct vm_area_struct *vma, unsigned long vaddr)
{
- unsigned long addr;
+ bool first_uprobe;
int ret;
/*
@@ -685,204 +655,194 @@ install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm,
* Hence behave as if probe already existed.
*/
if (!uprobe->consumers)
- return -EEXIST;
-
- addr = (unsigned long)vaddr;
+ return 0;
if (!(uprobe->flags & UPROBE_COPY_INSN)) {
- ret = copy_insn(uprobe, vma, addr);
+ ret = copy_insn(uprobe, vma->vm_file);
if (ret)
return ret;
if (is_swbp_insn((uprobe_opcode_t *)uprobe->arch.insn))
- return -EEXIST;
+ return -ENOTSUPP;
- ret = arch_uprobe_analyze_insn(&uprobe->arch, mm);
+ ret = arch_uprobe_analyze_insn(&uprobe->arch, mm, vaddr);
if (ret)
return ret;
+ /* write_opcode() assumes we don't cross page boundary */
+ BUG_ON((uprobe->offset & ~PAGE_MASK) +
+ UPROBE_SWBP_INSN_SIZE > PAGE_SIZE);
+
uprobe->flags |= UPROBE_COPY_INSN;
}
/*
- * Ideally, should be updating the probe count after the breakpoint
- * has been successfully inserted. However a thread could hit the
- * breakpoint we just inserted even before the probe count is
- * incremented. If this is the first breakpoint placed, breakpoint
- * notifier might ignore uprobes and pass the trap to the thread.
- * Hence increment before and decrement on failure.
+ * set MMF_HAS_UPROBES in advance for uprobe_pre_sstep_notifier(),
+ * the task can hit this breakpoint right after __replace_page().
*/
- atomic_inc(&mm->uprobes_state.count);
- ret = set_swbp(&uprobe->arch, mm, addr);
- if (ret)
- atomic_dec(&mm->uprobes_state.count);
+ first_uprobe = !test_bit(MMF_HAS_UPROBES, &mm->flags);
+ if (first_uprobe)
+ set_bit(MMF_HAS_UPROBES, &mm->flags);
+
+ ret = set_swbp(&uprobe->arch, mm, vaddr);
+ if (!ret)
+ clear_bit(MMF_RECALC_UPROBES, &mm->flags);
+ else if (first_uprobe)
+ clear_bit(MMF_HAS_UPROBES, &mm->flags);
return ret;
}
static void
-remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, loff_t vaddr)
+remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vaddr)
{
- if (!set_orig_insn(&uprobe->arch, mm, (unsigned long)vaddr, true))
- atomic_dec(&mm->uprobes_state.count);
+ /* can happen if uprobe_register() fails */
+ if (!test_bit(MMF_HAS_UPROBES, &mm->flags))
+ return;
+
+ set_bit(MMF_RECALC_UPROBES, &mm->flags);
+ set_orig_insn(&uprobe->arch, mm, vaddr);
}
/*
- * There could be threads that have hit the breakpoint and are entering the
- * notifier code and trying to acquire the uprobes_treelock. The thread
- * calling delete_uprobe() that is removing the uprobe from the rb_tree can
- * race with these threads and might acquire the uprobes_treelock compared
- * to some of the breakpoint hit threads. In such a case, the breakpoint
- * hit threads will not find the uprobe. The current unregistering thread
- * waits till all other threads have hit a breakpoint, to acquire the
- * uprobes_treelock before the uprobe is removed from the rbtree.
+ * There could be threads that have already hit the breakpoint. They
+ * will recheck the current insn and restart if find_uprobe() fails.
+ * See find_active_uprobe().
*/
static void delete_uprobe(struct uprobe *uprobe)
{
- unsigned long flags;
-
- synchronize_srcu(&uprobes_srcu);
- spin_lock_irqsave(&uprobes_treelock, flags);
+ spin_lock(&uprobes_treelock);
rb_erase(&uprobe->rb_node, &uprobes_tree);
- spin_unlock_irqrestore(&uprobes_treelock, flags);
+ spin_unlock(&uprobes_treelock);
iput(uprobe->inode);
put_uprobe(uprobe);
atomic_dec(&uprobe_events);
}
-static struct vma_info *
-__find_next_vma_info(struct address_space *mapping, struct list_head *head,
- struct vma_info *vi, loff_t offset, bool is_register)
+struct map_info {
+ struct map_info *next;
+ struct mm_struct *mm;
+ unsigned long vaddr;
+};
+
+static inline struct map_info *free_map_info(struct map_info *info)
+{
+ struct map_info *next = info->next;
+ kfree(info);
+ return next;
+}
+
+static struct map_info *
+build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
{
+ unsigned long pgoff = offset >> PAGE_SHIFT;
struct prio_tree_iter iter;
struct vm_area_struct *vma;
- struct vma_info *tmpvi;
- unsigned long pgoff;
- int existing_vma;
- loff_t vaddr;
-
- pgoff = offset >> PAGE_SHIFT;
+ struct map_info *curr = NULL;
+ struct map_info *prev = NULL;
+ struct map_info *info;
+ int more = 0;
+ again:
+ mutex_lock(&mapping->i_mmap_mutex);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
if (!valid_vma(vma, is_register))
continue;
- existing_vma = 0;
- vaddr = vma_address(vma, offset);
-
- list_for_each_entry(tmpvi, head, probe_list) {
- if (tmpvi->mm == vma->vm_mm && tmpvi->vaddr == vaddr) {
- existing_vma = 1;
- break;
- }
+ if (!prev && !more) {
+ /*
+ * Needs GFP_NOWAIT to avoid i_mmap_mutex recursion through
+ * reclaim. This is optimistic, no harm done if it fails.
+ */
+ prev = kmalloc(sizeof(struct map_info),
+ GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ if (prev)
+ prev->next = NULL;
}
-
- /*
- * Another vma needs a probe to be installed. However skip
- * installing the probe if the vma is about to be unlinked.
- */
- if (!existing_vma && atomic_inc_not_zero(&vma->vm_mm->mm_users)) {
- vi->mm = vma->vm_mm;
- vi->vaddr = vaddr;
- list_add(&vi->probe_list, head);
-
- return vi;
+ if (!prev) {
+ more++;
+ continue;
}
- }
-
- return NULL;
-}
-/*
- * Iterate in the rmap prio tree and find a vma where a probe has not
- * yet been inserted.
- */
-static struct vma_info *
-find_next_vma_info(struct address_space *mapping, struct list_head *head,
- loff_t offset, bool is_register)
-{
- struct vma_info *vi, *retvi;
+ if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
+ continue;
- vi = kzalloc(sizeof(struct vma_info), GFP_KERNEL);
- if (!vi)
- return ERR_PTR(-ENOMEM);
+ info = prev;
+ prev = prev->next;
+ info->next = curr;
+ curr = info;
- mutex_lock(&mapping->i_mmap_mutex);
- retvi = __find_next_vma_info(mapping, head, vi, offset, is_register);
+ info->mm = vma->vm_mm;
+ info->vaddr = offset_to_vaddr(vma, offset);
+ }
mutex_unlock(&mapping->i_mmap_mutex);
- if (!retvi)
- kfree(vi);
+ if (!more)
+ goto out;
+
+ prev = curr;
+ while (curr) {
+ mmput(curr->mm);
+ curr = curr->next;
+ }
- return retvi;
+ do {
+ info = kmalloc(sizeof(struct map_info), GFP_KERNEL);
+ if (!info) {
+ curr = ERR_PTR(-ENOMEM);
+ goto out;
+ }
+ info->next = prev;
+ prev = info;
+ } while (--more);
+
+ goto again;
+ out:
+ while (prev)
+ prev = free_map_info(prev);
+ return curr;
}
static int register_for_each_vma(struct uprobe *uprobe, bool is_register)
{
- struct list_head try_list;
- struct vm_area_struct *vma;
- struct address_space *mapping;
- struct vma_info *vi, *tmpvi;
- struct mm_struct *mm;
- loff_t vaddr;
- int ret;
+ struct map_info *info;
+ int err = 0;
- mapping = uprobe->inode->i_mapping;
- INIT_LIST_HEAD(&try_list);
+ info = build_map_info(uprobe->inode->i_mapping,
+ uprobe->offset, is_register);
+ if (IS_ERR(info))
+ return PTR_ERR(info);
- ret = 0;
+ while (info) {
+ struct mm_struct *mm = info->mm;
+ struct vm_area_struct *vma;
- for (;;) {
- vi = find_next_vma_info(mapping, &try_list, uprobe->offset, is_register);
- if (!vi)
- break;
+ if (err)
+ goto free;
- if (IS_ERR(vi)) {
- ret = PTR_ERR(vi);
- break;
- }
+ down_write(&mm->mmap_sem);
+ vma = find_vma(mm, info->vaddr);
+ if (!vma || !valid_vma(vma, is_register) ||
+ vma->vm_file->f_mapping->host != uprobe->inode)
+ goto unlock;
- mm = vi->mm;
- down_read(&mm->mmap_sem);
- vma = find_vma(mm, (unsigned long)vi->vaddr);
- if (!vma || !valid_vma(vma, is_register)) {
- list_del(&vi->probe_list);
- kfree(vi);
- up_read(&mm->mmap_sem);
- mmput(mm);
- continue;
- }
- vaddr = vma_address(vma, uprobe->offset);
- if (vma->vm_file->f_mapping->host != uprobe->inode ||
- vaddr != vi->vaddr) {
- list_del(&vi->probe_list);
- kfree(vi);
- up_read(&mm->mmap_sem);
- mmput(mm);
- continue;
- }
+ if (vma->vm_start > info->vaddr ||
+ vaddr_to_offset(vma, info->vaddr) != uprobe->offset)
+ goto unlock;
if (is_register)
- ret = install_breakpoint(uprobe, mm, vma, vi->vaddr);
+ err = install_breakpoint(uprobe, mm, vma, info->vaddr);
else
- remove_breakpoint(uprobe, mm, vi->vaddr);
+ remove_breakpoint(uprobe, mm, info->vaddr);
- up_read(&mm->mmap_sem);
+ unlock:
+ up_write(&mm->mmap_sem);
+ free:
mmput(mm);
- if (is_register) {
- if (ret && ret == -EEXIST)
- ret = 0;
- if (ret)
- break;
- }
- }
-
- list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) {
- list_del(&vi->probe_list);
- kfree(vi);
+ info = free_map_info(info);
}
- return ret;
+ return err;
}
static int __uprobe_register(struct uprobe *uprobe)
@@ -941,10 +901,12 @@ int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *
}
mutex_unlock(uprobes_hash(inode));
- put_uprobe(uprobe);
+ if (uprobe)
+ put_uprobe(uprobe);
return ret;
}
+EXPORT_SYMBOL_GPL(uprobe_register);
/*
* uprobe_unregister - unregister a already registered probe.
@@ -976,81 +938,81 @@ void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consume
if (uprobe)
put_uprobe(uprobe);
}
+EXPORT_SYMBOL_GPL(uprobe_unregister);
-/*
- * Of all the nodes that correspond to the given inode, return the node
- * with the least offset.
- */
-static struct rb_node *find_least_offset_node(struct inode *inode)
+static struct rb_node *
+find_node_in_range(struct inode *inode, loff_t min, loff_t max)
{
- struct uprobe u = { .inode = inode, .offset = 0};
struct rb_node *n = uprobes_tree.rb_node;
- struct rb_node *close_node = NULL;
- struct uprobe *uprobe;
- int match;
while (n) {
- uprobe = rb_entry(n, struct uprobe, rb_node);
- match = match_uprobe(&u, uprobe);
-
- if (uprobe->inode == inode)
- close_node = n;
-
- if (!match)
- return close_node;
+ struct uprobe *u = rb_entry(n, struct uprobe, rb_node);
- if (match < 0)
+ if (inode < u->inode) {
n = n->rb_left;
- else
+ } else if (inode > u->inode) {
n = n->rb_right;
+ } else {
+ if (max < u->offset)
+ n = n->rb_left;
+ else if (min > u->offset)
+ n = n->rb_right;
+ else
+ break;
+ }
}
- return close_node;
+ return n;
}
/*
- * For a given inode, build a list of probes that need to be inserted.
+ * For a given range in vma, build a list of probes that need to be inserted.
*/
-static void build_probe_list(struct inode *inode, struct list_head *head)
+static void build_probe_list(struct inode *inode,
+ struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ struct list_head *head)
{
- struct uprobe *uprobe;
- unsigned long flags;
- struct rb_node *n;
-
- spin_lock_irqsave(&uprobes_treelock, flags);
-
- n = find_least_offset_node(inode);
+ loff_t min, max;
+ struct rb_node *n, *t;
+ struct uprobe *u;
- for (; n; n = rb_next(n)) {
- uprobe = rb_entry(n, struct uprobe, rb_node);
- if (uprobe->inode != inode)
- break;
+ INIT_LIST_HEAD(head);
+ min = vaddr_to_offset(vma, start);
+ max = min + (end - start) - 1;
- list_add(&uprobe->pending_list, head);
- atomic_inc(&uprobe->ref);
+ spin_lock(&uprobes_treelock);
+ n = find_node_in_range(inode, min, max);
+ if (n) {
+ for (t = n; t; t = rb_prev(t)) {
+ u = rb_entry(t, struct uprobe, rb_node);
+ if (u->inode != inode || u->offset < min)
+ break;
+ list_add(&u->pending_list, head);
+ atomic_inc(&u->ref);
+ }
+ for (t = n; (t = rb_next(t)); ) {
+ u = rb_entry(t, struct uprobe, rb_node);
+ if (u->inode != inode || u->offset > max)
+ break;
+ list_add(&u->pending_list, head);
+ atomic_inc(&u->ref);
+ }
}
-
- spin_unlock_irqrestore(&uprobes_treelock, flags);
+ spin_unlock(&uprobes_treelock);
}
/*
- * Called from mmap_region.
- * called with mm->mmap_sem acquired.
- *
- * Return -ve no if we fail to insert probes and we cannot
- * bail-out.
- * Return 0 otherwise. i.e:
+ * Called from mmap_region/vma_adjust with mm->mmap_sem acquired.
*
- * - successful insertion of probes
- * - (or) no possible probes to be inserted.
- * - (or) insertion of probes failed but we can bail-out.
+ * Currently we ignore all errors and always return 0, the callers
+ * can't handle the failure anyway.
*/
int uprobe_mmap(struct vm_area_struct *vma)
{
struct list_head tmp_list;
struct uprobe *uprobe, *u;
struct inode *inode;
- int ret, count;
if (!atomic_read(&uprobe_events) || !valid_vma(vma, true))
return 0;
@@ -1059,54 +1021,38 @@ int uprobe_mmap(struct vm_area_struct *vma)
if (!inode)
return 0;
- INIT_LIST_HEAD(&tmp_list);
mutex_lock(uprobes_mmap_hash(inode));
- build_probe_list(inode, &tmp_list);
-
- ret = 0;
- count = 0;
+ build_probe_list(inode, vma, vma->vm_start, vma->vm_end, &tmp_list);
list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
- loff_t vaddr;
-
- list_del(&uprobe->pending_list);
- if (!ret) {
- vaddr = vma_address(vma, uprobe->offset);
-
- if (vaddr < vma->vm_start || vaddr >= vma->vm_end) {
- put_uprobe(uprobe);
- continue;
- }
-
- ret = install_breakpoint(uprobe, vma->vm_mm, vma, vaddr);
-
- /* Ignore double add: */
- if (ret == -EEXIST) {
- ret = 0;
-
- if (!is_swbp_at_addr(vma->vm_mm, vaddr))
- continue;
-
- /*
- * Unable to insert a breakpoint, but
- * breakpoint lies underneath. Increment the
- * probe count.
- */
- atomic_inc(&vma->vm_mm->uprobes_state.count);
- }
-
- if (!ret)
- count++;
+ if (!fatal_signal_pending(current)) {
+ unsigned long vaddr = offset_to_vaddr(vma, uprobe->offset);
+ install_breakpoint(uprobe, vma->vm_mm, vma, vaddr);
}
put_uprobe(uprobe);
}
-
mutex_unlock(uprobes_mmap_hash(inode));
- if (ret)
- atomic_sub(count, &vma->vm_mm->uprobes_state.count);
+ return 0;
+}
- return ret;
+static bool
+vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+ loff_t min, max;
+ struct inode *inode;
+ struct rb_node *n;
+
+ inode = vma->vm_file->f_mapping->host;
+
+ min = vaddr_to_offset(vma, start);
+ max = min + (end - start) - 1;
+
+ spin_lock(&uprobes_treelock);
+ n = find_node_in_range(inode, min, max);
+ spin_unlock(&uprobes_treelock);
+
+ return !!n;
}
/*
@@ -1114,41 +1060,18 @@ int uprobe_mmap(struct vm_area_struct *vma)
*/
void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end)
{
- struct list_head tmp_list;
- struct uprobe *uprobe, *u;
- struct inode *inode;
-
if (!atomic_read(&uprobe_events) || !valid_vma(vma, false))
return;
- if (!atomic_read(&vma->vm_mm->uprobes_state.count))
+ if (!atomic_read(&vma->vm_mm->mm_users)) /* called by mmput() ? */
return;
- inode = vma->vm_file->f_mapping->host;
- if (!inode)
+ if (!test_bit(MMF_HAS_UPROBES, &vma->vm_mm->flags) ||
+ test_bit(MMF_RECALC_UPROBES, &vma->vm_mm->flags))
return;
- INIT_LIST_HEAD(&tmp_list);
- mutex_lock(uprobes_mmap_hash(inode));
- build_probe_list(inode, &tmp_list);
-
- list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
- loff_t vaddr;
-
- list_del(&uprobe->pending_list);
- vaddr = vma_address(vma, uprobe->offset);
-
- if (vaddr >= start && vaddr < end) {
- /*
- * An unregister could have removed the probe before
- * unmap. So check before we decrement the count.
- */
- if (is_swbp_at_addr(vma->vm_mm, vaddr) == 1)
- atomic_dec(&vma->vm_mm->uprobes_state.count);
- }
- put_uprobe(uprobe);
- }
- mutex_unlock(uprobes_mmap_hash(inode));
+ if (vma_has_uprobes(vma, start, end))
+ set_bit(MMF_RECALC_UPROBES, &vma->vm_mm->flags);
}
/* Slot allocation for XOL */
@@ -1250,13 +1173,15 @@ void uprobe_clear_state(struct mm_struct *mm)
kfree(area);
}
-/*
- * uprobe_reset_state - Free the area allocated for slots.
- */
-void uprobe_reset_state(struct mm_struct *mm)
+void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm)
{
- mm->uprobes_state.xol_area = NULL;
- atomic_set(&mm->uprobes_state.count, 0);
+ newmm->uprobes_state.xol_area = NULL;
+
+ if (test_bit(MMF_HAS_UPROBES, &oldmm->flags)) {
+ set_bit(MMF_HAS_UPROBES, &newmm->flags);
+ /* unconditionally, dup_mmap() skips VM_DONTCOPY vmas */
+ set_bit(MMF_RECALC_UPROBES, &newmm->flags);
+ }
}
/*
@@ -1378,9 +1303,6 @@ void uprobe_free_utask(struct task_struct *t)
{
struct uprobe_task *utask = t->utask;
- if (t->uprobe_srcu_id != -1)
- srcu_read_unlock_raw(&uprobes_srcu, t->uprobe_srcu_id);
-
if (!utask)
return;
@@ -1398,7 +1320,6 @@ void uprobe_free_utask(struct task_struct *t)
void uprobe_copy_process(struct task_struct *t)
{
t->utask = NULL;
- t->uprobe_srcu_id = -1;
}
/*
@@ -1417,7 +1338,6 @@ static struct uprobe_task *add_utask(void)
if (unlikely(!utask))
return NULL;
- utask->active_uprobe = NULL;
current->utask = utask;
return utask;
}
@@ -1479,41 +1399,93 @@ static bool can_skip_sstep(struct uprobe *uprobe, struct pt_regs *regs)
return false;
}
+static void mmf_recalc_uprobes(struct mm_struct *mm)
+{
+ struct vm_area_struct *vma;
+
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (!valid_vma(vma, false))
+ continue;
+ /*
+ * This is not strictly accurate, we can race with
+ * uprobe_unregister() and see the already removed
+ * uprobe if delete_uprobe() was not yet called.
+ */
+ if (vma_has_uprobes(vma, vma->vm_start, vma->vm_end))
+ return;
+ }
+
+ clear_bit(MMF_HAS_UPROBES, &mm->flags);
+}
+
+static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
+{
+ struct mm_struct *mm = current->mm;
+ struct uprobe *uprobe = NULL;
+ struct vm_area_struct *vma;
+
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, bp_vaddr);
+ if (vma && vma->vm_start <= bp_vaddr) {
+ if (valid_vma(vma, false)) {
+ struct inode *inode = vma->vm_file->f_mapping->host;
+ loff_t offset = vaddr_to_offset(vma, bp_vaddr);
+
+ uprobe = find_uprobe(inode, offset);
+ }
+
+ if (!uprobe)
+ *is_swbp = is_swbp_at_addr(mm, bp_vaddr);
+ } else {
+ *is_swbp = -EFAULT;
+ }
+
+ if (!uprobe && test_and_clear_bit(MMF_RECALC_UPROBES, &mm->flags))
+ mmf_recalc_uprobes(mm);
+ up_read(&mm->mmap_sem);
+
+ return uprobe;
+}
+
+void __weak arch_uprobe_enable_step(struct arch_uprobe *arch)
+{
+ user_enable_single_step(current);
+}
+
+void __weak arch_uprobe_disable_step(struct arch_uprobe *arch)
+{
+ user_disable_single_step(current);
+}
+
/*
* Run handler and ask thread to singlestep.
* Ensure all non-fatal signals cannot interrupt thread while it singlesteps.
*/
static void handle_swbp(struct pt_regs *regs)
{
- struct vm_area_struct *vma;
struct uprobe_task *utask;
struct uprobe *uprobe;
- struct mm_struct *mm;
unsigned long bp_vaddr;
+ int uninitialized_var(is_swbp);
- uprobe = NULL;
bp_vaddr = uprobe_get_swbp_addr(regs);
- mm = current->mm;
- down_read(&mm->mmap_sem);
- vma = find_vma(mm, bp_vaddr);
-
- if (vma && vma->vm_start <= bp_vaddr && valid_vma(vma, false)) {
- struct inode *inode;
- loff_t offset;
-
- inode = vma->vm_file->f_mapping->host;
- offset = bp_vaddr - vma->vm_start;
- offset += (vma->vm_pgoff << PAGE_SHIFT);
- uprobe = find_uprobe(inode, offset);
- }
-
- srcu_read_unlock_raw(&uprobes_srcu, current->uprobe_srcu_id);
- current->uprobe_srcu_id = -1;
- up_read(&mm->mmap_sem);
+ uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
if (!uprobe) {
- /* No matching uprobe; signal SIGTRAP. */
- send_sig(SIGTRAP, current, 0);
+ if (is_swbp > 0) {
+ /* No matching uprobe; signal SIGTRAP. */
+ send_sig(SIGTRAP, current, 0);
+ } else {
+ /*
+ * Either we raced with uprobe_unregister() or we can't
+ * access this memory. The latter is only possible if
+ * another thread plays with our ->mm. In both cases
+ * we can simply restart. If this vma was unmapped we
+ * can pretend this insn was not executed yet and get
+ * the (correct) SIGSEGV after restart.
+ */
+ instruction_pointer_set(regs, bp_vaddr);
+ }
return;
}
@@ -1531,7 +1503,7 @@ static void handle_swbp(struct pt_regs *regs)
utask->state = UTASK_SSTEP;
if (!pre_ssout(uprobe, regs, bp_vaddr)) {
- user_enable_single_step(current);
+ arch_uprobe_enable_step(&uprobe->arch);
return;
}
@@ -1540,17 +1512,15 @@ cleanup_ret:
utask->active_uprobe = NULL;
utask->state = UTASK_RUNNING;
}
- if (uprobe) {
- if (!(uprobe->flags & UPROBE_SKIP_SSTEP))
+ if (!(uprobe->flags & UPROBE_SKIP_SSTEP))
- /*
- * cannot singlestep; cannot skip instruction;
- * re-execute the instruction.
- */
- instruction_pointer_set(regs, bp_vaddr);
+ /*
+ * cannot singlestep; cannot skip instruction;
+ * re-execute the instruction.
+ */
+ instruction_pointer_set(regs, bp_vaddr);
- put_uprobe(uprobe);
- }
+ put_uprobe(uprobe);
}
/*
@@ -1569,10 +1539,10 @@ static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
else
WARN_ON_ONCE(1);
+ arch_uprobe_disable_step(&uprobe->arch);
put_uprobe(uprobe);
utask->active_uprobe = NULL;
utask->state = UTASK_RUNNING;
- user_disable_single_step(current);
xol_free_insn_slot(current);
spin_lock_irq(&current->sighand->siglock);
@@ -1611,8 +1581,7 @@ int uprobe_pre_sstep_notifier(struct pt_regs *regs)
{
struct uprobe_task *utask;
- if (!current->mm || !atomic_read(&current->mm->uprobes_state.count))
- /* task is currently not uprobed */
+ if (!current->mm || !test_bit(MMF_HAS_UPROBES, &current->mm->flags))
return 0;
utask = current->utask;
@@ -1620,7 +1589,6 @@ int uprobe_pre_sstep_notifier(struct pt_regs *regs)
utask->state = UTASK_BP_HIT;
set_thread_flag(TIF_UPROBE);
- current->uprobe_srcu_id = srcu_read_lock_raw(&uprobes_srcu);
return 1;
}
@@ -1655,7 +1623,6 @@ static int __init init_uprobes(void)
mutex_init(&uprobes_mutex[i]);
mutex_init(&uprobes_mmap_mutex[i]);
}
- init_srcu_struct(&uprobes_srcu);
return register_die_notifier(&uprobe_exception_nb);
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 7a1d634..5d0137c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -355,6 +355,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
down_write(&oldmm->mmap_sem);
flush_cache_dup_mm(oldmm);
+ uprobe_dup_mmap(oldmm, mm);
/*
* Not linked in yet - no deadlock potential:
*/
@@ -458,9 +459,6 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
if (retval)
goto out;
-
- if (file && uprobe_mmap(tmp))
- goto out;
}
/* a new mm has just been created */
arch_dup_mmap(oldmm, mm);
@@ -843,8 +841,6 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
mm->pmd_huge_pte = NULL;
#endif
- uprobe_reset_state(mm);
-
if (!mm_init(mm, tsk))
goto fail_nomem;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index a232bb5..764fcd1 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -33,6 +33,12 @@ static int ptrace_trapping_sleep_fn(void *flags)
}
/*
+ * This is declared in linux/regset.h and defined in machine-dependent
+ * code. We put the export here to ensure no machine forgets it.
+ */
+EXPORT_SYMBOL_GPL(task_user_regset_view);
+
+/*
* ptrace a task: make the debugger its new parent and
* move it to the ptrace list.
*
diff --git a/mm/mmap.c b/mm/mmap.c
index 3edfcdf..f25fd3f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1355,9 +1355,8 @@ out:
} else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
make_pages_present(addr, addr + len);
- if (file && uprobe_mmap(vma))
- /* matching probes but cannot insert */
- goto unmap_and_free_vma;
+ if (file)
+ uprobe_mmap(vma);
return addr;
@@ -2345,9 +2344,6 @@ int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
security_vm_enough_memory_mm(mm, vma_pages(vma)))
return -ENOMEM;
- if (vma->vm_file && uprobe_mmap(vma))
- return -EINVAL;
-
vma_link(mm, vma, prev, rb_link, rb_parent);
return 0;
}
@@ -2418,9 +2414,6 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
if (new_vma->vm_file) {
get_file(new_vma->vm_file);
- if (uprobe_mmap(new_vma))
- goto out_free_mempol;
-
if (vma->vm_flags & VM_EXECUTABLE)
added_exe_file_vma(mm);
}
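For readers skimming the diff: the central change is replacing the per-mm
atomic counter (mm->uprobes_state.count) with two mm flags. MMF_HAS_UPROBES
is a cheap hint set when the first breakpoint goes into an mm, and
MMF_RECALC_UPROBES marks that hint as possibly stale after a removal or
unmap; the next breakpoint hit rescans lazily via mmf_recalc_uprobes().
Below is a minimal standalone userspace sketch of that bookkeeping — not
kernel code, all names invented; the real kernel uses set_bit/test_bit on
mm->flags under the appropriate locking.
/*
 * Sketch of the MMF_HAS_UPROBES / MMF_RECALC_UPROBES bookkeeping.
 */
#include <stdio.h>
#define MMF_HAS_UPROBES    (1UL << 0)
#define MMF_RECALC_UPROBES (1UL << 1)
struct mm_sketch {
	unsigned long flags;
	int live_probes;	/* stand-in for "some vma still has a uprobe" */
};
static void install_probe(struct mm_sketch *mm)
{
	mm->flags |= MMF_HAS_UPROBES;	/* one-way hint, set on first insert */
	mm->live_probes++;
	mm->flags &= ~MMF_RECALC_UPROBES; /* a successful insert makes the hint exact */
}
static void remove_probe(struct mm_sketch *mm)
{
	if (!(mm->flags & MMF_HAS_UPROBES))
		return;		/* can happen if registration failed half-way */
	mm->live_probes--;
	mm->flags |= MMF_RECALC_UPROBES; /* don't rescan now, only mark stale */
}
static void on_breakpoint_hit(struct mm_sketch *mm)
{
	/* lazy recalculation, mirroring mmf_recalc_uprobes() in the patch */
	if (mm->flags & MMF_RECALC_UPROBES) {
		mm->flags &= ~MMF_RECALC_UPROBES;
		if (mm->live_probes == 0)
			mm->flags &= ~MMF_HAS_UPROBES;
	}
	printf("hit: has_uprobes=%d\n", !!(mm->flags & MMF_HAS_UPROBES));
}
int main(void)
{
	struct mm_sketch mm = { 0, 0 };
	install_probe(&mm);
	on_breakpoint_hit(&mm);	/* has_uprobes=1 */
	remove_probe(&mm);
	on_breakpoint_hit(&mm);	/* has_uprobes=0 after the lazy rescan */
	return 0;
}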
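The reworked build_map_info() also shows a pattern worth calling out: a
sleeping allocation is not allowed while holding i_mmap_mutex, so the walk
allocates opportunistically with GFP_NOWAIT, counts the misses, drops the
lock, preallocates with GFP_KERNEL, and retries. Here is a userspace
approximation of that shape, with a pthread mutex standing in for
i_mmap_mutex and a randomly failing allocator standing in for GFP_NOWAIT;
all names are invented for illustration.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
struct map_info_sk {
	struct map_info_sk *next;
	long vaddr;
};
static pthread_mutex_t i_mmap_lock = PTHREAD_MUTEX_INITIALIZER;
/* stand-in for kmalloc(GFP_NOWAIT): an allocation that may fail under the lock */
static struct map_info_sk *alloc_atomic(void)
{
	return (rand() & 1) ? calloc(1, sizeof(struct map_info_sk)) : NULL;
}
static struct map_info_sk *build_list(const long *vaddrs, int n)
{
	struct map_info_sk *curr = NULL, *prev = NULL, *info;
	int more = 0, i;
 again:
	pthread_mutex_lock(&i_mmap_lock);
	for (i = 0; i < n; i++) {
		if (!prev && !more)
			prev = alloc_atomic();	/* optimistic; no harm if it fails */
		if (!prev) {
			more++;			/* remember how many we missed */
			continue;
		}
		info = prev;			/* consume a spare node */
		prev = prev->next;
		info->next = curr;
		curr = info;
		info->vaddr = vaddrs[i];
	}
	pthread_mutex_unlock(&i_mmap_lock);
	if (!more) {
		while (prev) {			/* drop unused spares */
			info = prev->next;
			free(prev);
			prev = info;
		}
		return curr;
	}
	/* recycle the partial list as spares, then top up with sleeping allocs */
	prev = curr;
	curr = NULL;
	do {
		info = malloc(sizeof(struct map_info_sk));
		if (!info)
			return NULL;		/* the real code returns ERR_PTR(-ENOMEM) */
		info->next = prev;
		prev = info;
	} while (--more);
	goto again;
}
int main(void)
{
	long vaddrs[] = { 0x1000, 0x2000, 0x3000 };
	struct map_info_sk *l = build_list(vaddrs, 3);
	for (; l; l = l->next)
		printf("vaddr %#lx\n", l->vaddr);
	return 0;
}
The kernel version additionally pins each mm with
atomic_inc_not_zero(&vma->vm_mm->mm_users) while building the list, so an
mm cannot vanish between the rmap walk and the breakpoint install; the
sketch above omits that refcounting.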
10 years, 8 months
Fedora Kernel Meeting Minutes Sep 14, 2012
by Josh Boyer
==============================
#fedora-meeting: Fedora Kernel
==============================
Meeting started by jwb at 18:00:09 UTC. The full logs are available at
http://meetbot.fedoraproject.org/fedora-meeting/2012-09-14/fedora-kernel....
.
Meeting summary
---------------
* roll call (jwb, 18:00:30)
* F18 (jwb, 18:01:45)
* F18 Alpha using 3.6-rc2 kernel. Will get -rc5/-rc6 pushed out soon
(jwb, 18:05:08)
* F18 GA will stick with the 3.6 release (jwb, 18:05:19)
* Kernel Summit/Linux Plumbers overview (jwb, 18:06:17)
* ACTION: jwb to clean up and send new modsigning patches to fedora
kernel list after testing (jwb, 18:09:36)
* kernel maintainers will start sending "patches we carry" emails to
the stable tree maintainers (jwb, 18:17:24)
* Open Floor (jwb, 18:18:38)
* we could still use active testing on older Fedora releases (jwb,
18:24:22)
Meeting ended at 18:44:12 UTC.
Action Items
------------
* jwb to clean up and send new modsigning patches to fedora kernel list
after testing
Action Items, by person
-----------------------
* jwb
* jwb to clean up and send new modsigning patches to fedora kernel
list after testing
* **UNASSIGNED**
* (none)
People Present (lines said)
---------------------------
* jwb (56)
* brunowolff (21)
* davej (11)
* jforbes (7)
* zodbot (3)
* codemaniac (1)
* adamw (1)
Generated by `MeetBot`_ 0.1.4
.. _`MeetBot`: http://wiki.debian.org/MeetBot
10 years, 8 months